Bug in ANOVA 2-ways? - Githubissues

giacomo-carugati commented 4 months ago

Salve soldati; preso da una malsana euforia, alla visione del seguente quesito, ricorrente negli esercizi su (M)ANOVA, mi son messo a ricavare manualmente le Sum of Squares della decomposizione della varianza, al fine di performare manualmente i test presenti nel summary.

Notando però che i risultati per il secondo treatment e l'interazione coi treatments (non per il treatment 1 e i residui) non combaciavano con quelli proposti nel summary, ho indagato più a fondo e ho scoperto che la funzione aov calcola le medie sui vari treatments in maniera inspiegabile.

A titolo d'esempio, questo è il codice che ho usato per la media sul secondo treatment: https://github.com/giuseppeegentile/AppStatExams/blob/bd2af3e0436c83247ffcf8c96b5bb1058ccd1ad5/2022/20220909/Pb1.R#L158 laddove treatment2 è stato ricavato nella seguente maniera: https://github.com/giuseppeegentile/AppStatExams/blob/bd2af3e0436c83247ffcf8c96b5bb1058ccd1ad5/2022/20220909/Pb1.R#L19

Il risultato è: https://github.com/giuseppeegentile/AppStatExams/blob/bd2af3e0436c83247ffcf8c96b5bb1058ccd1ad5/2022/20220909/Pb1.R#L296-L297

Con la chiamata a funzione model.tables(fit.aov2.int, type = "means")$tables$treatment2, ho estratto quelle ricavate da aov. Il risultato è: https://github.com/giuseppeegentile/AppStatExams/blob/bd2af3e0436c83247ffcf8c96b5bb1058ccd1ad5/2022/20220909/Pb1.R#L300-L301

Non ho trovato spiegazione a questo comportamento. Le cosa che si può notare è che siamo in un caso 2-ways con un design non bilanciato. Riporto i risultati completi:

https://github.com/giuseppeegentile/AppStatExams/blob/bd2af3e0436c83247ffcf8c96b5bb1058ccd1ad5/2022/20220909/Pb1.R#L287-L318

N.B. Per l'interazione ho fatto un fit con treatment1:treatment2 e uno con treatments; il summary è lo stesso, ma si può notare come nel primo caso le medie siano corrette, mentre nel secondo no (basti vedere il treatment OmnJur, di cui abbiamo solo 3 dati, e farne la media)

ScarpMarc commented 4 months ago

Arrivo eh sto aggiornando i miei template perchè puntualmente Box-Cox mi manca e mentre faccio sistemo

ScarpMarc commented 4 months ago

Ti confermo che ho lo stesso comportamento strano. Noto che la funzione model.tables, che peraltro non conoscevo, riporta la media totale giusta ed in effetti quello che fai con tapply mi torna se faccio a mano mean(predicted_v[which(factor_2==levels(factor_2)[1])]). Adesso ci smanetto per bene

giacomo-carugati commented 4 months ago

model.tables scoperta grazie ad una proficua conversazione con ChatGPT

ScarpMarc commented 4 months ago

Io sto avendo una proficua conversazione di StackOverflow Dumpo qua il testo di model.tables così nel mentre provo a capirlo

function (x, type = "effects", se = FALSE, cterms, ...) 
{
    if (inherits(x, "maov")) 
        stop("'model.tables' is not implemented for multiple responses")
    type <- match.arg(type, c("effects", "means", "residuals"))
    if (type == "residuals") 
        stop(gettextf("type '%s' is not implemented yet", type), 
            domain = NA)
    prjs <- proj(x, unweighted.scale = TRUE)
    if (is.null(x$call)) 
        stop("this fit does not inherit from \"lm\"")
    mf <- model.frame(x)
    factors <- attr(prjs, "factors")
    nf <- names(factors)
    dn.proj <- setNames(as.list(nf), nf)
    m.factors <- factors
    t.factor <- attr(prjs, "t.factor")
    vars <- colnames(t.factor)
    which <- match(vars, names(dn.proj))
    which <- which[!is.na(which)]
    dn.proj <- dn.proj[which]
    m.factors <- m.factors[which]
    if (!missing(cterms)) {
        if (anyNA(match(cterms, names(factors)))) 
            stop("'cterms' argument must match terms in model object")
        dn.proj <- dn.proj[cterms]
        m.factors <- m.factors[cterms]
    }
    if (type == "means") {
        dn.proj <- lapply(dn.proj, function(x, mat, vn) c("(Intercept)", 
            vn[(t(mat) %*% (as.logical(mat[, x]) - 1)) == 0]), 
            t.factor, vars)
    }
    tables <- make.tables.aovproj(dn.proj, m.factors, prjs, mf)
    n <- NULL
    for (xx in names(tables)) n <- c(n, replications(paste("~", 
        xx), data = mf))
    if (se) 
        if (is.list(n)) {
            message("Design is unbalanced - use se.contrast() for se's")
            se <- FALSE
        }
        else se.tables <- se.aov(x, n, type = type)
    if (type == "means" && "(Intercept)" %in% colnames(prjs)) {
        gmtable <- mean(prjs[, "(Intercept)"])
        class(gmtable) <- "mtable"
        tables <- c(`Grand mean` = gmtable, tables)
    }
    result <- list(tables = tables, n = n)
    if (se) 
        result$se <- se.tables
    attr(result, "type") <- type
    class(result) <- c("tables_aov", "list.of")
    result
}

ScarpMarc commented 4 months ago

Ok allora la funzione chiama

make.tables.aovproj <-
  function(proj.cols, mf.cols, prjs, mf, fun = "mean", prt = FALSE, ...)
  {
    tables <- setNames(vector("list", length(proj.cols)), names(proj.cols))
    for(i in seq_along(tables)) {
      terms <- proj.cols[[i]]
      terms <- terms[terms %in% colnames(prjs)]
      data <-
        if(length(terms) == 1L) prjs[, terms]
      else prjs[, terms] %*% as.matrix(rep.int(1, length(terms)))
      tables[[i]] <- tapply(data, mf[mf.cols[[i]]],
                            get(fun, mode="function"))
      class(tables[[i]]) <- "mtable"
      if(prt) print(tables[i], ..., quote = FALSE)
    }
    tables
  }

e secondo me qui è il problema, ora cerco di capirlo meglio e ti dico ma ho un'idea

ScarpMarc commented 4 months ago

(Nota: se vedi FUNC_ è un prefisso che ho messo io alle variabili interne della funzione per runnare a mano)

Allora forse e dico FORSE e dico FORSE potrebbe essere una questione di errori numerici? terms è così in questa ultima funzione utilizzando i=2, cioè il secondo fattore:

> terms [1] 
"(Intercept)" "factor_2"

E abbiamo che (EDIT: questi numerini sono la proiezione dei dati sui termini del nostro modello ANOVA):

> FUNC_prjs[, terms] 
    (Intercept)    factor_2
1      2.001893  0.14043652
2      2.001893 -0.07038641
3      2.001893 -0.06586846
4      2.001893 -0.07038641
5      2.001893 -0.07038641
6      2.001893 -0.06586846
7      2.001893 -0.07038641
8      2.001893  0.14043652
9      2.001893 -0.06586846
10     2.001893 -0.06586846
11     2.001893 -0.07038641
...

A questo punto lui prende questa roba e la moltiplica per un vettore [1,1] (sarebbe la riga dove fa if-else), per cui in teoria somma per riga, difatti data è così:

1   2.142330
2   1.931507
3   1.936025
4   1.931507
5   1.931507
6   1.936025
7   1.931507
8   2.142330
9   1.936025
10  1.936025
...

Ed infine lui fa tapply(data, FUNC_mf[FUNC_m.factors[[i]]], mean) dove

> FUNC_mf[FUNC_m.factors[[i]]]
             factor_2
1     Period-Jurassic
2   Period-Cretaceous
3   Period-Cretaceous
4   Period-Cretaceous
5   Period-Cretaceous
6   Period-Cretaceous
7   Period-Cretaceous
8     Period-Jurassic
9   Period-Cretaceous
10  Period-Cretaceous
11  Period-Cretaceous
12  Period-Cretaceous
13  Period-Cretaceous
14    Period-Jurassic

Ed esce fuori quel numerino sbagliato.

ScarpMarc commented 4 months ago

Forse ha senso, facendo help(proj), cioè il primo comando della funzione principale, dice:

proj returns a matrix or list of matrices giving the projections of the data onto the terms of a linear model. It is most frequently used for aov models.

Ed anche:

Each projection is a matrix with a row for each observations and either a column for each term (onedf = FALSE) or for each coefficient (onedf = TRUE). Projection matrices from the default method have orthogonal columns representing the projection of the response onto the column space of the Q matrix from the QR decomposition. The fitted values are the sum of the projections, and the sum of squares for each column is the reduction in sum of squares from fitting that column (after those to the left of it).

Per cui mi vien da pensare che questa discrepanza sia data da qualche errore numerico.

Update: in effetti resta da capire come mai questa cosa succeda solo con il factor 2 e non con l'1...

ScarpMarc commented 4 months ago

Rimarrei comunque su quella tesi. Se guardo queste proiezioni con il fattore 1, ho:

> FUNC_prjs[, FUNC_dn.proj[[1]]]
    (Intercept)   factor_1
1      2.001893  0.3165435
2      2.001893 -0.3894968
3      2.001893  0.3165435
4      2.001893 -0.3894968
5      2.001893 -0.3894968
6      2.001893  0.3165435
7      2.001893 -0.3894968
8      2.001893  0.3165435
9      2.001893  0.3165435
10     2.001893  0.3165435
11     2.001893 -0.3894968
12     2.001893 -0.3894968
13     2.001893 -0.3894968
14     2.001893 -0.3894968
15     2.001893 -0.3894968
16     2.001893  0.3165435
17     2.001893  0.3165435
18     2.001893  0.3165435
19     2.001893  0.3165435
...

Mentre con il fattore 2:

> FUNC_prjs[, FUNC_dn.proj[[2]]]
    (Intercept)    factor_2
1      2.001893  0.14043652
2      2.001893 -0.07038641
3      2.001893 -0.06586846
4      2.001893 -0.07038641
5      2.001893 -0.07038641
6      2.001893 -0.06586846
7      2.001893 -0.07038641
8      2.001893  0.14043652
9      2.001893 -0.06586846
10     2.001893 -0.06586846
11     2.001893 -0.07038641
12     2.001893 -0.07038641
13     2.001893 -0.07038641
14     2.001893  0.13591858
15     2.001893 -0.07038641
16     2.001893 -0.06586846
17     2.001893  0.14043652
18     2.001893 -0.06586846
19     2.001893  0.14043652

E qui si vede che in effetti se facciamo una flatten di questa roba, nel primo caso la seconda colonna ha un certo impatto sulla somma per riga, mentre nel secondo caso ci sono valori minuscoli. In poche parole penso semplicemente che abbiamo beccato un caso fortuito in cui la proiezione QR (non ho idea di perchè o percome la faccia eh, parole dell'help) fa schifo.

giacomo-carugati commented 4 months ago

E abbiamo che (questi sono i parametri dell'ANOVA?):

> FUNC_prjs[, terms] 
    (Intercept)    factor_2
1      2.001893  0.14043652
2      2.001893 -0.07038641
3      2.001893 -0.06586846
4      2.001893 -0.07038641
5      2.001893 -0.07038641
6      2.001893 -0.06586846
7      2.001893 -0.07038641
8      2.001893  0.14043652
9      2.001893 -0.06586846
10     2.001893 -0.06586846
11     2.001893 -0.07038641
...

Io noterei qui le prime cose strane: ti direi che sì, questi sono i parametri dell'ANOVA. Problema: riferendomi alla colonna di _factor2, ne vedo (almeno) 3, mentre b=2 e i parametri corretti dovrebbero essere quelli dati da tapply(target, treatment2, mean) - mean(target), ovvero -0.07867111 per Cretaceous e 0.17492752 per Jurassic.

La tua conclusione sull'errore numerico e il caso fortuito della decomposizione QR direi che è soddisfacente; ad elaborare oltre ci sarebbe da diventar matti. Magari con ulteriori controlli per quando si ha tempo si può pensare di riportare il possibile bug a chi di dovere.

ScarpMarc commented 4 months ago

Sì è vero, in effetti non combacia con quello che dovrebbe venire (vado a memoria di ieri, ci sono 8 valori unici nella colonna) e sì pensiamoci più avanti, già di base io faccio tutto dal summary ahahah

giuseppeegentile / AppStatExams

Bug in ANOVA 2-ways? #27