Closed jtufto closed 1 year ago
Thanks for reaching out! I’ll look into it.
I think I found a solution, but I'm curious to hear from your perspective if I did it statistically right.
The calculation of the statistic was:
STATISTIC <- 2 * sum(x * log(x / E), na.rm = TRUE)
# (for stats::chisq.test() this is: sum((abs(x - E) - YATES)^2/E)
where E is the matrix:
#> [,1] [,2]
#> [1,] 1.4 0.6
#> [2,] 5.6 2.4
# to recreate:
E <- matrix(c(0.3566749, -0.1133287, -Inf, 0.2231436), nrow = 2)
So then removing NAs from the calculation returns a statistic, but the log is of course not working for the -Inf
:
# previously
2 * sum(x * log(x / E))
#> [1] NaN
#> Warning message:
#> In log(x/E) : NaNs produced
# with na.rm = TRUE added
2 * sum(x * log(x / E), na.rm = TRUE)
#> [1] 22.48762
#> Warning message:
#> In log(x/E) : NaNs produced
So can we determine the statistic this way, or should we cope a different way with the (negative) infinite value in E?
Yes, that should work I think. The LRT statistic involves the maximum likelihood under a saturated model where all the probabilities are estimated as $\hat p{ij}=x{ij}/n$. The likelihood contribution from cells with zero counts is therefore $\hat p{ij}^{x{ij}}=0^0=1$ and the log-likelihood contribution is $\log(0^0)=\log(1)=0$ so just removing the term when you get NaN
should do the trick.
The g.test function in your package fails when some of the counts are zero. Example: