ENH hypothesis tests, confint for multinomial proportions (exact)

companion to issue #2895

Cai, Yong, and K. Krishnamoorthy. "Exact size and power properties of five tests for multinomial proportions." Communications in Statistics - Simulation and Computation 35, no. 1 (2006): 149-160. (Only 7 citations in Google Scholar, but it looks good to me.)

The probability space for the multinomial distribution is too large for exact calculation except in very small samples. For box/interval probabilities there are some faster algorithms, but they don't look easy to implement. (At least, I didn't try to figure out the details of the algorithms in two recent articles.)
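To make the size problem concrete: with n observations and k bins there are C(n + k - 1, k - 1) possible count vectors. Below is a minimal sketch of an exact p-value by brute-force enumeration, assuming the Pearson chisquare statistic as the ordering criterion. The function names are hypothetical and this is not the faster algorithm from the articles, just the naive version that only works for tiny samples.

```python
from math import comb, factorial, prod


def compositions(n, k):
    """Yield all tuples of k non-negative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def multinomial_pmf(x, probs):
    """Multinomial probability of the count vector x."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)
    return coef * prod(p ** xi for p, xi in zip(probs, x))


def pearson_stat(x, expected):
    return sum((xi - e) ** 2 / e for xi, e in zip(x, expected))


def exact_chisquare_pvalue(counts, probs):
    """Exact p-value: total probability of all outcomes whose chisquare
    statistic is at least as large as the observed one."""
    n = sum(counts)
    expected = [n * p for p in probs]
    stat_obs = pearson_stat(counts, expected)
    pvalue = 0.0
    for x in compositions(n, len(probs)):
        # small tolerance so that ties are not lost to rounding
        if pearson_stat(x, expected) >= stat_obs - 1e-12:
            pvalue += multinomial_pmf(x, probs)
    return stat_obs, pvalue


# number of points in the sample space grows quickly with n and k
for n, k in [(10, 3), (50, 5), (200, 10)]:
    print(n, k, comb(n + k - 1, k - 1))

print(exact_chisquare_pvalue([6, 1, 1], [1 / 3, 1 / 3, 1 / 3]))
```

The loop visits every point of the sample space, so the cost grows with that binomial coefficient, which is why this is only feasible in very small samples.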

As alternatives:

The references in #2895, by Sison and Glaz and by May and Johnson, use an approximation that is based on a truncated Poisson distribution and an Edgeworth expansion. My implementation works for some examples but not yet for others. It calculates box probabilities, and the resulting confints are appropriate if the probabilities are roughly the same across bins.
Monte Carlo probabilities: That's much easier to implement, seems to work very well, and is not very slow. It is also more general than just having box probabilities and can be used for p-values based on a test statistic. Cai and Krishnamoorthy use exact calculation based on the chisquare test for small samples, and recommend Monte Carlo for larger samples. In small samples, the Monte Carlo p-values with 100,000 replications also agree with the exact p-values to about 3 decimals.
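As an illustration of the Monte Carlo option, here is a minimal sketch of a simulated p-value for the chisquare statistic. The function name, the default of 100,000 replications, and the plain proportion estimator (without a "+1" correction) are assumptions made for the example, not the code I have.

```python
import numpy as np


def mc_chisquare_pvalue(counts, probs, n_rep=100_000, seed=None):
    """Monte Carlo p-value for the Pearson chisquare test of
    H0: the multinomial probabilities are equal to `probs`."""
    counts = np.asarray(counts)
    probs = np.asarray(probs, dtype=float)
    n = int(counts.sum())
    expected = n * probs
    stat_obs = ((counts - expected) ** 2 / expected).sum()

    rng = np.random.default_rng(seed)
    # each row is one simulated count vector under H0, shape (n_rep, k)
    sims = rng.multinomial(n, probs, size=n_rep)
    stat_sim = ((sims - expected) ** 2 / expected).sum(axis=1)

    # fraction of simulated statistics at least as large as the observed one
    # (small tolerance so that ties are not lost to rounding)
    pvalue = np.mean(stat_sim >= stat_obs - 1e-12)
    return stat_obs, pvalue


stat, pvalue = mc_chisquare_pvalue([6, 1, 1], [1 / 3, 1 / 3, 1 / 3], seed=12345)
print(stat, pvalue)
```

The simulation error of the estimated p-value is roughly sqrt(p (1 - p) / n_rep), which is at most about 0.0016 with 100,000 replications.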

Related: They also have the "Nass test", which is the chisquare test with a corrected distribution (scaled chisquare with adjusted degrees of freedom) and which does very well in small, but not tiny, samples. ("Small" in the multinomial chisquare test refers to the expected number of observations in each bin.) Also, compared to the binomial proportion case, both the standard chisquare test and the exact test work better if there are more bins, with smaller liberal and conservative deviations from size, respectively. The LR (Q) test has a quite distorted size.

Status: I wrote some functions that mostly work. I'm using a semi-generic function to calculate multinomial probabilities by simulation based on an indicator callback function.
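Roughly, the indicator-callback idea could look like the following sketch. The names `multinomial_prob_mc` and `box_indicator` are hypothetical and the signatures are only an assumption about how such a helper might be organized.

```python
import numpy as np


def multinomial_prob_mc(n, probs, indicator, n_rep=100_000, seed=None):
    """Estimate P(indicator(X) is True) for X ~ Multinomial(n, probs).

    `indicator` receives the simulated counts as an array of shape
    (n_rep, k) and returns a boolean array of length n_rep.
    """
    rng = np.random.default_rng(seed)
    sims = rng.multinomial(n, probs, size=n_rep)
    return np.mean(indicator(sims))


def box_indicator(lower, upper):
    """Indicator for the box event lower <= X <= upper in every bin."""
    lower = np.asarray(lower)
    upper = np.asarray(upper)

    def indicator(sims):
        return np.all((sims >= lower) & (sims <= upper), axis=1)

    return indicator


# box probability: every bin count between 5 and 15 for n=30, 3 equal bins
prob_box = multinomial_prob_mc(
    30, [1 / 3, 1 / 3, 1 / 3], box_indicator([5, 5, 5], [15, 15, 15]), seed=0
)
print(prob_box)
```

Any event that can be written as a function of the simulated counts fits the same callback, for example a test statistic exceeding its observed value, so p-values and box probabilities can share one simulation routine.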

(I haven't looked yet at what R packages are doing in this area.)
