markrogoyski / math-php

Powerful modern math library for PHP: Features descriptive statistics and regressions; Continuous and discrete probability distributions; Linear algebra with matrices and vectors, Numerical analysis; special mathematical functions; Algebra
MIT License

Studentized Range Distribution CDF #158

Open markrogoyski opened 8 years ago

markrogoyski commented 8 years ago

To add Tukey's Range Test of statistical significance, it seems that we need the Studentized Range Distribution CDF.

Tukey's Range Test, Studentized Range, Studentized Range Distribution

I have not been able to find a lot of details on how to actually compute the CDF. I've seen some mentions of old Fortran algorithms and some complex approximations, but I haven't seen anything that I would consider the definitive method to calculate this.

Is anyone familiar with how to compute the CDF of this distribution, or could point out some reference that has a method that is considered 'the right way' to do it?

Thanks.

Beakerboy commented 8 years ago

I've been referencing the "Numerical Recipes" book, Boost, and this: http://www.stat.rice.edu/~dobelman/textfiles/DistributionsHandbook.pdf

There's nothing in any of those sources. Is this helpful?

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Tukey.html

Beakerboy commented 8 years ago

Fortran code: http://lib.stat.cmu.edu/apstat/190

markrogoyski commented 8 years ago

Yeah, I ran into that Fortran code as well. I suppose, as a first attempt, I could try to implement that in PHP and see if the output resembles the distribution. PHP does have the goto operator after all =) http://php.net/manual/en/control-structures.goto.php

Beakerboy commented 8 years ago

I think this is it: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177705684

markrogoyski commented 8 years ago

Cool. That's a good reference since it has all the tables precomputed.

I just skimmed through it, but this line stood out: "The method of calculation of the probability integral ... will not be included here."

Hopefully there is enough information to figure this out. It's strange that Tukey's Range Test is fairly common, yet the distribution used to compute it has very little information about it online.

Beakerboy commented 8 years ago

It does have the equation to calculate the moments. Can you use that to figure out the PDF?

Beakerboy commented 8 years ago

https://www.jstor.org/stable/2332134?seq=1#page_scan_tab_contents

markrogoyski commented 8 years ago

On that document's page, p(x) looks like the Normal distribution PDF. Then it states: "No simple expressions exists for the probability law fn(w) of w..."

I'm not sure how to use this information to code something up. All the other distributions are on Wikipedia with nice formulas =)

Beakerboy commented 8 years ago

Did you look at page 309, after the first set of tables? I think that explains it as a double integral. When n equals two, it somehow reduces to the standard normal distribution.

markrogoyski commented 8 years ago

Ahh, I didn't realize there were more pages. OK. So if I register I can view the entire article. Thanks for pointing that out.

Beakerboy commented 8 years ago

Yes, register and it's free, although the scan quality is pretty poor. I'm having a hard time reading some of the superscripts.

Beakerboy commented 8 years ago

Is this helpful? https://en.wikipedia.org/wiki/Range_(statistics)#Distribution

Beakerboy commented 8 years ago

I think I figured this out. I used the formula in the above "Range" Wikipedia article, where the distribution was the standard normal distribution. The critical values seem to agree with a Tukey table with infinite degrees of freedom.

https://docs.google.com/a/uwalumni.com/spreadsheets/d/13M2Z4F6tTE0VVVLvGdynwBdBekpFLoBE5KTTCs3JcrI/edit?usp=sharing

Edit: However, replacing it with a t distribution does not seem to agree with non-infinite df values.
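For anyone who wants to reproduce the infinite-df check without a spreadsheet, here is a minimal Python sketch. It assumes the formula in the "Range" article is F(w) = k ∫ φ(x) [Φ(x + w) − Φ(x)]^(k−1) dx for k standard normal samples, and it integrates with a plain composite Simpson's rule. The function names (`range_cdf_normal`, `simpson`), the ±8 integration limits, and the step count are illustrative choices, not anything taken from the spreadsheet or from math-php.

```python
import math

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simpson(f, a, b, n=400):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def range_cdf_normal(w, k):
    """P(range of k iid standard normals <= w).

    x plays the role of the sample minimum; the other k - 1 values
    must fall in [x, x + w]. The normal density is negligible
    outside [-8, 8], so the infinite limits are truncated there.
    """
    if w <= 0.0:
        return 0.0
    integrand = lambda x: phi(x) * (Phi(x + w) - Phi(x)) ** (k - 1)
    return k * simpson(integrand, -8.0, 8.0)
```

For k = 2 the range is |X1 − X2|, so the result can be checked against the closed form 2Φ(w/√2) − 1. Evaluating `range_cdf_normal(2.772, 2)`, where 2.772 is the tabulated 5% critical value for k = 2 with infinite df, should give a value close to 0.95.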

Beakerboy commented 8 years ago

Here's a question I posted on stack exchange: http://stats.stackexchange.com/questions/235785/calculate-the-critical-value-of-tukey-q/235979#235979

markrogoyski commented 8 years ago

Thanks for continuing to look into this. Hopefully someone other than you answers your Stack Exchange question.

With so little information available, I wonder if the online ANOVA calculators that do the Tukey's Range test are just using pre-computed tables.

markrogoyski commented 8 years ago

I think I found the R implementation for these functions:

ptukey: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/nmath/ptukey.c

qtukey: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/nmath/qtukey.c

Beakerboy commented 8 years ago

When you figure out what they do on a theoretical level, I'd love to know.

Beakerboy commented 8 years ago

Thinking this over...the studentized range is supposed to include the standard deviation of the samples. When the number of samples (df) approaches infinity, this estimate of s will approach one. I'm assuming there's a missing factor to correct for the estimation of s from the samples somewhere: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

Beakerboy commented 8 years ago

I think I figured it out: it's a two-tailed test. I don't know how to modify my integral to account for that, though. If you take a chart of critical t values for a two-tailed test at .05 and multiply by sqrt(2), it will match a Tukey chart of critical q with k=2 and the same df.
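The sqrt(2) factor can be read off the definitions. This is only a sketch, assuming two groups of equal size n with means ȳ₁ and ȳ₂ and a pooled standard deviation s on ν degrees of freedom (the symbols are my own, not from the tables):

```latex
q = \frac{|\bar{y}_1 - \bar{y}_2|}{s/\sqrt{n}}, \qquad
t = \frac{\bar{y}_1 - \bar{y}_2}{s\sqrt{2/n}}
\quad\Longrightarrow\quad q = \sqrt{2}\,|t|

P(q \le c) = P\!\left(|t| \le \frac{c}{\sqrt{2}}\right)
\quad\Longrightarrow\quad
q_{\alpha;\,2,\,\nu} = \sqrt{2}\; t_{1-\alpha/2;\,\nu}
```

The absolute value comes from the range being max minus min, which is why the one-sided q critical value lines up with a two-tailed t value.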

markrogoyski commented 8 years ago

Once we have it all figured out, someone should write a blog post or update the Wikipedia page. It would end up being the definitive online source for this distribution.

Beakerboy commented 8 years ago

I started an article on Wikipedia: https://en.wikipedia.org/wiki/Studentized_range_distribution

Beakerboy commented 8 years ago

If you click "Next item" a few times in the Biometrika article above, to page 334, there's another technical article that is probably helpful.

The Range in Random Samples H. O. Hartley Biometrika Vol. 32, No. 3/4 (Apr., 1942), pp. 334-348

Beakerboy commented 8 years ago

...And I think I finally found the generalized equation. It's in the wiki article. I'm trying to verify this PDF by numerically integrating it to the CDF using a t distribution for f(q).

markrogoyski commented 8 years ago

Wow. Great work on the wiki article. Thanks for doing this.

Beakerboy commented 7 years ago

Here's the literature source for the Fortran code from above. I think this fills in some of the blanks on what it is actually numerically integrating. http://www.jstor.org/stable/2347300?seq=1#page_scan_tab_contents

Beakerboy commented 7 years ago

Here's a paper with the same equation in a different form. I'm still trying to figure out how this formula arises. I think I have an intuitive sense of the inner integral, but I have to figure out why the outer one estimates the standard deviation. Is it related to the Chi-Squared distribution somehow? http://link.springer.com/article/10.3758/BF03202264
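As a rough check of the two-integral form, here is a Python sketch under my own reading of how the pieces fit, not a transcription of the Fortran or R code. The inner integral is the CDF of the range of k standard normals. The outer integral averages that CDF over the distribution of s = sqrt(χ²_df / df), the estimated standard deviation in units of σ, which is where the Chi-Squared distribution would come in. All names, integration limits, and step counts are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def range_cdf_normal(w, k):
    """Inner integral: P(range of k iid standard normals <= w)."""
    if w <= 0.0:
        return 0.0
    integrand = lambda x: phi(x) * (Phi(x + w) - Phi(x)) ** (k - 1)
    return k * simpson(integrand, -8.0, 8.0)

def s_pdf(s, df):
    """Density of s = sqrt(chi2_df / df), computed in log space."""
    if s <= 0.0:
        return 0.0
    log_c = (0.5 * df * math.log(df) - math.lgamma(0.5 * df)
             - (0.5 * df - 1.0) * math.log(2.0))
    return math.exp(log_c + (df - 1.0) * math.log(s) - 0.5 * df * s * s)

def studentized_range_cdf(q, k, df):
    """Outer integral: average the normal-range CDF over the density of s.

    Given s, the event {range / s <= q} is {range <= q * s}. The upper
    limit is a heuristic cutoff (assumption) beyond which s_pdf is tiny.
    """
    if q <= 0.0:
        return 0.0
    upper = 1.0 + 8.0 / math.sqrt(2.0 * df)
    outer = lambda s: s_pdf(s, df) * range_cdf_normal(q * s, k)
    return simpson(outer, 0.0, upper)
```

With k = 2 this should reproduce the two-tailed t relationship: `studentized_range_cdf(2.228 * math.sqrt(2), 2, 10)` should come out near 0.95, since 2.228 is the two-tailed 5% critical t value for 10 df. As df grows, s_pdf concentrates around 1 and the result approaches the plain normal-range CDF, which matches the infinite-df tables.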

markrogoyski commented 7 years ago

Cool. Thanks for finding and sharing the Fortran code.

Beakerboy commented 7 years ago

I have something in the works on this if you would like to putz with it: https://github.com/Beakerboy/math-php/blob/StudentizedRange/src/Probability/Distribution/Continuous/StudentizedRange.php

markrogoyski commented 7 years ago

Thanks for continuing to work on this!