markrogoyski / math-php

Powerful modern math library for PHP: features descriptive statistics and regressions; continuous and discrete probability distributions; linear algebra with matrices and vectors; numerical analysis; special mathematical functions; algebra
MIT License

Quartiles calculation in Stats Descriptive #454

Closed davidjr82 closed 2 years ago

davidjr82 commented 2 years ago

Having this array:

$numbers = [0,900,1800,2700,3600,4500];

And having, therefore, these percentile ranks for its values (inclusive, range 0..1):

[
     0 => 0,
     900 => 0.2,
     1800 => 0.4,
     2700 => 0.6,
     3600 => 0.8,
     4500 => 1,
]

If I ask for the quartiles, I expect the first quartile (just as an example) to fall between 900 and 1800 (the value at rank 0.25 should be 1125), but it is 900.

$quartiles = MathPHP\Statistics\Descriptive::quartilesInclusive([0,900,1800,2700,3600,4500]);

// unexpected result
[
     "0%" => 0,
     "Q1" => 900.0, // UNEXPECTED, SHOULD BE 1125
     "Q2" => 2250.0,
     "Q3" => 3600.0,
     "100%" => 4500,
     "IQR" => 2700.0,
]

Q1 should have the same value as the 25th percentile, but it has the value of the 20th percentile (Wikipedia Quartile definition).
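The 1125 figure follows if the 25th percentile is taken by linear interpolation between the percentile ranks listed earlier. A minimal sketch of that arithmetic, assuming the sorted value at index i has rank i / (n - 1); interpolatedPercentile is a hypothetical helper written for illustration, not a MathPHP function:

```php
<?php
// Sketch only: linear interpolation between closest ranks, where the
// sorted value at index i has percentile rank i / (n - 1).
// interpolatedPercentile() is a hypothetical helper, not part of MathPHP.
function interpolatedPercentile(array $values, float $p): float
{
    sort($values);
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one value.');
    }

    // Fractional index into the sorted list for a rank p in [0, 1].
    $position = $p * ($n - 1);
    $lower    = (int) floor($position);
    $upper    = (int) ceil($position);
    $fraction = $position - $lower;

    // Move the fractional distance from the lower neighbour to the upper one.
    return $values[$lower] + $fraction * ($values[$upper] - $values[$lower]);
}

$numbers = [0, 900, 1800, 2700, 3600, 4500];
echo interpolatedPercentile($numbers, 0.25), PHP_EOL; // prints 1125
```

Here rank 0.25 maps to index 1.25, so the result is a quarter of the way from 900 to 1800.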

Is this a bug, or is there something I am missing?

Thanks!

markrogoyski commented 2 years ago

Hi @davidjr82,

Thank you for your interest in MathPHP.

Quartiles, unfortunately, do not have a single standard way to compute them. In R for instance, there are nine different variations. Excel has two. The Wikipedia article shows four. MathPHP's documentation for quartilesInclusive indicates it uses the "Tukey's hinges" quartile method, which is "method 2" in the Wikipedia article.

Method 2: Use the median to divide the ordered data set into two-halves. If there are an odd number of data points in the original ordered data set, include the median (the central value in the ordered list) in both halves. If there are an even number of data points in the original ordered data set, split this data set exactly in half. The lower quartile value is the median of the lower half of the data. The upper quartile value is the median of the upper half of the data. The values found by this method are also known as "Tukey's hinges";[4] see also midhinge.

Using your dataset, [0,900,1800,2700,3600,4500], and computing Wikipedia's method 2 by hand:

Use the median to divide the ordered data set into two-halves.

There is an even number of values, so the median is the average of the two middle values, 1800 and 2700, which is 2250.

If there are an even number of data points in the original ordered data set, split this data set exactly in half.

Lower half = [0, 900, 1800]
Upper half = [2700, 3600, 4500]

The lower quartile value is the median of the lower half of the data

The median of [0, 900, 1800] is 900.

The upper quartile value is the median of the upper half of the data.

The median of [2700, 3600, 4500] is 3600.

This matches the result MathPHP provides.
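The hand computation above can be sketched in plain PHP. This is an illustration of the quoted method 2 steps only; medianOfSorted and tukeyHinges are hypothetical helper names, and the code is not MathPHP's actual implementation:

```php
<?php
// Sketch of the quoted "method 2" (Tukey's hinges) steps.
// medianOfSorted() and tukeyHinges() are hypothetical helpers, not MathPHP code.
function medianOfSorted(array $sorted): float
{
    $n   = count($sorted);
    $mid = intdiv($n, 2);

    // Odd count: the central value. Even count: mean of the two central values.
    return $n % 2 === 1
        ? (float) $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

function tukeyHinges(array $values): array
{
    sort($values);
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one value.');
    }

    $half = intdiv($n, 2);
    // Odd count: the median is included in both halves.
    // Even count: the data set is split exactly in half.
    $lowerSize = $n % 2 === 1 ? $half + 1 : $half;
    $lower     = array_slice($values, 0, $lowerSize);
    $upper     = array_slice($values, $half);

    return [
        'Q1' => medianOfSorted($lower),
        'Q2' => medianOfSorted($values),
        'Q3' => medianOfSorted($upper),
    ];
}

// Q1 = 900, Q2 = 2250, Q3 = 3600 for the dataset from this issue.
print_r(tukeyHinges([0, 900, 1800, 2700, 3600, 4500]));
```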

Also for reference, there are multiple quartile methods in R which give the same result:

> quantile(c(0, 900, 1800, 2700, 3600, 4500), type=2)
  0%  25%  50%  75% 100% 
   0  900 2250 3600 4500 

> quantile(c(0, 900, 1800, 2700, 3600, 4500), type=5)
  0%  25%  50%  75% 100% 
   0  900 2250 3600 4500
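To see why an interpolating method can still land on 900, here is a rough PHP sketch of the rule R documents for type=5, as I understand it: the k-th sorted value (1-based) is placed at probability (k - 0.5) / n, with linear interpolation in between. quantileType5 is a hypothetical helper for illustration, not R's or MathPHP's code:

```php
<?php
// Sketch of a piecewise-linear quantile in which the k-th sorted value
// (1-based) sits at probability (k - 0.5) / n, the rule R documents for type = 5.
// quantileType5() is a hypothetical helper written for this illustration.
function quantileType5(array $values, float $p): float
{
    sort($values);
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one value.');
    }

    // 1-based fractional position, clamped to the ends of the list.
    $h = min(max($n * $p + 0.5, 1.0), (float) $n);
    $j = (int) floor($h);
    $g = $h - $j;

    // 0-based index of the next value up, clamped at the last element.
    $next = min($j, $n - 1);

    return $values[$j - 1] + $g * ($values[$next] - $values[$j - 1]);
}

$numbers = [0, 900, 1800, 2700, 3600, 4500];
foreach ([0.25, 0.5, 0.75] as $p) {
    echo $p, ' => ', quantileType5($numbers, $p), PHP_EOL;
}
```

With six values, p = 0.25 gives position 6 * 0.25 + 0.5 = 2 exactly, so the second value, 900, is returned with no interpolation.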

Keep in mind that there is also a Descriptive::percentile function, which uses a more "standard" definition, if that is what you are looking for.

Descriptive::percentile([0,900,1800,2700,3600,4500], 25)  // 1125