markrogoyski / math-php

Powerful modern math library for PHP: Features descriptive statistics and regressions; Continuous and discrete probability distributions; Linear algebra with matrices and vectors, Numerical analysis; special mathematical functions; Algebra
MIT License

Truncated Mean Percentage #440

Closed ktorktor closed 2 years ago

ktorktor commented 2 years ago

Hi,

Thank you for making such a great math library!

I noticed a small error when calculating the truncated mean: when the percentage is set to 25%, it actually calculates the truncated mean with a percentage of 50%.

A quick fix would be to change line 510 in Average.php to:

$trim_count = \floor($n * ($trim_percent / 2 / 100));

markrogoyski commented 2 years ago

Hi @ktorktor,

Thanks for your interest in MathPHP and your kind words.

Can you give an example input and expected output of where you think the library is miscalculating the truncated mean?

Thanks, Mark

ktorktor commented 2 years ago

Let's take the array:

6,4,2,4,3,7,6,33,77,22,3,5,6,5,0,2,3,4,6

If I set the percentage to 50%, it should take out 25% from one end and 25% from the other end.

There are 19 numbers, so it should take out 8 numbers: 4 from one end and 4 from the other.

The sorted array is:

0,2,2,3,3,3,4,4,4,5,5,6,6,6,6,7,22,33,77

After removing 4 values from each end, the remaining 11 values are:

3,3,4,4,4,5,5,6,6,6,6

(3+3+4+4+4+5+5+6+6+6+6)/11 = 4.7272727273

I get this result when I pass 25% to the truncatedMean function, but I think this percentage should be the percentage of the total numbers removed, not the percentage removed from one end.

From Wikipedia: "This number of points to be discarded is usually given as a percentage of the total number of points" https://en.wikipedia.org/wiki/Truncated_mean

From Investopedia: "A trimmed mean is stated as a mean trimmed by x%, where x is the sum of the percentage of observations removed from both the upper and lower bounds. The trimming points are often arbitrary in that they follow rules of thumb rather than some optimized method of setting those thresholds. For example, a trimmed mean of 3% would remove the lowest and highest 3% of values, leaving the mean to be calculated from the 94% of remaining data." https://www.investopedia.com/terms/t/trimmed_mean.asp
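To make the two readings concrete, here is a minimal Python sketch (not MathPHP code) of how many values each interpretation removes per end for the 19-value data set above. The function names are hypothetical; the second formula mirrors the quick fix proposed earlier in this thread.

```python
import math

def per_end_count(n, trim_percent):
    """Values dropped from EACH end when trim_percent applies to each end."""
    return math.floor(n * (trim_percent / 100))

def total_split_count(n, trim_percent):
    """Values dropped from EACH end when trim_percent is the total, split evenly."""
    return math.floor(n * (trim_percent / 2 / 100))

n = 19  # size of the example data set
print(per_end_count(n, 25))      # 4 per end, 8 removed in total
print(total_split_count(n, 50))  # 4 per end, 8 removed in total
print(per_end_count(n, 50))      # 9 per end, 18 removed in total
```

Under these assumptions, "25% per end" and "50% in total" describe the same trim for this data set, which is why both lead to 4.7272727273.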

Hope it helps.

markrogoyski commented 2 years ago

Hi @ktorktor,

Thanks for providing an example.

Here is the result of your data set in three different math libraries: MathPHP, R, and NumPy/SciPy.

MathPHP

php > $nums = [6,4,2,4,3,7,6,33,77,22,3,5,6,5,0,2,3,4,6];
php > echo \MathPHP\Statistics\Average::truncatedMean($nums, 50);
5
php > echo \MathPHP\Statistics\Average::truncatedMean($nums, 25);
4.7272727272727

R

> nums <- c(6,4,2,4,3,7,6,33,77,22,3,5,6,5,0,2,3,4,6)
> mean(nums, trim=0.50)
[1] 5
> mean(nums, trim=0.25)
[1] 4.727273

Python (SciPy)

In [5]: from scipy import stats

In [6]: nums = [6,4,2,4,3,7,6,33,77,22,3,5,6,5,0,2,3,4,6]

In [7]: stats.trim_mean(nums, 0.50)
Out[7]: 5.0

In [8]: stats.trim_mean(nums, 0.25)
Out[8]: 4.7272727272727275

All three libraries seem to agree on the answer. I also checked with this online calculator: https://www.easycalculation.com/statistics/trimmed-mean.php and it agreed with the results.

I think this is just an issue of semantics. Does 25% mean 25% from each side, or 25% of all numbers? It seems to mean the former in most programmatic usages.

It looks like there is no issue with the calculation, and it is in alignment with other mathematics software libraries. However, I do think there is a bounds issue if you go above 50%. I'll update the code to provide a better error message.
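As an illustration of what such a bounds check could look like, here is a minimal Python sketch of a per-end truncated mean that rejects percentages above 50. It is not the MathPHP implementation: the function name, the error messages, and the choice to return the median at exactly 50% are assumptions, the last one chosen because it matches the value 5 that all three libraries returned above.

```python
from statistics import median

def truncated_mean(values, trim_percent):
    """Mean after dropping trim_percent of the values from EACH end."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= trim_percent <= 50:
        raise ValueError(
            f"trim_percent must be between 0 and 50 (per end); got {trim_percent}"
        )
    if trim_percent == 50:
        # Assumption: trimming 50% from each end is treated as the median.
        return median(values)
    ordered = sorted(values)
    k = int(len(ordered) * trim_percent / 100)  # values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

nums = [6, 4, 2, 4, 3, 7, 6, 33, 77, 22, 3, 5, 6, 5, 0, 2, 3, 4, 6]
print(truncated_mean(nums, 25))  # about 4.7273
print(truncated_mean(nums, 50))  # 5
```

Calling `truncated_mean(nums, 60)` in this sketch raises a `ValueError` that names the allowed range instead of returning a misleading number.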

Let me know if you have any other questions or comments about this. Thanks. Mark

ktorktor commented 2 years ago

Thank you, Mark. I think this is just an issue of semantics, as you said; there is nothing wrong with the calculation. I used TRIMMEAN from Google Sheets and Excel, where, for the same set of numbers as above, =TRIMMEAN(range, 0.5) results in 4.73.
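For comparison, the spreadsheet convention can be sketched in Python as follows. This is an approximation based on the documented behaviour of Excel's TRIMMEAN as understood here (the fraction is the total share of points to exclude, and the excluded count is rounded down to the nearest multiple of 2); the function name is hypothetical and edge cases may differ from the real spreadsheet functions.

```python
def spreadsheet_trimmean(values, fraction):
    """Mean after excluding `fraction` of all points, split evenly between both ends."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= fraction < 1:
        raise ValueError("fraction must satisfy 0 <= fraction < 1")
    ordered = sorted(values)
    n = len(ordered)
    excluded = int(n * fraction)  # total points to exclude
    excluded -= excluded % 2      # round down to a multiple of 2
    k = excluded // 2             # points dropped per end
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

nums = [6, 4, 2, 4, 3, 7, 6, 33, 77, 22, 3, 5, 6, 5, 0, 2, 3, 4, 6]
print(round(spreadsheet_trimmean(nums, 0.5), 2))  # 4.73
```

With 19 values and a fraction of 0.5, this excludes 8 points (4 per end), which is the same trim that a per-end percentage of 25 produces in the libraries compared above.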

It seems there isn't a well-established standard for this, or else Microsoft first, and then Google, got it wrong (which, without irony, is possible). It would help to say in the doc comment that this is the percentage cut from each end.

@param int $trim_percent Percent between 0-99

should be

@param int $trim_percent Percent between 0-49

markrogoyski commented 2 years ago

This has been addressed in the latest version v2.6.0. Thanks again for reporting this issue.