vnmabus / dcor

Distance correlation and related E-statistics in Python

https://dcor.readthedocs.io

MIT License

145 stars 26 forks source link

Add configerable average function #23

Closed multimeric closed 3 years ago

multimeric commented 3 years ago

Closes #22, see discussion there.

Uncertainties:

Have I covered all public APIs, ensuring they can all be configured?
The test statistic ends up being negative, and therefore with a p-value of 1 when used to compare a standard normal and t distribution in the test_different_distributions. Does this make sense, or is it revealing a flaw in the code somewhere?

codecov[bot] commented 3 years ago

Codecov Report

Merging #23 (b10e89b) into develop (e735155) will increase coverage by 0.07%. The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop      #23      +/-   ##
===========================================
+ Coverage    95.95%   96.03%   +0.07%     
===========================================
  Files           18       18              
  Lines         1137     1159      +22     
===========================================
+ Hits          1091     1113      +22     
  Misses          46       46

Impacted Files	Coverage Δ
dcor/homogeneity.py	`100.00% <ø> (ø)`
dcor/tests/test_independence.py	`100.00% <ø> (ø)`
dcor/_energy.py	`91.66% <100.00%> (+0.75%)`	:arrow_up:
dcor/tests/test_homogeneity.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e735155...b10e89b. Read the comment docs.

vnmabus commented 3 years ago

* Have I covered all public APIs, ensuring they can all be configured?

I think so, although the documentation for the public function is missing (and in a future PR I should merge the public and _imp versions of the functions, as this was only done to achieve keyword only parameters for Python 2, which is no longer supported).

* The test statistic ends up being negative, and therefore with a p-value of 1 when used to compare a standard normal and t distribution in the `test_different_distributions`. Does this make sense, or is it revealing a flaw in the code somewhere?

You mean the statistic with the mean, median or both?

multimeric commented 3 years ago

You mean the statistic with the mean, median or both?

I mean with the median. Using the mean is one of your tests, which passes.

vnmabus commented 3 years ago

You mean the statistic with the mean, median or both?

I mean with the median. Using the mean is one of your tests, which passes.

Ok, I have checked the implementation and it looks ok. The only explanation that I see is that the differences between the Gaussian and t-Student distributions are in the tails of the distribution, and the median is not taking this information into account. Maybe it will notice the difference with a higher number of samples, but it would be very costly. So I would add a test between two different enough distributions with the same mean, and call it a day, unless you have a better explanation.