sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.44k stars 480 forks source link

statistics #33453

Open videlec opened 2 years ago

videlec commented 2 years ago

We implement a sage compatible version of the Python statistics module in stats/statistics.py. In particular, it will provide the mean and median functions that were recently deprecated for removal in #29662. See also #33432.

See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics

CC: @mkoeppe

Component: statistics

Author: Vincent Delecroix

Branch/Commit: u/vdelecroix/33453 @ 9e82df1

Reviewer: Matthias Koeppe, David Roe

Issue created by migration from https://trac.sagemath.org/ticket/33453

videlec commented 2 years ago

Description changed:

--- 
+++ 
@@ -1 +1 @@
-We implement a sage compatible version of the [statistics module](https://docs.python.org/3/library/statistics.html) in `stats/statistics.py`. In particular, it will provide the `mean` and `median` functions that were recently deprecated for removal in #29662. See also #33432.
+We implement a sage compatible version of the [Python statistics module](https://docs.python.org/3/library/statistics.html) in `stats/statistics.py`. In particular, it will provide the `mean` and `median` functions that were recently deprecated for removal in #29662. See also #33432.
videlec commented 2 years ago

New commits:

3e1375533453: a basic statistics module
videlec commented 2 years ago

Commit: 3e13755

videlec commented 2 years ago

Branch: u/vdelecroix/33453

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 3e13755 to 0abcc58

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

0abcc5833453: add statistics module to doc
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

304d7db33453: a basic statistics module
2dc0bac33453: add statistics module to doc
0a44a4f33453: mitigate the effects of 29662
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 0abcc58 to 0a44a4f

kcrisman commented 2 years ago
comment:6

Thank you, this looks like a great addition which should take care of the necessary problems but continue providing what is needed. I didn't see any obvious problems in the code but it would be good to get a second more detailed eye.

mkoeppe commented 2 years ago
comment:7

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

(Also, the argument is called data, not v.)

mkoeppe commented 2 years ago
comment:8

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

videlec commented 2 years ago
comment:10

Replying to @mkoeppe:

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.

(Also, the argument is called data, not v.)

Is the argument name really relevant? Though data is definitely much better.

mkoeppe commented 2 years ago
comment:11

Replying to @videlec:

Replying to @mkoeppe:

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.

That makes no sense - the point of the module is to be compatible not with numpy but with the built-in statistics module.

mkoeppe commented 2 years ago
comment:12

Replying to @videlec:

(Also, the argument is called data, not v.)

Is the argument name really relevant? Though data is definitely much better.

Yes, because the signature is mean(data), not mean(data, /), users are allowed to call it as mean(data=...).

videlec commented 2 years ago
comment:13

Note thate Python argument naming is awful

mkoeppe commented 2 years ago
comment:14

Compatibility > beauty

videlec commented 2 years ago
comment:15

Replying to @mkoeppe:

Compatibility > beauty

This is not about beauty but coherence. However this is a very minor point. It is perfectly fine to keep as much as the Python world as we can.

videlec commented 2 years ago
comment:16

More importantly

mkoeppe commented 2 years ago
comment:17

The distinction of xbar and mu is deliberate, see discussion in https://bugs.python.org/issue20389

mkoeppe commented 2 years ago
comment:18

Replying to @videlec:

More importantly

  • if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?

Computing it via the numpy method I think is a good idea; but the result type / error handling must be compatible with the other types.

  • If data only contains builtin Python data (int, float, Fraction, Decimal, ...) should the code simply transfer to the Python statistics module? Usually, sage functions tend to use py_scalar_to_element which does convert int -> Integer, float -> RealNumber, etc

I think it would make sense to always make the result a Sage type, via py_scalar_to_element if necessary

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 0a44a4f to a60c096

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

a60c09633453: follow python specifications + improved doc
videlec commented 2 years ago
comment:21

Replying to @mkoeppe:

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

Not sure about what you meant here.

mkoeppe commented 2 years ago
comment:22

Can't just replace sage.stats.basic_stats.mean by lazy_import('sage.stats.statistics', 'mean', deprecation=33453). They have a different specification.

davidlowryduda commented 2 years ago
comment:24

I have a very minor observation. I think there are a few typos in sage.stats.statistics in the updated deprecation notices:

--- basic_stats.sage
+++ basic_stats.sage with small changes
@@ -76,7 +76,7 @@

         sage: std([1..6], bias=True)
         doctest:warning...
-        DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statstics.stdev or sage.stats.statistics.pstdev instead
+        DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead
         See https://trac.sagemath.org/33453 for details.
         1/2*sqrt(35/3)
         sage: std([1..6], bias=False)
@@ -106,7 +106,7 @@
         sage: std(data)  # random
         0.29487771726609185
     """
-    deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statstics.stdev or sage.stats.statistics.pstdev instead')
+    deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead')

     if hasattr(v, 'standard_deviation'):
         return v.standard_deviation(bias=bias)
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

7b8ba3c33453: mitigate the effects of 29662
9c82ac933453: follow python specifications + improved doc
002b83a33453: cleaning
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from a60c096 to 002b83a

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

7cf846c33453: statstics -> statistics
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 002b83a to 7cf846c

mkoeppe commented 2 years ago
comment:28

Replying to @videlec:

Replying to @mkoeppe:

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

Not sure about what you meant here.

     We define the mean of the empty list to be the (symbolic) NaN,
     following the convention of MATLAB, Scipy, and R.

-    This function is deprecated.  Use ``numpy.mean`` or ``numpy.nanmean``
-    instead.
+    This function is deprecated.  Use ``sage.stats.statistics.mean`` instead. The
+    differences with this function are
+
+    - the code does not try to call ``v.mean()``
+    - raises an error on empty input

Stuff like this -- please don't remove cross-references to numpy.

mkoeppe commented 2 years ago
comment:29

In

+    if is_numpy_type(type(data)):
+        import numpy
+        if isinstance(data, numpy.ndarray):
+            return data.mean()

I think the result should be coerced to a Sage number type (comment:18)

videlec commented 2 years ago
comment:30

Replying to @mkoeppe:

In

+    if is_numpy_type(type(data)):
+        import numpy
+        if isinstance(data, numpy.ndarray):
+            return data.mean()

I think the result should be coerced to a Sage number type (comment:18)

I don't like the conversion afterwards so much. The mean of a list of integers is a floating point in numpy.

sage: data = [1,2,7,-11,15,23]
sage: statistics.mean(data)
37/6
sage: import sage.stats.statistics as statistics
sage: import numpy
sage: statistics.mean(numpy.array(data))
6.166666666666667

Maybe I should just remove these numpy shortcuts and simply document how to use the proper numpy methods in the documentation only?

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 7cf846c to 9e82df1

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

9e82df133453: clean doc
roed314 commented 2 years ago
comment:32

There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.

roed314 commented 2 years ago

Reviewer: David Roe

videlec commented 2 years ago
comment:33

Replying to @roed314:

There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.

Please don't. comment:30 has to be sorted out first.

Also, I would like to mitigate the warnings in basic_stats. Currently, the only deprecated behaviour of mean(v) are when either

mkoeppe commented 2 years ago

Changed reviewer from David Roe to Matthias Koeppe, David Roe

mkoeppe commented 2 years ago

Description changed:

--- 
+++ 
@@ -1 +1,3 @@
 We implement a sage compatible version of the [Python statistics module](https://docs.python.org/3/library/statistics.html) in `stats/statistics.py`. In particular, it will provide the `mean` and `median` functions that were recently deprecated for removal in #29662. See also #33432.
+
+See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics
roed314 commented 1 year ago

I just hit this deprecation warning again. It would be nice to remove it.