sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.47k stars 485 forks source link

Deprecate sage.stats.basic_stats #29662

Closed kcrisman closed 3 years ago

kcrisman commented 4 years ago

Right now in sage stats there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!

In this ticket, we deprecate sage.stats.basic_stats, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.

Note that Python 3 comes with a basic built-in stats module, but much of its functionality is incompatible with Sage's numbers in #28234.

In the deprecation messages, we instead point users to suitable numpy and scipy functions, as well as to pandas (for moving_average).

See also #29663

CC: @dcoudert @tscrim

Component: statistics

Author: Matthias Koeppe

Branch: 7e7331a

Reviewer: David Coudert

Issue created by migration from https://trac.sagemath.org/ticket/29662

dimpase commented 4 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,9 @@
 Right now in [sage stats](http://doc.sagemath.org/html/en/reference/stats/index.html) there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!

 This ticket is to split the more technical stuff (which presumably may still be used for researchers, but not for the sort of things basic R or pandas data frames would be) into a separate module where it can be taken care of.
+
+Note that Python 3 comes with built-in stats module.
+https://docs.python.org/3/library/statistics.html
+
+See also #29662
+
kcrisman commented 4 years ago
comment:2

Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several Integers or even stranger objects, as-is.

kcrisman commented 4 years ago

Description changed:

--- 
+++ 
@@ -5,5 +5,5 @@
 Note that Python 3 comes with built-in stats module.
 https://docs.python.org/3/library/statistics.html

-See also #29662
+See also #29663
dimpase commented 4 years ago
comment:4

Replying to @kcrisman:

Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several Integers or even stranger objects, as-is.

this might be an upstream bug. One should check whether this works with unusual numpy types, by the way. If it doesn't we have a case...

mkoeppe commented 3 years ago
comment:5

Unfortunately the Python3 statistics module is not compatible with Sage numbers. This is coming from #28234.

sage: import statistics                                                                                                                                                 
sage: statistics.mean([1, 2])                                                                                                                                           
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-c5dde83ca691> in <module>
----> 1 statistics.mean([Integer(1), Integer(2)])

/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in mean(data)
    314     if n < 1:
    315         raise StatisticsError('mean requires at least one data point')
--> 316     T, total, count = _sum(data)
    317     assert count == n
    318     return _convert(total / n, T)

/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in _sum(data, start)
    166         for n, d in map(_exact_ratio, values):
    167             count += 1
--> 168             partials[d] = partials_get(d, 0) + n
    169     if None in partials:
    170         # The sum will be a NAN or INF. We can ignore all the finite

TypeError: unsupported operand type(s) for +: 'int' and 'builtin_function_or_method'
mkoeppe commented 3 years ago
comment:6

So to get rid of the single use of sage.stats within sage.graphs, a workaround like in this commit is needed. I don't know if this is acceptable

mkoeppe commented 3 years ago

Author: Matthias Koeppe

mkoeppe commented 3 years ago
comment:7

Python 3.8 introduces https://docs.python.org/3/library/statistics.html#statistics.fmean, which coerces to float first and therefore works fine; but we cannot use it yet as we still support Python 3.7. Also geometric_mean (Python >= 3.8), median, median_low, median_high, median_grouped work fine for the same reason.

mode, multimode, and quantiles also work fine. They do not convert to float.

OTOH, harmonic_mean, pstdev, pvariance, stdev, variance all lead to the same error as mean.

mkoeppe commented 3 years ago

Branch: u/mkoeppe/clarify_stats_module_role

mkoeppe commented 3 years ago
comment:9

numpy and scipy.stats functions do not have this problem; they just convert everything to numpy.float64 first.

So if we want to deprecate sage.stats.basic_stats, the deprecation messages should probably send users to a combination of numpy and scipy functions.


New commits:

2be6a71GenericGraph.clustering_average: Replace use of sage.stats.basic_stats by statistics
mkoeppe commented 3 years ago

Commit: 2be6a71

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 2be6a71 to d10694a

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

d10694asage.stats.basic_stats: Add deprecations for all functions
dcoudert commented 3 years ago
comment:13

Currently, clustering_coeff returns a rational when implementation is "dense_copy" or "sparse_copy". Unfortunately, the rational field of sage is currently not able to take as input the type Fraction from Python library fractions.

A solution could be

from statistics import mean
from fractions import Fraction
return QQ(str(mean([Fraction(str(x)) for x in G.clustering_coeff(implementation=implementation).values()])))
mkoeppe commented 3 years ago
comment:14

OK. Sounds like having the result in QQ is important to you.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

933f0aasage.stats.basic_stats: Add deprecations for all functions
d3d374bGenericGraph.clustering_average: Remove use of sage.stats.basic_stats
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from d10694a to d3d374b

mkoeppe commented 3 years ago
comment:16

... so instead of going through Fraction I have just replaced it by sum followed by division

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from d3d374b to 03e926f

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

03e926fsrc/sage/stats/basic_stats.py: Update doctest outputs with deprecation warnings
dcoudert commented 3 years ago

Reviewer: David Coudert

dcoudert commented 3 years ago
comment:19

LGTM. Thank you.

mkoeppe commented 3 years ago
comment:20

Thanks!

vbraun commented 3 years ago
comment:21
...
    :
    DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
    See https://trac.sagemath.org/29662 for details.
**********************************************************************
1 item had failures:
   1 of  14 in sage.crypto.lwe.LWE.__init__
    [111 tests, 1 failure, 0.32 s]
----------------------------------------------------------------------
sage -t --long --warn-long 46.3 --random-seed=0 src/sage/crypto/lwe.py  # 1 doctest failed
----------------------------------------------------------------------
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 03e926f to 5966df9

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

5966df9src/sage/crypto/lwe.py: Replace use of sage.stats.basic_stats in doctest by numpy
dcoudert commented 3 years ago
comment:24

The patchbot reports several errors due to mean

sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_lattice.py  # 3 doctests failed
sage -t --long --random-seed=0 src/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst  # 2 doctests failed
sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_polynomial.py  # 1 doctest failed
sage -t --long --random-seed=0 src/sage/coding/information_set_decoder.py  # 2 doctests failed
sage -t --long --random-seed=0 src/sage/graphs/graph_generators_pyx.pyx  # 1 doctest failed
mkoeppe commented 3 years ago

Description changed:

--- 
+++ 
@@ -1,9 +1,11 @@
 Right now in [sage stats](http://doc.sagemath.org/html/en/reference/stats/index.html) there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!

-This ticket is to split the more technical stuff (which presumably may still be used for researchers, but not for the sort of things basic R or pandas data frames would be) into a separate module where it can be taken care of.
+In this ticket, we deprecate `sage.stats.basic_stats`, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example, `mean` happens to work with vectors (which is used in some doctests elsewhere in sage), but `variance` does not.

-Note that Python 3 comes with built-in stats module.
-https://docs.python.org/3/library/statistics.html
+Note that Python 3 comes with a basic [built-in stats module](https://docs.python.org/3/library/statistics.html), but much of its functionality is incompatible with Sage's numbers in #28234.
+
+In the deprecation messages, we instead point users to suitable `numpy` and `scipy` functions, as well as to `pandas` (for `moving_average`).
+

 See also #29663
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 5966df9 to 88745e1

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

88745e1src/sage/stats/distributions/discrete_gaussian_lattice.py: Remove use of sage.basic_stats.mean in doctests
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 88745e1 to c6e40a5

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

fb10839src/sage/stats/distributions/discrete_gaussian_polynomial.py: Remove use of sage.stats.basic_stats.mean in doctest
768ebdfsrc/sage/graphs/graph_generators_pyx.pyx: Remove use of sage.stats.basic_stats.mean in doctest
c6e40a5src/sage/coding/information_set_decoder.py: Remove use of sage.stats.basic_stats; make imports more specific
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from c6e40a5 to 7e7331a

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

7e7331asrc/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst: Update, stop recommending deprecated functions
dcoudert commented 3 years ago
comment:30

I think it's OK now. We have a few pyflake warnings that can be fixed afterward.

mkoeppe commented 3 years ago
comment:31

Thanks for reviewing!

vbraun commented 3 years ago

Changed branch from u/mkoeppe/clarify_stats_module_role to 7e7331a

williamstein commented 2 years ago
comment:33

The definition of "numpy.mean" is different than Sage's built in mean:

sage: mean([vector([1,2]),vector([3,4])])
<ipython-input-4-37be8bb3881f>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
  mean([vector([Integer(1),Integer(2)]),vector([Integer(3),Integer(4)])])
(2, 3)
sage: numpy.mean([vector([1,2]),vector([3,4])])
2.5

Also, one of the examples in the Sage docs involving mean doesn't work with numpy.mean:

sage: v = stats.TimeSeries([1..10])
sage: mean(v)
<ipython-input-13-9d6545956b1d>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
  mean(v)
5.5
sage: numpy.mean(v)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-2e91276b932b> in <module>
----> 1 numpy.mean(v)

<__array_function__ internals> in mean(*args, **kwargs)

/ext/sage/9.5/local/var/lib/sage/venv-python3.9.9/lib/python3.9/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims, where)
   3436             pass
   3437         else:
-> 3438             return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3439 
   3440     return _methods._mean(a, axis=axis, dtype=dtype,

TypeError: TimeSeries.mean() takes no keyword arguments

Maybe this is an argument to improve variance, rather than delete it?

For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.

williamstein commented 2 years ago

Changed commit from 7e7331a to none

dimpase commented 2 years ago
comment:34

perhaps removing mean, variance, etc. from global namespace would be enough for the time being.

videlec commented 2 years ago
comment:35

Replying to @dimpase:

perhaps removing mean, variance, etc. from global namespace would be enough for the time being.

I think that having sage versions of mean and variance in the global namespace are useful to many users. Much more than EllipticCurve let say.

numpy and python are floating point focused when it comes to mathematics. This is not what sage tries to teach.

dimpase commented 2 years ago
comment:36

By default, EllipticCurve, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.

By the way, one of the worst is currently sum, which overloads Python's sum at Sage prompt, but does not do this in the library.

Having said that, nothing then would prevent one from having "classical Sage" module, which populates the global namespace with stuff.

nbruin commented 2 years ago
comment:37

Replying to @dimpase:

By default, EllipticCurve, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.

Indeed, and then with from sage.all import * people can get a standard set of stuff in their global namespace. And the sage command environment should probably preload that, just as the preparser. It's great if sage can work like a python package, but the sage-specific environment is very valuable for quick stuff. And for, for instance, teaching, it is very useful if there is a useful "standard" set, if only so that it's easily available in sagecell.

It's valuable that in maple, maxima, mathematica, you can start right away with typing in math one-liners and get results, without having to execute boiler plate. We have that in sagemath now as well. I think it would reduce the usability of sagemath in many informal settings if we were to do away with it. A python library is just not a very good match for what people expect from an interactive CAS, so I think it's good we have a shim layer that makes sagemath behave a little more like a traditional CAS.

slel commented 2 years ago
comment:38

Discussions after this ticket was merged:

slel commented 2 years ago
comment:39

Follow-up ticket: #33432

videlec commented 2 years ago
comment:40

Follow-up ticket: #33453