Closed kcrisman closed 3 years ago
Description changed:
---
+++
@@ -1,3 +1,9 @@
Right now in [sage stats](http://doc.sagemath.org/html/en/reference/stats/index.html) there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!
This ticket is to split the more technical stuff (which presumably may still be used for researchers, but not for the sort of things basic R or pandas data frames would be) into a separate module where it can be taken care of.
+
+Note that Python 3 comes with built-in stats module.
+https://docs.python.org/3/library/statistics.html
+
+See also #29662
+
Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several Integer
s or even stranger objects, as-is.
Description changed:
---
+++
@@ -5,5 +5,5 @@
Note that Python 3 comes with built-in stats module.
https://docs.python.org/3/library/statistics.html
-See also #29662
+See also #29663
Replying to @kcrisman:
Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several
Integer
s or even stranger objects, as-is.
this might be an upstream bug. One should check whether this works with unusual numpy types, by the way. If it doesn't we have a case...
Unfortunately the Python3 statistics
module is not compatible with Sage numbers. This is coming from #28234.
sage: import statistics
sage: statistics.mean([1, 2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-c5dde83ca691> in <module>
----> 1 statistics.mean([Integer(1), Integer(2)])
/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in mean(data)
314 if n < 1:
315 raise StatisticsError('mean requires at least one data point')
--> 316 T, total, count = _sum(data)
317 assert count == n
318 return _convert(total / n, T)
/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in _sum(data, start)
166 for n, d in map(_exact_ratio, values):
167 count += 1
--> 168 partials[d] = partials_get(d, 0) + n
169 if None in partials:
170 # The sum will be a NAN or INF. We can ignore all the finite
TypeError: unsupported operand type(s) for +: 'int' and 'builtin_function_or_method'
So to get rid of the single use of sage.stats
within sage.graphs
, a workaround like in this commit is needed. I don't know if this is acceptable
Author: Matthias Koeppe
Python 3.8 introduces https://docs.python.org/3/library/statistics.html#statistics.fmean, which coerces to float
first and therefore works fine; but we cannot use it yet as we still support Python 3.7. Also geometric_mean
(Python >= 3.8), median
, median_low
, median_high
, median_grouped
work fine for the same reason.
mode
, multimode
, and quantiles
also work fine. They do not convert to float
.
OTOH, harmonic_mean
, pstdev
, pvariance
, stdev
, variance
all lead to the same error as mean
.
numpy and scipy.stats functions do not have this problem; they just convert everything to numpy.float64
first.
So if we want to deprecate sage.stats.basic_stats
, the deprecation messages should probably send users to a combination of numpy and scipy functions.
New commits:
2be6a71 | GenericGraph.clustering_average: Replace use of sage.stats.basic_stats by statistics |
Branch pushed to git repo; I updated commit sha1. New commits:
d10694a | sage.stats.basic_stats: Add deprecations for all functions |
Currently, clustering_coeff
returns a rational when implementation
is "dense_copy"
or "sparse_copy"
. Unfortunately, the rational field of sage is currently not able to take as input the type Fraction
from Python library fractions
.
A solution could be
from statistics import mean
from fractions import Fraction
return QQ(str(mean([Fraction(str(x)) for x in G.clustering_coeff(implementation=implementation).values()])))
OK. Sounds like having the result in QQ is important to you.
... so instead of going through Fraction
I have just replaced it by sum
followed by division
Branch pushed to git repo; I updated commit sha1. New commits:
03e926f | src/sage/stats/basic_stats.py: Update doctest outputs with deprecation warnings |
Reviewer: David Coudert
LGTM. Thank you.
Thanks!
...
:
DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
**********************************************************************
1 item had failures:
1 of 14 in sage.crypto.lwe.LWE.__init__
[111 tests, 1 failure, 0.32 s]
----------------------------------------------------------------------
sage -t --long --warn-long 46.3 --random-seed=0 src/sage/crypto/lwe.py # 1 doctest failed
----------------------------------------------------------------------
Branch pushed to git repo; I updated commit sha1. New commits:
5966df9 | src/sage/crypto/lwe.py: Replace use of sage.stats.basic_stats in doctest by numpy |
The patchbot reports several errors due to mean
sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_lattice.py # 3 doctests failed
sage -t --long --random-seed=0 src/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst # 2 doctests failed
sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_polynomial.py # 1 doctest failed
sage -t --long --random-seed=0 src/sage/coding/information_set_decoder.py # 2 doctests failed
sage -t --long --random-seed=0 src/sage/graphs/graph_generators_pyx.pyx # 1 doctest failed
Description changed:
---
+++
@@ -1,9 +1,11 @@
Right now in [sage stats](http://doc.sagemath.org/html/en/reference/stats/index.html) there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!
-This ticket is to split the more technical stuff (which presumably may still be used for researchers, but not for the sort of things basic R or pandas data frames would be) into a separate module where it can be taken care of.
+In this ticket, we deprecate `sage.stats.basic_stats`, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example, `mean` happens to work with vectors (which is used in some doctests elsewhere in sage), but `variance` does not.
-Note that Python 3 comes with built-in stats module.
-https://docs.python.org/3/library/statistics.html
+Note that Python 3 comes with a basic [built-in stats module](https://docs.python.org/3/library/statistics.html), but much of its functionality is incompatible with Sage's numbers in #28234.
+
+In the deprecation messages, we instead point users to suitable `numpy` and `scipy` functions, as well as to `pandas` (for `moving_average`).
+
See also #29663
Branch pushed to git repo; I updated commit sha1. New commits:
88745e1 | src/sage/stats/distributions/discrete_gaussian_lattice.py: Remove use of sage.basic_stats.mean in doctests |
Branch pushed to git repo; I updated commit sha1. New commits:
fb10839 | src/sage/stats/distributions/discrete_gaussian_polynomial.py: Remove use of sage.stats.basic_stats.mean in doctest |
768ebdf | src/sage/graphs/graph_generators_pyx.pyx: Remove use of sage.stats.basic_stats.mean in doctest |
c6e40a5 | src/sage/coding/information_set_decoder.py: Remove use of sage.stats.basic_stats; make imports more specific |
Branch pushed to git repo; I updated commit sha1. New commits:
7e7331a | src/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst: Update, stop recommending deprecated functions |
I think it's OK now. We have a few pyflake warnings that can be fixed afterward.
Thanks for reviewing!
Changed branch from u/mkoeppe/clarify_stats_module_role to 7e7331a
The definition of "numpy.mean" is different than Sage's built in mean:
sage: mean([vector([1,2]),vector([3,4])])
<ipython-input-4-37be8bb3881f>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
mean([vector([Integer(1),Integer(2)]),vector([Integer(3),Integer(4)])])
(2, 3)
sage: numpy.mean([vector([1,2]),vector([3,4])])
2.5
Also, one of the examples in the Sage docs involving mean doesn't work with numpy.mean:
sage: v = stats.TimeSeries([1..10])
sage: mean(v)
<ipython-input-13-9d6545956b1d>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
mean(v)
5.5
sage: numpy.mean(v)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-2e91276b932b> in <module>
----> 1 numpy.mean(v)
<__array_function__ internals> in mean(*args, **kwargs)
/ext/sage/9.5/local/var/lib/sage/venv-python3.9.9/lib/python3.9/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims, where)
3436 pass
3437 else:
-> 3438 return mean(axis=axis, dtype=dtype, out=out, **kwargs)
3439
3440 return _methods._mean(a, axis=axis, dtype=dtype,
TypeError: TimeSeries.mean() takes no keyword arguments
Maybe this is an argument to improve variance, rather than delete it?
For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.
Changed commit from 7e7331a
to none
perhaps removing mean
, variance
, etc. from global namespace would be enough for the time being.
Replying to @dimpase:
perhaps removing
mean
,variance
, etc. from global namespace would be enough for the time being.
I think that having sage versions of mean
and variance
in the global namespace are useful to many users. Much more than EllipticCurve
let say.
numpy and python are floating point focused when it comes to mathematics. This is not what sage tries to teach.
By default, EllipticCurve
, or in fact anything that's not in vanilla Python, should not be in global namespace.
This is in line with other Python packages.
By the way, one of the worst is currently sum
, which overloads Python's sum at Sage prompt, but does not do this in the library.
Having said that, nothing then would prevent one from having "classical Sage" module, which populates the global namespace with stuff.
Replying to @dimpase:
By default,
EllipticCurve
, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.
Indeed, and then with from sage.all import *
people can get a standard set of stuff in their global namespace. And the sage
command environment should probably preload that, just as the preparser. It's great if sage can work like a python package, but the sage-specific environment is very valuable for quick stuff. And for, for instance, teaching, it is very useful if there is a useful "standard" set, if only so that it's easily available in sagecell
.
It's valuable that in maple, maxima, mathematica, you can start right away with typing in math one-liners and get results, without having to execute boiler plate. We have that in sagemath now as well. I think it would reduce the usability of sagemath in many informal settings if we were to do away with it. A python library is just not a very good match for what people expect from an interactive CAS, so I think it's good we have a shim layer that makes sagemath behave a little more like a traditional CAS.
Discussions after this ticket was merged:
Follow-up ticket: #33432
Follow-up ticket: #33453
Right now in sage stats there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!
In this ticket, we deprecate
sage.stats.basic_stats
, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example,mean
happens to work with vectors (which is used in some doctests elsewhere in sage), butvariance
does not.Note that Python 3 comes with a basic built-in stats module, but much of its functionality is incompatible with Sage's numbers in #28234.
In the deprecation messages, we instead point users to suitable
numpy
andscipy
functions, as well as topandas
(formoving_average
).See also #29663
CC: @dcoudert @tscrim
Component: statistics
Author: Matthias Koeppe
Branch:
7e7331a
Reviewer: David Coudert
Issue created by migration from https://trac.sagemath.org/ticket/29662