python / cpython

The Python programming language
https://www.python.org
Other
63.72k stars 30.53k forks source link

Improve docs for NormalDist #82086

Closed e1afdd59-e3e8-4817-96a8-290a8b7436c4 closed 5 years ago

e1afdd59-e3e8-4817-96a8-290a8b7436c4 commented 5 years ago
BPO 37905
Nosy @rhettinger, @mdickinson, @stevendaprano
PRs
  • python/cpython#15486
  • python/cpython#15487
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', '3.8', 'docs'] title = 'Improve docs for NormalDist' updated_at = user = 'https://bugs.python.org/ChristophDeil' ``` bugs.python.org fields: ```python activity = actor = 'Christoph.Deil' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Documentation'] creation = creator = 'Christoph.Deil' dependencies = [] files = [] hgrepos = [] issue_num = 37905 keywords = ['patch'] message_count = 9.0 messages = ['350076', '350094', '350100', '350104', '350105', '350437', '350438', '350439', '350612'] nosy_count = 4.0 nosy_names = ['rhettinger', 'mark.dickinson', 'steven.daprano', 'Christoph.Deil'] pr_nums = ['15486', '15487'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue37905' versions = ['Python 3.8'] ```

    e1afdd59-e3e8-4817-96a8-290a8b7436c4 commented 5 years ago

    I saw that Python 3.8 will add a NormalDist class: https://docs.python.org/3.8/library/statistics.html#normaldist-objects

    Personally I don't see the value of adding this to the Python standard lib. The natural progression would be to extend and extend, but in the end only duplicate what already exists in scientific Python packages. But Ok, I guess this is not up for debate any more?

    I'd like to make a specific comment on NormalDist.overlap. The rest of NormalDist is very standard, but that method is an oddball. My suggestion is to remove it or to improve the documentation.

    Current docstring: https://github.com/python/cpython/blob/44f2c096804e8e3adc09400a59ef9c9ae843f339/Lib/statistics.py#L959-L991

    And this docs example: https://github.com/python/cpython/commit/318d537daabf2bd5f781255c7e25bfce260cf227#diff-d436928bc44b5d7c40a8047840f55d35R620-R629

    What percentage of men and women will have the same height in two normally distributed populations with known means and standard deviations <http://www.usablestats.com/lessons/normal>_?

    50.3%

    This statement doesn't make sense to me. No two people have the exact same height, I think the answer to this question should be 0%.

    Using

    n = 100_000; sum(m > w for m, w in zip(men.samples(n), women.samples(n))) / n

    I see that for 82% of random (men, women) matches the man will be larger. That's another measure, but still, stating that 50% of men and women have the same height is confusing.

    Note that there is a multitude of PDF overlap measures different from this min(pdf1, pdf2) that I think are much more common in statistics and the physical sciences:

    And note that the references that are given currently are weird (basic statistics textbooks would be appropriate references IMO, or open references like Wikipedia)

    Why add this one overlap measure and expose it under the "overlap" method name?

    My suggestion would be to be conservative and to remove that method again, before releasing it in 3.8. A reference in the docs could be added to other existing third-party codes (e.g. scipy or the uncertainties package) with further functionality, such as being able to handle correlations or multi-dimensional distributions. For this change I'd be happy to send a PR any time.

    Raymond and others interested in this topic - thoughts?

    (note: I wrote a MultiNorm class prototype last year at https://github.com/cdeil/multinorm/blob/master/multinorm.py and now wanted to rewrite it and try to find a good API and thus was interested in this NormalDist class and what functionality it offers)

    e1afdd59-e3e8-4817-96a8-290a8b7436c4 commented 5 years ago

    The Monte Carlo example here has completely unstable results:

    https://github.com/python/cpython/commit/cc353a0cd95d9b0c93ed0b60ba762427a94c790d#diff-d436928bc44b5d7c40a8047840f55d35R633

    If you run it multiple times, you will see that mean is relatively stable, but stddev varies from 10 to 50 to 100. The reason is that in the model there's a division by z, and the z distribution used has values arbitrarily close to zero:

    >>> NormalDist(5, 1.25).cdf(0) * 100_000
    3.16

    Suggest to change to a MC sampling example that isn't as pathological, doesn't involve division by zero. E.g. change the mean of z to 50, or reduce the stddev to 0.125 or some such change in parameters.

    Usually in stats or machine learning books and docs e.g. on statsmodels or scikit-learn etc., for methods where random numbers are involved, the seed is always set to a fixed value, to have reproducible results & docs. Suggest to make that change also here.

    rhettinger commented 5 years ago

    Several thoughts:

    Perhaps, the wording can be improved on the male/female height example. Measured to finite precision, perhaps to the nearest centimeter, there will be overlaps. This is same kind of binning done with chi-square tests to compare how well two distributions match.

    AFAICT, this tool is well-defined, tested, and has legitimate use cases that are easy to achieve in other ways using only standard library tools.

    rhettinger commented 5 years ago

    BTW, I get your concern about the statistics module as a whole. From the point of view of an expert numpy/scipy user, the whole module seems pointless. However, the purpose of the module is to put a useful subset of statistical tools into the hands of everyday Python users who aren't part of that numeric ecosystem (think of the same people who use MS Excel as part of this group). The module doesn't require extra pip installation, an Anaconda distribution, or even knowledge of array broadcasting and whatnot.

    For the past few months, I've been user testing the new components of the statistics module and have had good success. Some of the examples in the docs were born from those interactions.

    I also get your concern about what is usually found in statistics textbooks; however, those books tend to cover a wide range of distributions, include proofs, and heavily weight hypothesis testing. Typically, little space is given to descriptive statistics, q-q plots, or other things that are handy in day-to-day practice.

    The NormalDist class encapsulates a lot of knowledge that is easily forgotten (that variances are additive, how to translate and rescale), or that a constant divided by a normal distribution doesn't give another normal distribution. I've tried this out on otherwise not mathematically inclined users and they've found it to be useful and intuitive. In contrast, the scipy ecosystem presumes much more sophistication.

    rhettinger commented 5 years ago

    Raymond and others interested in this topic - thoughts?

    Please do submit a PR with an improved example for the MonteCarlo simulation. I'm not fond of that example at all. It should be as short as possible while getting the core idea across. But it should be something that doesn't have a simple analytic solution so as to motivate the concept. Go ahead and use a fixed numeric seed as well.

    rhettinger commented 5 years ago

    New changeset 8371799e300475c8f9f967e900816218d3500e5d by Raymond Hettinger in branch 'master': bpo-37905: Improve docs for NormalDist (GH-15486) https://github.com/python/cpython/commit/8371799e300475c8f9f967e900816218d3500e5d

    rhettinger commented 5 years ago

    New changeset 970548c00b366dcb8eb0c2bec0ffcab30ba03aee by Raymond Hettinger (Miss Islington (bot)) in branch '3.8': bpo-37905: Improve docs for NormalDist (GH-15486) (GH-15487) https://github.com/python/cpython/commit/970548c00b366dcb8eb0c2bec0ffcab30ba03aee

    rhettinger commented 5 years ago
    e1afdd59-e3e8-4817-96a8-290a8b7436c4 commented 5 years ago

    Thank you, Raymond!