python / cpython

The Python programming language
https://www.python.org

Efficiently support weight/frequency mappings in the statistics module #64678

Open ncoghlan opened 10 years ago

ncoghlan commented 10 years ago
BPO 20479
Nosy @rhettinger, @gpshead, @ncoghlan, @stevendaprano, @wm75, @remilapeyre
Dependencies
  • bpo-20478: Avoid inadvertently special casing Counter in statistics module
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/stevendaprano'
    closed_at = None
    created_at =
    labels = ['3.8', 'type-feature', 'library']
    title = 'Efficiently support weight/frequency mappings in the statistics module'
    updated_at =
    user = 'https://github.com/ncoghlan'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'steven.daprano'
    assignee = 'steven.daprano'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'ncoghlan'
    dependencies = ['20478']
    files = []
    hgrepos = []
    issue_num = 20479
    keywords = []
    message_count = 15.0
    messages = ['209931', '209973', '209975', '209985', '210038', '210107', '210108', '305166', '305171', '333978', '334037', '334039', '334101', '334102', '334197']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'gregory.p.smith', 'ncoghlan', 'steven.daprano', 'oscarbenjamin', 'wolma', 'remi.lapeyre']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue20479'
    versions = ['Python 3.8']
    ```

    ncoghlan commented 10 years ago

    bpo-20478 suggests ensuring that even weight/frequency mappings like collections.Counter are consistently handled as iterables in the current statistics module API.

    However, it likely makes sense to provide public APIs that support efficiently working with such weight/frequency mappings directly, rather than requiring that they be expanded to a full iterable all the time.

    One possibility would be to provide parallel APIs with the _map suffix, similar to the format() vs format_map() distinction in the string formatting APIs.
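
    As a rough illustration only (mean_map is a hypothetical name; no such function exists in the statistics module), the parallel-API idea might look like this:

        from collections import Counter
        import statistics

        def mean_map(mapping):
            # Hypothetical parallel API: keys are the values, mapped values
            # are non-negative integer frequencies.
            total = sum(value * freq for value, freq in mapping.items())
            count = sum(mapping.values())
            return total / count

        data = Counter({1: 1, 2: 4, 3: 1, 4: 2})
        print(statistics.mean(data.elements()))  # expands the mapping to a full iterable
        print(mean_map(data))                    # works on the mapping directly; both print 2.5

    With the _map variant the Counter never has to be expanded, which is the efficiency concern raised above.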

    stevendaprano commented 10 years ago

    Off the top of my head, I can think of three APIs:

    (1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev

    (2) treat mappings as implied (value, frequency) pairs

    (3) take an additional argument to switch between unweighted and weighted modes.

    I dislike #3, but will consider the others for 3.5.

    a35477f8-71c1-46d8-a00e-b2f89ce712bf commented 10 years ago

    On 2 February 2014 11:55, Steven D'Aprano <report@bugs.python.org> wrote:

    (1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev

    This would be my preferred approach. It makes it very clear which functions are available for working with map style data. It will be clear from both the module documentation and a casual introspection of the module that those APIs are present for those who might want them. Also apart from mode() the implementation of each function on map-format data will be completely different from the iterable version so you'd want to have it as a separate function at least internally anyway.

    serhiy-storchaka commented 10 years ago

    See also bpo-18844.

    db36b1c5-c904-4d78-ae55-de8a610fb115 commented 10 years ago

    -----Original Message----- From: Steven D'Aprano [mailto:report@bugs.python.org] Sent: Sunday, 2 February 2014 12:55 To: wolfgang.maier@biologie.uni-freiburg.de Subject: [bpo-20479] Efficiently support weight/frequency mappings in the statistics module

    Steven D'Aprano added the comment:

    Off the top of my head, I can think of three APIs:

    (1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev

    (2) treat mappings as implied (value, frequency) pairs

    (2) is clearly my favourite. (1) may work well if only a small fraction of a module's functions need an alternate API. In the statistics module, however, almost all of the current functions could profit from a way to treat mappings specially, and in that case (1) is prone to create lots of redundancy.

    I do not share Oscar's opinion that

    apart from mode() the implementation of each function on map-format data will be completely different from the iterable version so you'd want to have it as a separate function at least internally anyway.

    Consider _sum's current code (docstring omitted for brevity):

        def _sum(data, start=0):
            n, d = _exact_ratio(start)
            T = type(start)
            partials = {d: n}  # map {denominator: sum of numerators}
            # Micro-optimizations.
            coerce_types = _coerce_types
            exact_ratio = _exact_ratio
            partials_get = partials.get
            # Add numerators for each denominator, and track the "current" type.
            for x in data:
                T = _coerce_types(T, type(x))
                n, d = exact_ratio(x)
                partials[d] = partials_get(d, 0) + n
            if None in partials:
                assert issubclass(T, (float, Decimal))
                assert not math.isfinite(partials[None])
                return T(partials[None])
            total = Fraction()
            for d, n in sorted(partials.items()):
                total += Fraction(n, d)
            if issubclass(T, int):
                assert total.denominator == 1
                return T(total.numerator)
            if issubclass(T, Decimal):
                return T(total.numerator)/total.denominator
            return T(total)

    all you'd have to do to treat mappings as proposed here is to add a check whether we are dealing with a mapping, then in this case, instead of the for loop:

        for x in data:
            T = _coerce_types(T, type(x))
            n, d = exact_ratio(x)
            partials[d] = partials_get(d, 0) + n

    use this:

        for x,m in data.items():
            T = _coerce_types(T, type(x))
            n, d = exact_ratio(x)
            partials[d] = partials_get(d, 0) + n*m

    and no other changes (though I haven't tested this carefully).

    Wolfgang

    db36b1c5-c904-4d78-ae55-de8a610fb115 commented 10 years ago

    Well, I was thinking about frequencies (ints) when suggesting

        for x,m in data.items():
            T = _coerce_types(T, type(x))
            n, d = exact_ratio(x)
            partials[d] = partials_get(d, 0) + n*m

    in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

    Wolfgang

    a35477f8-71c1-46d8-a00e-b2f89ce712bf commented 10 years ago

    in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

    I guess you'd do:

        for x, w in data.items():
            T = _coerce_types(T, type(x))
            xn, xd = exact_ratio(x)
            wn, wd = exact_ratio(w)
            partials[xd * wd] = partials_get(xd * wd, 0) + xn * wn

    Variance is only slightly trickier. Median would be more complicated.
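
    For concreteness, here is a minimal sketch (not part of the proposal; using Fraction for exactness is just one possible choice) of a weighted mean and population variance over (value, weight) pairs with arbitrary non-negative weights:

        from fractions import Fraction

        def weighted_mean_var(pairs):
            # Sketch only: exact weighted mean and population variance for
            # (value, weight) pairs; weights are assumed non-negative.
            pairs = [(Fraction(x), Fraction(w)) for x, w in pairs]
            total_w = sum(w for _, w in pairs)
            mean = sum(x * w for x, w in pairs) / total_w
            var = sum(w * (x - mean) ** 2 for x, w in pairs) / total_w
            return mean, var

        print(weighted_mean_var({1: 1, 2: 4, 3: 1, 4: 2}.items()))  # (Fraction(5, 2), Fraction(1, 1))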

    I just think that I prefer to know when I look at code that something is being treated as a mapping or as an iterable. So when I look at

        d = f(x, y, z)
        v = variance_map(d)

    It's immediately obvious what d is and how the function variance_map is using it.

    As well as the benefit of readability there's also the fact that accepting different kinds of input puts strain on any attempt to modify your code in the future. Auditing the code requires understanding at all times that the name "data" is bound to a quantum superposition of different types of object.

    Either every function would have to have the same "iterable or mapping" interface or there would have to be some other convention for making it clear which ones do. Perhaps the functions that don't make sense for a mapping could explicitly reject them rather than treating them as an iterable.

    I just think it's simpler to have a different function name for each type of input. Then it's clear what functions are available for working with mappings.

    If you were going for something completely different then you could have an object-oriented interface where there are classes for the different types of data and methods that do the right thing in each case.

    Then you would do

        v = WeightedData(d).variance()

    The ordinary variance() function could just become a shortcut for

        def variance(data):
            return SequenceData(data).variance()
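
    A rough sketch of what that object-oriented shape could look like (class and method names are hypothetical; this naive version simply expands integer frequencies rather than doing the weighted arithmetic directly):

        import statistics

        class WeightedData:
            # Hypothetical wrapper: accepts a value -> integer frequency mapping
            # and reuses the existing statistics functions on the expanded data.
            def __init__(self, mapping):
                self._data = [x for x, w in mapping.items() for _ in range(w)]

            def mean(self):
                return statistics.mean(self._data)

            def variance(self):
                return statistics.variance(self._data)

        print(WeightedData({1: 1, 2: 4, 3: 1, 4: 2}).variance())  # 1.1428...
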
    rhettinger commented 6 years ago

    My recommendation is to have *weights* as an optional argument:

        statistics.mean(values, weights=None)

    While it is tempting to special case dicts and counters, I got feedback from Jake Vanderplas and Wes McKinney that in practice it is more common to have the weights as a separate list/array/vector.

    That API has other advantages as well. For starters, it is a simple extension of the existing API, so it isn't a disruptive change. Also, it works well with mapping views:

        statistics.mean(vehicle_sales.keys(), vehicle_sales.values())

    And the API also helps support use cases where different weightings are being explored for the same population:

        statistics.mean(salary, years_of_service)
        statistics.mean(salary, education)
        statistics.mean(salary, age)
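
    A minimal sketch of that proposed signature (not the stdlib implementation; error handling is kept to a length check, and the vehicle_sales numbers are made up) showing how it composes with the mapping-view idiom:

        from collections import Counter
        import statistics

        def mean(data, weights=None):
            # Sketch of the proposed API: weights defaults to None, in which
            # case this is just the ordinary arithmetic mean.
            data = list(data)
            if weights is None:
                return statistics.mean(data)
            weights = list(weights)
            if len(weights) != len(data):
                raise ValueError("data and weights must have the same length")
            return sum(x * w for x, w in zip(data, weights)) / sum(weights)

        print(mean([1, 2, 3, 4], [1, 4, 1, 2]))                     # 2.5
        vehicle_sales = Counter({20000: 3, 35000: 1})
        print(mean(vehicle_sales.keys(), vehicle_sales.values()))   # 23750.0
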
    ncoghlan commented 6 years ago

    Thinking back to my signal processing days, I have to agree that our weightings (filter definitions) were usually separate from our data (live signals). Similarly, systems engineering trade studies all maintained feature weights separately from the assessments of the individual options.

    The comment from my original RFE about avoiding expanding value -> weight/frequency mappings "to a full iterable all the time" doesn't actually make much sense in 3.x, since m.keys() and m.values() are now typically able to avoid data copying.

    So +1 from me for the separate "weights" parameter, with the m.keys()/m.values() idiom used to handle mappings like Counter.

    As another point in favour of that approach, it's trivial to build zero-copy weighted variants on top of it for mappings with cheap key and value views:

        def weighted_mean(mapping):
            return statistics.mean(mapping.keys(), mapping.values())

    By contrast, if the lowest level primitive provided is a mapping based API, then when you do have separate values-and-weights iterables, you're going to have a much harder time avoiding building a completely new container.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Is this proposal still relevant? If so, I would like to work on its implementation.

    I think the third proposition, changing the API to take a new weights parameter, is the best, as it does not blindly assume that a tuple is a (value, weight) pair, which may not always be true.

    rhettinger commented 5 years ago

    Is this proposal still relevant? If so, I would like to work on its implementation.

    The first question is the important one. Writing implementations is usually the easy part. Deciding on whether there is a real need and creating a usable, extendable, and clean API is often the hard part. So please don't rush to a PR; that just complicates the important part.

    stevendaprano commented 5 years ago

    Is this proposal still relevant?

    Yes.

    As Raymond says, deciding on a good API is the hard part. It's relatively simple to change a poor implementation for a better one, but backwards compatibility means that changing the API is very difficult.

    I would find it very helpful if somebody has time to do a survey of other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, SAS etc) and see how they handle data with weights.

    At the moment, a simple helper function seems to do the trick for non-negative integer weights:

        def flatten(items):
            for item in items:
                yield from item

        py> data = [1, 2, 3, 4]
        py> weights = [1, 4, 1, 2]
        py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
        2.5

    In principle, the implementation could be as simple as a single recursive call:

        def mean(data, weights=None):
            if weights is not None:
                return mean(flatten([x]*w for x, w in zip(data, weights)))
            # base case without weights is unchanged

    or perhaps it could be just a recipe in the docs.

    a35477f8-71c1-46d8-a00e-b2f89ce712bf commented 5 years ago

    I would find it very helpful if somebody has time to do a survey of other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, SAS etc) and see how they handle data with weights:

    • what APIs do they provide?
    • do they require weights to be positive integers, or do they support arbitrary float weights?
    • including negative weights? (what physical meaning does a negative weight have?)

    Numpy has only sporadic support for this. The standard mean function does not have any way to provide weights but there is an alternative called average that computes the mean and has an optional weights argument. I've never heard of average before searching for "numpy weighted mean" just now. Numpy's API often has bits of old cruft from where various numerical packages were joined together so I'm not sure they would recommend their current approach. I don't think there are any other numpy functions for providing weighted statistics.

    Statsmodels does provide an API for this as explained here: https://stackoverflow.com/a/36464881/9450991 Their API is that you create an object with data and weights and can then call methods/attributes for statistics.
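
    With example data of my own, that statsmodels object-based API looks roughly like this (per the linked answer; not re-verified here):

        import numpy as np
        from statsmodels.stats.weightstats import DescrStatsW

        data = np.array([1.0, 2.0, 3.0, 4.0])
        weights = np.array([1.0, 4.0, 1.0, 2.0])
        stats = DescrStatsW(data, weights=weights)
        print(stats.mean)  # weighted mean, 2.5 here
        print(stats.std)   # weighted standard deviation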

    Matlab doesn't even support a weighted mean as far as I can tell. There is wmean on the matlab file exchange:

    a35477f8-71c1-46d8-a00e-b2f89ce712bf commented 5 years ago

    Sorry, sent too soon...

    Matlab doesn't even support a weighted mean as far as I can tell. There is wmean on the matlab file exchange: https://stackoverflow.com/a/36464881/9450991

    This is a separate function wmean(data, weights). It has to be a separate function though because it's third party code so the author couldn't change the main mean function.

    R ships with a weighted.mean function but I think for standard deviation you need third party libs.

    This is only a quick survey, but the main impression I get is that providing an API for this is not that common. The only good-looking API is the statsmodels one.

    stevendaprano commented 5 years ago

    Here is some further information on weights in statistics in general, and SAS and Stata specifically:

    https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

    Quote:

    use the FREQ statement to specify integer frequencies for 
    repeated observations. Use the WEIGHT statement when you 
    want to decrease the influence that certain observations 
    have on the parameter estimates. 

    http://support.sas.com/kb/22/600.html

    https://www.stata.com/manuals13/u20.pdf#u20.23

    Executive summary:

      mean([1, 2, 3, 4], freq=[1, 0, 3, 1])

    would be equivalent to mean([1, 3, 3, 3, 4]).
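
    A quick check of that equivalence (freq is of course a hypothetical keyword here; the expansion is done by hand):

        import statistics

        data, freq = [1, 2, 3, 4], [1, 0, 3, 1]
        expanded = [x for x, f in zip(data, freq) for _ in range(f)]
        assert expanded == [1, 3, 3, 3, 4]
        print(statistics.mean(expanded))  # 2.8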

    The weighted arithmetic mean itself is defined here:

    https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition

    but how that extends to more complex SAS functions is unclear to me.

    (And for what it's worth, I don't think SAS's MEAN function supports weights at all. Any SAS users here that could comment?)