ncoghlan opened 10 years ago
bpo-20478 suggests ensuring that even weight/frequency mappings like collections.Counter are consistently handled as iterables in the current statistics module API.
However, it likely makes sense to provide public APIs that support efficiently working with such weight/frequency mappings directly, rather than requiring that they be expanded to a full iterable all the time.
One possibility would be to provide parallel APIs with the _map suffix, similar to the format() vs format_map() distinction in the string formatting APIs.
Off the top of my head, I can think of three APIs:
(1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev
(2) treat mappings as implied (value, frequency) pairs
(3) take an additional argument to switch between unweighted and weighted modes.
I dislike #3, but will consider the others for 3.5.
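As a rough illustration of what option (2) could mean in practice, here is a minimal sketch (this mapping-aware mean is hypothetical and assumes integer frequencies; it is not the module's implementation):

```python
import statistics
from collections.abc import Mapping

def mean(data):
    # Hypothetical option (2): a mapping is treated as {value: frequency}.
    if isinstance(data, Mapping):
        total = sum(x * f for x, f in data.items())
        count = sum(data.values())
        return total / count
    return statistics.mean(data)

print(mean([1, 2, 2, 3]))        # 2.0
print(mean({1: 1, 2: 2, 3: 1}))  # 2.0 -- the same data as a frequency mapping
```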
On 2 February 2014 11:55, Steven D'Aprano <report@bugs.python.org> wrote:
(1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev
This would be my preferred approach. It makes it very clear which functions are available for working with map-style data. It will be clear from both the module documentation and a casual introspection of the module that those APIs are present for those who might want them. Also, apart from mode(), the implementation of each function on map-format data will be completely different from the iterable version, so you'd want to have it as a separate function at least internally anyway.
See also bpo-18844.
-----Original Message----- From: Steven D'Aprano [mailto:report@bugs.python.org] Sent: Sunday, 2 February 2014 12:55 To: wolfgang.maier@biologie.uni-freiburg.de Subject: [bpo-20479] Efficiently support weight/frequency mappings in the statistics module
Steven D'Aprano added the comment:
Off the top of my head, I can think of three APIs:
(1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev
(2) treat mappings as implied (value, frequency) pairs
(2) is clearly my favourite. (1) may work well if only a small fraction of a module's functions need an alternate API. In the statistics module, however, almost all of the current functions could profit from a way to treat mappings specially. In such a case, (1) is prone to create lots of redundancy.
I do not share Oscar's opinion that
apart from mode(), the implementation of each function on map-format data will be completely different from the iterable version, so you'd want to have it as a separate function at least internally anyway.
Consider _sum's current code (docstring omitted for brevity):

```python
def _sum(data, start=0):
    n, d = _exact_ratio(start)
    T = type(start)
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    coerce_types = _coerce_types
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and track the "current" type.
    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)
```
all you'd have to do to treat mappings as proposed here is to add a check for whether we are dealing with a mapping, and in that case, instead of the for loop:
```python
for x in data:
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n
```
use this:
```python
for x, m in data.items():
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n * m
```
and no other changes (though I haven't tested this carefully).
Wolfgang
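One quick way to sanity-check that claim outside the module is a simplified stand-in that uses Fraction for the exact-ratio bookkeeping (freq_sum here is a toy helper, not part of statistics):

```python
from fractions import Fraction

def freq_sum(mapping):
    # Accumulate numerator * frequency per denominator, as in the
    # modified loop above, then combine the partials exactly.
    partials = {}
    for x, m in mapping.items():
        f = Fraction(x)
        partials[f.denominator] = partials.get(f.denominator, 0) + f.numerator * m
    return sum(Fraction(n, d) for d, n in partials.items())

# 0.5*3 + 2*2 == 5.5, computed without expanding the mapping
assert freq_sum({0.5: 3, 2: 2}) == Fraction(11, 2)
```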
Well, I was thinking about frequencies (ints) when suggesting
```python
for x, m in data.items():
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n * m
```
in my previous message. To support weights (float or Rational) this would have to be more sophisticated.
Wolfgang
in my previous message. To support weights (float or Rational) this would have to be more sophisticated.
I guess you'd do:
```python
for x, w in data.items():
    T = _coerce_types(T, type(x))
    xn, xd = exact_ratio(x)
    wn, wd = exact_ratio(w)
    partials[xd * wd] = partials_get(xd * wd, 0) + xn * wn
```
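A quick numeric check of that combining step, with Fraction standing in for _exact_ratio and floats that are exactly representable in binary:

```python
from fractions import Fraction

data = {1.5: 0.25, 2.0: 0.75}  # value -> weight
partials = {}
for x, w in data.items():
    fx, fw = Fraction(x), Fraction(w)
    key = fx.denominator * fw.denominator
    partials[key] = partials.get(key, 0) + fx.numerator * fw.numerator
total = sum(Fraction(n, d) for d, n in partials.items())
assert total == Fraction(15, 8)  # 1.5*0.25 + 2.0*0.75 == 1.875
```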
Variance is only slightly trickier. Median would be more complicated.
I just think that I prefer to know when I look at code that something is being treated as a mapping or as an iterable. So when I look at
```python
d = f(x, y, z)
v = variance_map(d)
```
It's immediately obvious what d is and how the function variance_map is using it.
As well as the benefit of readability there's also the fact that accepting different kinds of input puts strain on any attempt to modify your code in the future. Auditing the code requires understanding at all times that the name "data" is bound to a quantum superposition of different types of object.
Either every function would have to have the same "iterable or mapping" interface or there would have to be some other convention for making it clear which ones do. Perhaps the functions that don't make sense for a mapping could explicitly reject them rather than treating them as an iterable.
I just think it's simpler to have a different function name for each type of input. Then it's clear what functions are available for working with mappings.
If you were going for something completely different then you could have an object-oriented interface where there are classes for the different types of data and methods that do the right thing in each case.
Then you would do
```python
v = WeightedData(d).variance()
```
The ordinary variance() function could just become a shortcut for
```python
def variance(data):
    return SequenceData(data).variance()
```
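A minimal sketch of that layering (SequenceData and WeightedData are hypothetical names; this version simply expands integer frequencies rather than computing from the mapping directly):

```python
import statistics

class SequenceData:
    def __init__(self, data):
        self._data = list(data)

    def variance(self):
        return statistics.variance(self._data)

class WeightedData:
    def __init__(self, mapping):
        # Assumes integer frequencies; a real implementation would
        # work from the mapping without this expansion.
        self._data = [x for x, f in mapping.items() for _ in range(f)]

    def variance(self):
        return statistics.variance(self._data)

def variance(data):
    return SequenceData(data).variance()

assert WeightedData({1: 1, 2: 2, 3: 1}).variance() == variance([1, 2, 2, 3])
```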
My recommendation is to have *weights* as an optional argument:
```python
statistics.mean(values, weights=None)
```
While it is tempting to special case dicts and counters, I got feedback from Jake Vanderplas and Wes McKinney that in practice it is more common to have the weights as a separate list/array/vector.
That API has other advantages as well. For starters, it is a simple extension of the existing API, so it isn't a disruptive change. Also, it works well with mapping views:
```python
statistics.mean(vehicle_sales.keys(), vehicle_sales.values())
```
And the API also helps support use cases where different weightings are being explored for the same population:
```python
statistics.mean(salary, years_of_service)
statistics.mean(salary, education)
statistics.mean(salary, age)
```
Thinking back to my signal processing days, I have to agree that our weightings (filter definitions) were usually separate from our data (live signals). Similarly, systems engineering trade studies all maintained feature weights separately from the assessments of the individual options.
The comment from my original RFE about avoiding expanding value -> weight/frequency mappings "to a full iterable all the time" doesn't actually make much sense in 3.x, since m.keys() and m.values() are now typically able to avoid data copying.
So +1 from me for the separate *weights* parameter, with the m.keys()/m.values() idiom used to handle mappings like Counter.
As another point in favour of that approach, it's trivial to build zero-copy weighted variants on top of it for mappings with cheap key and value views:
```python
def weighted_mean(mapping):
    return statistics.mean(mapping.keys(), mapping.values())
```
By contrast, if the lowest level primitive provided is a mapping based API, then when you do have separate values-and-weights iterables, you're going to have a much harder time avoiding building a completely new container.
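For illustration, a toy version of that layering (the two-argument mean below is hypothetical; statistics.mean itself takes no weights argument):

```python
import statistics
from collections import Counter

def mean(values, weights=None):
    # Hypothetical weights-aware mean: the proposed low-level primitive.
    if weights is None:
        return statistics.mean(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_mean(mapping):
    # Zero-copy wrapper: the key and value views are passed straight through.
    return mean(mapping.keys(), mapping.values())

assert weighted_mean(Counter([1, 2, 2, 3])) == 2.0
```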
Is this proposal still relevant? If so, I would like to work on its implementation.
I think the third proposition, changing the API to take a new weights parameter, is the best, as it does not blindly assume that a tuple is a (value, weight) pair, which might not always be true.
Is this proposal still relevant? If so, I would like to work on its implementation.
The first question is the important one. Writing implementations is usually the easy part. Deciding on whether there is a real need, and creating a usable, extendable, and clean API, is often the hard part. So please don't rush to a PR; that just complicates the important part.
Is this proposal still relevant?
Yes.
As Raymond says, deciding on a good API is the hard part. It's relatively simple to swap a poor implementation for a better one, but backwards compatibility means that changing the API is very difficult.
I would find it very helpful if somebody has time to do a survey of other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, SAS etc.) and see how they handle data with weights:
- what APIs do they provide?
- do they require weights to be positive integers, or do they support arbitrary float weights?
- including negative weights? (what physical meaning does a negative weight have?)
At the moment, a simple helper function seems to do the trick for non-negative integer weights:
```python
def flatten(items):
    for item in items:
        yield from item
```

```
py> data = [1, 2, 3, 4]
py> weights = [1, 4, 1, 2]
py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
2.5
```
In principle, the implementation could be as simple as a single recursive call:
```python
def mean(data, weights=None):
    if weights is not None:
        return mean(flatten([x]*w for x, w in zip(data, weights)))
    # base case without weights is unchanged
```
or perhaps it could be just a recipe in the docs.
I would find it very helpful if somebody has time to do a survey of other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, SAS etc) and see how they handle data with weights.
Numpy has only sporadic support for this. The standard mean function does not have any way to provide weights but there is an alternative called average that computes the mean and has an optional weights argument. I've never heard of average before searching for "numpy weighted mean" just now. Numpy's API often has bits of old cruft from where various numerical packages were joined together so I'm not sure they would recommend their current approach. I don't think there are any other numpy functions for providing weighted statistics.
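For example (np.average is the real numpy function mentioned above):

```python
import numpy as np

data = [1, 2, 3, 4]
np.mean(data)                           # 2.5 -- no weights parameter
np.average(data, weights=[1, 1, 1, 5])  # 3.25 -- the weighted alternative
```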
Statsmodels does provide an API for this as explained here: https://stackoverflow.com/a/36464881/9450991 Their API is that you create an object with data and weights and can then call methods/attributes for statistics.
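Roughly, that API looks like this (DescrStatsW from statsmodels.stats.weightstats, with attribute names as statsmodels documents them):

```python
from statsmodels.stats.weightstats import DescrStatsW

d = DescrStatsW([1, 2, 3, 4], weights=[1, 4, 1, 2])
d.mean  # weighted mean: 2.5
d.std   # weighted standard deviation
d.var   # weighted variance
```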
Matlab doesn't support even weighted mean as far as I can tell. There is wmean on the matlab file exchange:
Sorry, sent too soon...
Matlab doesn't support even a weighted mean, as far as I can tell. There is wmean on the Matlab File Exchange. This is a separate function, wmean(data, weights). It has to be a separate function, though, because it's third-party code, so the author couldn't change the main mean function.
R ships with a weighted.mean function but I think for standard deviation you need third party libs.
It was only a quick survey, but the main impression I get is that providing an API for this is not that common. The only good-looking API is the statsmodels one.
Here is some further information on weights in statistics in general, and SAS and Stata specifically:
https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html
Quote:

    use the FREQ statement to specify integer frequencies for repeated observations. Use the WEIGHT statement when you want to decrease the influence that certain observations have on the parameter estimates.
http://support.sas.com/kb/22/600.html
https://www.stata.com/manuals13/u20.pdf#u20.23
Executive summary:
- Stata defines four different kinds of weights;
- SAS defines two, WEIGHT and FREQ (frequency);
- SAS truncates FREQ values to integers, with zero or negative meaning that the data point is to be ignored;
- using FREQ is equivalent to repeating the data points: in Python terms, mean([1, 2, 3, 4], freq=[1, 0, 3, 1]) would be equivalent to mean([1, 3, 3, 3, 4]);
- weights in SAS are implicitly normalised to sum to 1, but some functions allow you to normalise to sum to the number of data points instead, because it sometimes makes a difference.
It isn't clear to me what the physical meaning of weights in SAS actually is. The documentation is unclear; it *could* be as simple as the definition of weighted mean here:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
but how that extends to more complex SAS functions is unclear to me.
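For reference, the definition at that link is the familiar one:

    \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

so, for example, values (1, 2) with weights (1, 3) give a weighted mean of (1 + 6)/4 = 1.75.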
(And for what it's worth, I don't think SAS's MEAN function supports weights at all. Any SAS users here who could comment?)
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/stevendaprano'
closed_at = None
created_at =
labels = ['3.8', 'type-feature', 'library']
title = 'Efficiently support weight/frequency mappings in the statistics module'
updated_at =
user = 'https://github.com/ncoghlan'
```
bugs.python.org fields:
```python
activity =
actor = 'steven.daprano'
assignee = 'steven.daprano'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'ncoghlan'
dependencies = ['20478']
files = []
hgrepos = []
issue_num = 20479
keywords = []
message_count = 15.0
messages = ['209931', '209973', '209975', '209985', '210038', '210107', '210108', '305166', '305171', '333978', '334037', '334039', '334101', '334102', '334197']
nosy_count = 7.0
nosy_names = ['rhettinger', 'gregory.p.smith', 'ncoghlan', 'steven.daprano', 'oscarbenjamin', 'wolma', 'remi.lapeyre']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue20479'
versions = ['Python 3.8']
```