pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

weighted mean #10030

Open bgrayburn opened 9 years ago

bgrayburn commented 9 years ago

A "weights" labeled parameter in the df.mean method would be extremely useful. In numpy this functionality is provided via np.average instead of np.mean which I'm assuming is how similar functionality would be added to pandas.

Example of the requested feature:

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [.5, 0, .5]
np_avg = np.average(a, weights=w, axis=1)
# output -> array([ 4., 10.])
pd_avg = pandas.DataFrame(a).mean(weights=w, axis=1)
# desired output -> Series with entries 4 and 10

If this is a desired feature, I'll implement it and submit a pull request. If this already exists somewhere and I've overlooked it, my apologies; I've tried to look thoroughly.

shoyer commented 9 years ago

Why not just write:

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)
benjello commented 9 years ago

I agree with @bgrayburn: weighted statistics would be very useful in pandas. One can use statsmodels, but extending DataFrame methods to accept weights would be very useful for people working with weighted survey data.

bgrayburn commented 9 years ago

@shoyer: I agree your code snippet accomplishes (nearly) the same thing from a functional perspective, but from a code-readability standpoint, and a code-reuse standpoint, including a weights parameter seems optimal. Also, numpy's weights parameter automatically normalizes the weights vector, which is extremely useful.
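
To illustrate the normalization point, here is a quick sketch (the weights below deliberately do not sum to 1):

import numpy as np
import pandas as pd

a = np.array([[2, 4, 6], [8, 10, 12]])
w = [1, 0, 1]  # unnormalized weights

# np.average normalizes the weights automatically:
np.average(a, weights=w, axis=1)  # -> array([ 4., 10.])

# the multiply-then-mean workaround does not, so the weights
# have to be normalized by hand:
(pd.DataFrame(a) * (np.array(w) / np.sum(w))).sum(axis=1)  # -> 4.0, 10.0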

shoyer commented 9 years ago

Okay, fair enough. This seems within scope for pandas. We recently added a sample method that includes a similar weights argument (#9666), which might be useful as a starting point.

benjello commented 9 years ago

@shoyer asked elsewhere (#10000) about a list of methods that could be enhanced by a 'weighted' version. Almost all the statistical functions that can be found here are candidates. I can also think of describe, value_counts, qcut, hist and margin computations in pivot tables.

shoyer commented 9 years ago

Again, I think we're open to most of these changes (all of which are backward compatible with weights=None). The main obstacle is that we need implementations, documentation, and benchmarks to show we aren't slowing anything down. PRs would be welcome. It would also be worth checking whether any of these could be pushed upstream to NumPy.

bgrayburn commented 9 years ago

@shoyer @benjello sorry for the delay on this, still planning on submitting a PR, I'll be coding this weekend.

In regards to the numpy question: for weighted means, numpy uses .average, which you can see here. My plan was to implement

pd_avg = (np.array(w) * pandas.DataFrame(a)).mean(axis=1)

pretty much as written, by multiplying the input dataframe's columns by the weight vector. Alternatively, we could call np.average when a weights parameter is present, OR (3rd option) we could implement a pandas.DataFrame(a).average(weights=[...]) to mirror numpy.

One last question: should weighting be applicable in either axis=0 or axis=1 mode? I'm assuming yes, but wanted to check. A strawman that handles both axes is sketched below.
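
Here is a strawman for option 1 (names are just for illustration, not a final API):

import numpy as np
import pandas as pd

def weighted_mean(df, weights, axis=0):
    # normalize the weights first, mirroring np.average
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if axis == 1:
        return (df * w).sum(axis=1)       # weights align with the columns
    return df.mul(w, axis=0).sum(axis=0)  # weights align with the rows

weighted_mean(pd.DataFrame([[2, 4, 6], [8, 10, 12]]), [.5, 0, .5], axis=1)
# -> 0     4.0
#    1    10.0
# dtype: float64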

Let me know your preferences, or if somehow this should be incorporated with a larger change as mentioned above.

Best

jreback commented 9 years ago

why not just add a weights keyword to .mean?

that's much more consistent with the API; I suspect we don't implement average because it's just confusing

jreback commented 9 years ago

and this needs to trickle down to nanops.py, where all of the actual computation is done - it handles many different dtypes

benjello commented 9 years ago

@bgrayburn: I think @jreback's suggestion is worth following: a mean with weights is exactly what you're expecting from a weighted mean

mattayes commented 9 years ago

+1 Would make working with microdata samples (think PUMS) so much nicer.

shoyer commented 9 years ago

I agree that it would be better to incorporate this directly into aggregation functions like mean and var instead of adding specialized methods like average.

johne13 commented 8 years ago

+1 to @mattayes & @shoyer -- when working with weighted data you pretty much want to weight EVERY statistic and graph that you generate. Some weighting option is a necessity if you're working with such data.

Adding a weights argument to as many functions as possible over time sounds like the way to go to the extent it isn't going to be handled in numpy/statsmodels/matplotlib.

Stata, for example, allows a weight option for practically all functions. I use Stata's tabstat with the weight option very frequently, and at the moment there isn't any good analog in pandas that I know of.

johne13 commented 8 years ago

A possible complication to consider: there are potentially different kinds of weights. Stata, for example, defines 4 types of weights: frequency, analytical, probability, and importance (although the last one is just an abstract catchall). [http://www.stata.com/help.cgi?weight]

I suspect that in this thread most people have frequency weights in mind, but it might be necessary to clarify this. Also, the distinction probably won't matter for something like mean or median, but it could affect something like variance, as the sketch below illustrates.
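
A small numeric illustration, following Stata's definitions (fweights treat weights as repeat counts; aweights are rescaled to sum to n): the weighted means agree, but the variances differ.

import numpy as np

x = np.array([1.0, 2.0, 4.0])
w = np.array([2.0, 1.0, 1.0])

mean = np.average(x, weights=w)  # 2.0 under both interpretations
ss = np.sum(w * (x - mean) ** 2)

var_freq = ss / (w.sum() - 1)    # fweights: 6.0 / 3 = 2.0
w_a = w * len(x) / w.sum()       # aweights: rescale to sum to n = 3
var_analytic = np.sum(w_a * (x - mean) ** 2) / (len(x) - 1)  # 4.5 / 2 = 2.25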

shoyer commented 8 years ago

There has been some recent discussion about implementing efficient algorithms for weighted partition (e.g., to do weighted median) upstream in NumPy, as well: https://mail.scipy.org/pipermail/numpy-discussion/2016-February/075000.html

In any case, a first draft that uses sorting to do weighted median would still be valuable.

max-sixty commented 8 years ago

From https://github.com/pydata/xarray/pull/650:

How about designing this as a groupby-like interface? In the same way as .rolling (or .expanding & .ewm in pandas)?

So for example ds.weighted(weights=ds.dim).mean().

And then this is extensible, clean, pandan-tic.

jreback commented 8 years ago

what other things would you do with a .weighted(..).mean() interface?

IOW what other parameters would it accept aside from the actual weights?

shoyer commented 8 years ago

@jreback I think .weighted() would only accept weights, which could be either an array or a callable of the usual form (lambda df: ....). But the WeightedMethods class could also expose weighted implementations of other methods, such as std, var, median, sum, value_counts, hist, etc. I would even consider moving over sample and deprecating the weights argument.

jreback commented 8 years ago

@shoyer I can see some syntax from that e.g.

df.weighted('A').B.mean() is pretty clear

though df.B.mean(weights=df.A) is just as clear, so I'm looking for a case where this is significantly nicer.

any idea how R does this? (Julia?)

benjello commented 8 years ago

I used R's wtd.stats functions (from the Hmisc package):

wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)
wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1), 
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'), 
             normwt=FALSE, na.rm=TRUE)
wtd.Ecdf(x, weights=NULL, 
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'), 
         normwt=FALSE, na.rm=TRUE)
wtd.table(x, weights=NULL, type=c('list','table'), 
          normwt=FALSE, na.rm=TRUE)
wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.loess.noiter(x, y, weights=rep(1,n), robust=rep(1,n), 
                 span=2/3, degree=1, cell=.13333, 
                 type=c('all','ordered all','evaluate'), 
                 evaluation=100, na.rm=TRUE)

jreback commented 8 years ago

@benjello hmm, that's interesting.

shoyer commented 8 years ago

though df.B.mean(weights=df.A) is just as clear, so looking for a case where this is significantly nicer.

On the face of it, this does look as nice. But from an API design perspective, adding a keyword argument for weights is much less elegant.

Being "weighted" is orthogonal to the type of statistical calculation. With this proposal, instead of adding the weights keyword argument to N different methods, we define a single weighted method, and add statistical methods to it that exactly match the signature of the same methods on DataFrame/Series. This makes it obvious that all these methods share the same approach, and keeps method signatures from growing additional arguments that trigger entirely independent code paths (which is a sign of code smell).

Separately: perhaps weightby is a slightly better name than weighted? It suggests more similarity to groupby.
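
A rough sketch of the shape this could take (all names here are hypothetical):

import numpy as np
import pandas as pd

class WeightBy:
    # hypothetical accessor: holds the object plus normalized weights and
    # exposes weighted versions of the usual statistical methods
    def __init__(self, obj, weights):
        w = np.asarray(weights, dtype=float)
        if (w < 0).any():
            raise ValueError("weights must be non-negative")
        self._obj = obj
        self._weights = w / w.sum()

    def mean(self):
        return self._obj.mul(self._weights, axis=0).sum()

    def var(self):
        mu = self.mean()
        return ((self._obj - mu) ** 2).mul(self._weights, axis=0).sum()

df = pd.DataFrame({'B': [1.0, 2.0, 3.0, 4.0]})
WeightBy(df, [1, 2, 3, 4]).mean()  # B    3.0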

jreback commented 8 years ago

yeah, I think I could get behind .weightby(...). This is not that difficult as most of the machinery already exists... so if someone wants to do a prototype...!

kawochen commented 8 years ago

Being "weighted" is orthogonal to the type of statistical calculation.

Can you expand on that?

How are we defining weights here? Are negative weights accepted (I think the domain could depend on the function)? Must weights be of the same length as the data points? Does this API do any validation or is that deferred to the stats functions (if weights can be a lambda, I imagine it's the latter)?

shoyer commented 8 years ago

How are we defining weights here? Are negative weights accepted (I think the domain could depend on the function)? Must weights be of the same length as the data points?

My vision for the weights argument is quite similar to how it is currently defined on DataFrame.sample: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

So weights need to be non-negative numbers that are alignable with the object being weighted. They do not need to be pre-normalized, though perhaps there should be a keyword argument on weightby to skip that check (if it's actually a real performance concern). We also might add an axis argument, which would select the axis over which weights are applied. This would let you weight the columns of a DataFrame differently, for example.

Does this API do any validation or is that deferred to the stats functions (if weights can be a lambda, I imagine it's the latter)?

Ideally we could share validation between the different methods on weightby. I don't think it particularly matters whether you validate before or after calling the submethod on the WeightByMethods object. A sketch of such shared validation is below.
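
A hypothetical shared helper, modeled on the rules DataFrame.sample applies (a sketch, not actual pandas code):

import pandas as pd

def _validate_weights(obj, weights, axis=0):
    # align with the weighted axis, check sign and missing values, normalize
    w = weights if isinstance(weights, pd.Series) else pd.Series(weights)
    w = w.reindex(obj.axes[axis])  # assumes weights carry a compatible index
    if w.isna().any():
        raise ValueError("weights must be alignable with the object")
    if (w < 0).any():
        raise ValueError("weights must be non-negative")
    if w.sum() == 0:
        raise ValueError("weights sum to zero and cannot be normalized")
    return w / w.sum()  # callers need not pre-normalize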

kawochen commented 8 years ago

I see, that makes sense. Places where I have encountered negative 'weights' include the calculation of portfolio returns (a linear, but not necessarily convex, combination of individual security returns), but that kind of 'weight' is of course less density-like.

randomgambit commented 8 years ago

hi guys,

I think it's a great idea! However, when one looks more deeply, things are not as simple as expected. For instance, these are the kinds of weights used in Stata:


    1.  fweights, or frequency weights, are weights that indicate the number of duplicated observations.

    2.  pweights, or sampling weights, are weights that denote the inverse of the probability that the
        observation is included because of the sampling design.

    3.  aweights, or analytic weights, are weights that are inversely proportional to the variance of an
        observation; that is, the variance of the jth observation is assumed to be sigma^2/w_j, where w_j are
        the weights.  Typically, the observations represent averages and the weights are the number of elements
        that gave rise to the average.  For most Stata commands, the recorded scale of aweights is irrelevant;
        Stata internally rescales them to sum to N, the number of observations in your data, when it uses them.

    4.  iweights, or importance weights, are weights that indicate the "importance" of the observation in some
        vague sense.  iweights have no formal statistical definition; any command that supports iweights will
        define exactly how they are treated.  Usually, they are intended for use by programmers who want to
        produce a certain computation.

See also here for a more in-depth comparison: http://www.ats.ucla.edu/stat/sas/faq/weights.htm

The bottom line: it is really worth having a good discussion about the weights before including them.

Do you want to be super flexible and put the responsibility on the user? Or do you want just simple, plain non-negative weights (hence putting a strong limitation on what can be done)?

It all depends on how many keywords you want to add to the pd.weight function.

jreback commented 8 years ago

@randomgambit well a much bigger issue is someone volunteering to implement this. Having the discussion is fine. But someone would need to step up.

bgrayburn commented 8 years ago

@jreback, yeah sorry, I said I'd do this quite a while ago, but life's gotten the better of me. I'll still do it if I can find time, but anybody else willing should definitely move this forward.

jreback commented 8 years ago

@bgrayburn haha, not picking on you :>

randomgambit commented 8 years ago

Hehe, fair enough @jreback, but as I told you, I don't have the skill to help with the coding. I can only share my remarks (I am an econ PhD who does academic research) if you believe that's useful.

trentoliphant commented 8 years ago

I may not be the best person to tackle this, but I am especially interested in having a weighted mean for groupby, specifically so that it can be included in a groupby().agg() dictionary. I don't want to duplicate anything that anyone else has done on this, so if anyone can point me to other resources or the specific parts of the code where I should start looking, I would be grateful.

randomstuff commented 7 years ago

An issue worth considering is that .weightedby() might need to be combined with .resample(), .groupby() and friends: data.resample("1D").weightedby("count")["size"].mean(), data.groupby("category").weightedby("count")["size"].mean().

jreback commented 7 years ago

ok I have an implementation here

Works nicely (haven't fully tested the API yet, but .sum()/.mean() ok for now).

In [1]: df = DataFrame({'A': [1, 2, 3, 4],
   ...:                 'B': [1, 2, 3, 4]})

In [2]: df
Out[2]: 
   A  B
0  1  1
1  2  2
2  3  3
3  4  4

In [3]: w = df.weightby('A')

In [4]: w.sum()
Out[4]: 
B    3.0
dtype: float64

In [5]: w.mean()
Out[5]: 
B    0.75
dtype: float64

In [6]: w._weights
Out[6]: array([ 0.1,  0.2,  0.3,  0.4])

In [7]: w.sample(2)
Out[7]: 
   B
2  3
1  2

cc @josef-pkt

so I am just doing values * normalized_weights, then running through our routines for sum/mean.

how about the higher moments: var, skew, kurt? One possible shape is sketched below.
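
With the same normalized-weights approach, var could look like this (a sketch; which weight interpretation to use for a ddof correction is exactly the fweights/aweights question raised above):

import numpy as np

def weighted_var(x, w):
    # biased (ddof=0) weighted variance with normalized weights
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu = np.sum(w * x)
    return np.sum(w * (x - mu) ** 2)

weighted_var([1, 2, 3, 4], [1, 2, 3, 4])  # -> 1.0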

jorisvandenbossche commented 7 years ago

(@jreback sorry for being late, I already wrote this a few days ago but forgot to press 'comment', and now I see you opened a PR)

I personally find the weightby syntax a bit awkward, and I think a weights= argument can be easier to grasp for users. But I see there were some advocates for this syntax above.

I also find the 'reference' (as a similar API) to groupby/resample/rolling not really a plus. In my mind, this is a fully different concept from groupby/resample/rolling, which all involve splitting the data into groups, applying a function, and reassembling the results in a new frame (how the data is split and whether there is overlap between groups of course differs between the methods).

randomgambit commented 7 years ago

Hello there! Amazing work by @jreback, as always. For what it's worth (i.e., in case you give a damn about my piece of advice), I also agree with @jorisvandenbossche that the syntax here is a bit unfortunate.

The concept of weighting is computation-dependent: you might want a weighted mean, but also a sum without weights.

So something like df.sum(weights=df.myweights) or df.groupby('jeff').mean(weights=df.myweights) appears more natural to me.

jreback commented 7 years ago

I realized I commented on the PR (but put below as well)

see here: https://github.com/pandas-dev/pandas/pull/15031#issuecomment-269992903

This is quite straightforward to change to an API like the following (in fact, the actual implementation already works this way, but I pass a hidden kwarg _weights in):

df.sum(..., weights=...)
df.groupby(...).sum(..., weights=...)
df.sample(..., weights=...) (would stay the same)

I would rip out the weight validation code and put it elsewhere (as a function).

The only downside is that it would then be awkward to specify different kinds of weights (like the link in the top section), because we would need to add other kwargs like iweights, aweights, etc., which I think is not nice.

chris-b1 commented 7 years ago

Interesting package addressing weighted calculations here: https://github.com/jsvine/weightedcalcs. It has an API along the lines of weightby.

rbiswas4 commented 7 years ago

I am looking for weighted means in groupby aggregates, where the weights are another column of the dataframe. Does this thread include support for this? To make my question explicit:

lcc:

snid    mjd band    flux    weights
obsHistID                   
609734  141374000   60493.416959    lsstg   2.825651e-09    6.442312e+20
609733  141374000   60493.416511    lsstg   2.893961e-09    5.962141e+20
609732  141374000   60493.416062    lsstg   2.834461e-09    6.590458e+20
....
611542  141374000   60495.426047    lssti   6.722778e-09    1.307280e+20
610790  141374000   60494.432074    lsstz   6.619978e-09    6.156260e+19

and I do operations like:

grouped = lcc.groupby(['snid', 'band', 'night'])
res = grouped.agg(dict(flux=np.mean))

What I really want is a fast

res = grouped.agg(dict(flux=weightedmean(weights='weights')))

The real problem is that this requires two columns as aggregate input. I have looked at workarounds like the ones suggested here, but I find them slow:

In my case, when the original dataframe has about ~10000 rows and 10000 groups, a direct np.mean aggregate times to a best of 1.6 ms per loop, while running through the workaround takes ~2 min per loop. Is there an implementation/workaround that will speed this up for me?

ilemhadri commented 7 years ago

Same problem here. At the moment I am using .apply as a workaround, but this is slow. Being able to call .agg on groupby objects using more than one column as input would be highly appreciated.

chris-b1 commented 7 years ago

You can improve performance significantly by not using apply, but instead building the calc out of existing vectorized ops. Example for mean below.

In [49]: df = pd.DataFrame({'id': np.repeat(np.arange(100, dtype='i8'), 100),
    ...:                    'v': np.random.randn(10000),
    ...:                    'w': np.random.randn(10000)})

In [46]: %timeit mean1 = df.groupby('id').apply(lambda x: (x['v'] * x['w']).sum() / x['w'].sum())
32.4 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %%timeit
    ...: df['interim'] = df['v'] * df['w']
    ...: gb = df.groupby('id')
    ...: mean2 = gb['interim'].sum() / gb['w'].sum()
1.21 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [48]: np.allclose(mean1, mean2)
Out[48]: True

johne13 commented 6 years ago

This is complicated between weight types, API, etc., but just to chime in while it's fresh in my mind:

From a statistical point of view, basic statistics seem to break down into 2 categories:

  1. means and order statistics (min/max/median/quantiles)
  2. higher order moments (std dev/variance/skew/kurtosis)

I believe the first category is very straightforward to handle. I'm an amateur statistician at best, but I think there is really only one basic way to calculate mean/max/median, etc.

Conversely, std dev & variance are a lot more complicated than you might think -- not that the math is that hard, but "std dev" can mean more than one thing here. Really great article that lays out the issues: https://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/

For example, if you run these two commands in Stata:

sum x [fw=weight], detail
sum x [aw=weight], detail

you'll get the same results for all stats except std & var.

Also, to the extent pandas is handing this sort of thing off to statsmodels, they do have a library that does weighting for most basic stats (although min & max seem to be missing). See this link for more (a recent answer I wrote at SO using the statsmodels library):

https://stackoverflow.com/questions/17689099/using-describe-with-weighted-data-mean-standard-deviation-median-quantil/47368071#47368071

randomgambit commented 6 years ago

Look, I'm sorry, but this is largely incorrect. Weighted statistics are really basic stuff.

johne13 commented 6 years ago

@randomgambit

OK, then which of these is correct?

sum x [fw=weight], detail
sum x [aw=weight], detail

kdebrab commented 6 years ago

FWIW: I needed a resampled weighted quantile and implemented it as follows.

import pandas as pd

def resample_weighted_quantile(frame, weight=None, rule='D', q=0.5):
    if weight is None:
        return frame.resample(rule).apply(lambda x: x.quantile(q))
    else:
        data = [series.resample(rule).apply(_weighted_quantile, weight=weight[col], q=q).rename(col)
                for col, series in frame.items()]
        return pd.concat(data, axis=1)

def _weighted_quantile(series, weight, q):
    # sort the values, accumulate the aligned weights, and return the first
    # value whose cumulative weight reaches the q-th fraction of the total
    series = series.sort_values()
    cumsum = weight.reindex(series.index).cumsum()
    cutoff = cumsum.iloc[-1] * q
    return series[cumsum >= cutoff].iloc[0]

frame and weight are dataframes with same index and columns. It could probably be optimized, but at least it works.
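
For example (made-up data, just to show the expected shapes):

import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-01', periods=48, freq='H')
frame = pd.DataFrame({'x': np.random.rand(48)}, index=idx)
weight = pd.DataFrame({'x': np.random.rand(48)}, index=idx)

resample_weighted_quantile(frame, weight, rule='D', q=0.5)  # one row per day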

Heuertje commented 6 years ago

This would be a great addition to pandas!

JacekPliszka commented 3 years ago

I would love to have such functionality; this is one of the things I sorely miss in pandas in comparison to R. My use case needs different weights for different columns, so something like this would be great:

df.groupby(...).mean(weights={'a': df.d, 'b': df.e})
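
In the meantime, a workaround sketch using the interim-column trick from earlier in this thread (the helper name is made up):

import pandas as pd

def grouped_weighted_mean(df, by, weights):
    # weights: dict mapping each value column to its own weight Series
    out = {}
    for col, w in weights.items():
        tmp = pd.DataFrame({'wv': df[col] * w, 'w': w})
        g = tmp.groupby(df[by])
        out[col] = g['wv'].sum() / g['w'].sum()
    return pd.DataFrame(out)

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'a': [1.0, 2.0, 3.0],
                   'b': [4.0, 5.0, 6.0], 'd': [2.0, 1.0, 1.0], 'e': [1.0, 3.0, 1.0]})
grouped_weighted_mean(df, 'g', {'a': df.d, 'b': df.e})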

jreback commented 3 years ago

this is already available in master / 1.3, see https://pandas.pydata.org/pandas-docs/dev/user_guide/window.html?highlight=weighted_mean#overview, look for weighted_mean.

JacekPliszka commented 3 years ago

Thank you, though I have one small problem with that: I would like to apply it after a groupby that does not have a fixed size.

MaxGhenis commented 3 years ago

I'm involved in a project to extend pandas to include weights: https://github.com/PSLmodels/microdf

This allows normal interaction with many pandas functions, including mean and groupby, after defining weights upfront, e.g. (notebook):

import microdf as mdf

d = mdf.MicroDataFrame({"x": [1, 2, 3], "g": ["a", "a", "b"]}, weights=[2, 1, 1])
d.groupby("g").x.mean()

Result:

g
a    1.333333
b    3.000000
dtype: float64

It currently only supports one set of weights, so @JacekPliszka's request would require a post-hoc merge.

Of course, we'd gladly shut it down if pandas included it natively at some point :)