Sample weight support - Githubissues

kmedved commented 3 years ago

Would it be possible or sensible to add support for sample weights (at the observation level) to this package? Most scikit-learn estimators allow the user to pass a sample_weight into the .fit() call (e.g., linear-regression or LightGBM). This is a key characteristic of many regression problems.

Normally this would be pretty simple to implement by just allowing a user to pass a sample_weight into the fit call. But given a Bayesian Bootstrap is already relying on weighting as opposed to resampling, maybe this doesn't make sense in this context.

Tagging @JulianWgs in case he has any thoughts.

Thanks!

lmc2179 commented 3 years ago

Hi! Can you perhaps give an example of the kind of problem you'd like to solve with this? It's not immediately clear what you have in mind, but maybe that will make it a little more clear to me.

kmedved commented 3 years ago

Sure.

I work a lot with sports data (and in fact found this package through this: http://savvastjortjoglou.com/nfl-bayesian-bootstrap.html), where it is common for each row to represent a single game or season by a player. Since players may play different numbers of games or minutes, I don't want to weight all observations equally when fitting a model. Typically I would handle this by passing a sample_weight into the .fit call of the model (where the weight is the number of games/minutes played).

Other uses include environmental sensor data, where some sensors may collect aggregated data every X days, but X varies between sensors. In such cases, the loss associated with rows in the data which represent fewer days should receive proportionately less weight than rows representing more data.

Finally, it's fairly common to use sample weights in time-series analysis, to apply an exponential decay weight to older observations in proportion to how old they are (thus downweighting their importance in the model, without dropping them entirely). This would likewise normally be done by passing a sample_weight into the fit call.
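
A minimal sketch of such decay weights, assuming a half-life parametrization. The function name and the half-life value are made up for illustration and do not come from any library:

```python
import numpy as np

def exponential_decay_weights(ages, half_life):
    """Weight 1.0 at age 0, halving every `half_life` units of age."""
    ages = np.asarray(ages, dtype=float)
    return 0.5 ** (ages / half_life)

# Hypothetical example: observations that are 0, 1, 2 and 4 seasons old.
ages = np.array([0, 1, 2, 4])
weights = exponential_decay_weights(ages, half_life=2.0)
print(weights)  # 1.0, about 0.707, 0.5, 0.25
```

The resulting array is what would be passed as sample_weight in the fit call.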

Under the hood, what scikit-learn estimators are doing is just multiplying the loss associated with each row by the corresponding sample weight. So the math is very simple. But I am somewhat uncertain as to how this would interact with the sampling being done by the Bayesian Bootstrap already.
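
To make the "multiply each row's loss by its weight" idea concrete, here is a plain-NumPy sketch for least squares. It is an illustration of the principle, not scikit-learn's actual code, and the helper name is hypothetical. With integer weights, the weighted fit should match an unweighted fit on data where each row is repeated that many times:

```python
import numpy as np

def weighted_least_squares(X, y, sample_weight):
    """Minimise sum_i w_i * (y_i - x_i @ beta)**2 by scaling rows by sqrt(w_i)."""
    root_w = np.sqrt(np.asarray(sample_weight, dtype=float))
    beta, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])  # intercept + one feature
y = rng.normal(size=6)
w = np.array([1, 3, 2, 1, 1, 2])  # made-up integer weights, e.g. games played

beta_weighted = weighted_least_squares(X, y, w)

# Same fit obtained by repeating row i exactly w_i times and using no weights.
X_rep, y_rep = np.repeat(X, w, axis=0), np.repeat(y, w)
beta_repeated, *_ = np.linalg.lstsq(X_rep, y_rep, rcond=None)

print(np.allclose(beta_weighted, beta_repeated))
```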

Thanks for the quick reply, and I appreciate any help.

JulianWgs commented 3 years ago

I think the idea is very good, but I didn't find a way to represent this mathematically. The obvious way would be to just multiply the Bayesian bootstrap weights by the model weights and then rescale the resulting weights so that they sum to 1. Is this clean? Are there any papers out there describing something similar? I wouldn't implement something which ought to work, but which is not backed by math and papers.

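import numpy as np

# Bayesian bootstrap weights and user-supplied model weights, each summing to 1
bb_weights = np.array([0.25, 0.5, 0.25])
model_weights = np.array([0.25, 0.65, 0.1])

# Compare floats with a tolerance rather than exact equality
assert np.isclose(np.sum(bb_weights), 1)
assert np.isclose(np.sum(model_weights), 1)

# The elementwise product no longer sums to 1 ...
weights = bb_weights * model_weights
assert not np.isclose(np.sum(weights), 1)
print(np.sum(weights))

# ... so rescale it
weights /= np.sum(weights)
print(weights)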
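
Extending the snippet above to many replications, a rough sketch of how the product-and-rescale scheme could produce posterior draws of a weighted mean. The function name and the data are hypothetical, and this is not part of the bayesian_bootstrap package; it only assumes that the bootstrap weights come from a uniform Dirichlet distribution:

```python
import numpy as np

def weighted_bb_mean(x, sample_weight, n_replications, seed=None):
    """Posterior draws of a weighted mean under the product-and-rescale scheme."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    # One row of Bayesian bootstrap weights per replication, Dirichlet(1, ..., 1).
    bb_weights = rng.dirichlet(np.ones(len(x)), size=n_replications)
    combined = bb_weights * sample_weight            # broadcast over replications
    combined /= combined.sum(axis=1, keepdims=True)  # each row sums to 1 again
    return combined @ x

x = np.array([10.0, 12.0, 30.0])                 # made-up observations
model_weights = np.array([0.25, 0.65, 0.1])
draws = weighted_bb_mean(x, model_weights, n_replications=5000, seed=0)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))
```

Each draw is a convex combination of the observations, so all draws stay between the smallest and largest value of x.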

kmedved commented 3 years ago

That was my instinct as well @JulianWgs - that the same multiplication could be done on top of the bootstrap weight, and I don't see any obvious issues.

But I am unaware of any literature on this point either.

lmc2179 commented 3 years ago

This makes sense to me, I think. In the usual nonparametric bootstrap case, it seems reasonable to resample your data points along with their weights, and the simple scheme here should give you a smoothed-out version of that (and the BB is just a smoothed-out standard bootstrap). It seems like a reasonable approach from my point of view - but I also don't know of any literature that specifically addresses this topic.
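
A sketch of that comparison on simulated data, with made-up weights. Resampling rows together with their weights is the same as multiplying each weight by the number of times its row was drawn, and the Bayesian bootstrap swaps those integer counts for Dirichlet weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_replications = 50, 4000
x = rng.normal(size=n)
sample_weight = rng.uniform(0.5, 2.0, size=n)  # made-up observation weights

# Classical bootstrap: resampling (x_i, w_i) pairs is equivalent to scaling
# each w_i by how often row i was drawn.
counts = rng.multinomial(n, np.full(n, 1.0 / n), size=n_replications)
cw = counts * sample_weight
classical = cw @ x / cw.sum(axis=1)

# Bayesian bootstrap: replace the integer counts with Dirichlet(1, ..., 1) weights.
dirichlet = rng.dirichlet(np.ones(n), size=n_replications)
dw = dirichlet * sample_weight
bayesian = dw @ x / dw.sum(axis=1)

print(classical.mean(), classical.std())
print(bayesian.mean(), bayesian.std())
```

If the reasoning holds, the two sets of draws should have similar centers and spreads, both close to the plain weighted mean of x.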

kmedved commented 3 years ago

Just found a discussion of this issue here: https://stats.stackexchange.com/questions/88615/reweighting-importance-weighted-samples-in-bayesian-bootstrap.

If I'm reading this correctly (and I may not be - I'm a bit over my head here), then I think @JulianWgs's solution is correct. (You need to read down to his 'Edit', where he explains the solution.)

JulianWgs commented 3 years ago

Thank you for your research. I wouldn't base the implementation on a Stack Exchange question, though. It's weird that we can't find anything on the topic. Maybe we are not searching with the right keywords? Maybe we could find some information on how to deal with multiple weights. For example, you ask two experts for their weights and then you need to combine them.

If there's actually no research on the topic, I would start by testing some assumptions: 1) Double weighting with two random and different Dirichlet distributions should result in the same posterior. 2) The most frequent value of the posterior should always equal the frequentist estimate of the non-Bayesian version (with the model weights applied to both). 3) Setting some model weights to zero and the rest equal should result in the same posterior as an unweighted Bayesian bootstrap without those samples.
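
A rough simulation sketch of checks 2 and 3 for the simplest statistic, a weighted mean. The helper name and data are hypothetical, and the posterior mean is used as a stand-in for the most frequent value. As far as I can tell, check 3 should also hold in theory, since a uniform Dirichlet vector restricted to a subset and renormalized is again uniform Dirichlet on that subset:

```python
import numpy as np

def weighted_bb_mean(x, sample_weight, n_replications, rng):
    """Posterior draws of a weighted mean: Dirichlet weights times sample weights."""
    bb = rng.dirichlet(np.ones(len(x)), size=n_replications)
    combined = bb * sample_weight
    return combined @ x / combined.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=40)
keep = np.arange(40) < 30            # give the last 10 rows zero weight
sample_weight = keep.astype(float)   # 1 for kept rows, 0 for the rest

with_zeros = weighted_bb_mean(x, sample_weight, 20000, rng)
dropped = weighted_bb_mean(x[keep], np.ones(keep.sum()), 20000, rng)

# Check 3: both posteriors should agree up to Monte Carlo noise.
print(with_zeros.mean(), dropped.mean())
print(with_zeros.std(), dropped.std())

# Check 2: compare the frequentist weighted mean with the posterior mean.
print(np.average(x, weights=sample_weight), with_zeros.mean())
```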

For me the hardest thing to wrap my head around is the influence of the model weights on the width of the posterior.
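
One way to get a feel for this is a small simulation. The sketch below compares equal weights with very uneven weights on the same made-up data, and reports Kish's effective sample size, (sum w)^2 / sum w^2, next to the posterior standard deviation. Using Kish's formula here is my own assumption about a useful diagnostic, not something established in this thread:

```python
import numpy as np

def kish_effective_sample_size(sample_weight):
    """Kish's approximation: (sum w)^2 / sum(w^2)."""
    w = np.asarray(sample_weight, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
equal = np.ones(n)
uneven = rng.exponential(size=n) ** 2  # made-up, heavily skewed weights

posterior_std = {}
for name, w in [("equal", equal), ("uneven", uneven)]:
    bb = rng.dirichlet(np.ones(n), size=5000)
    combined = bb * w
    draws = combined @ x / combined.sum(axis=1)
    posterior_std[name] = draws.std()
    print(name, kish_effective_sample_size(w), posterior_std[name])
```

In this setup the uneven weights concentrate the estimate on fewer rows, so I would expect a smaller effective sample size and a wider posterior.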

I currently don't have the time to execute the tests. What do you think of them?

Greetings Julian

kmedved commented 3 years ago

Good ideas. Approach 2 seems to be the most straightforward to test. I will give it a try and see what it looks like with a high enough number of samples.

lmc2179 / bayesian_bootstrap

Sample weight support #17