Feature: resampling to uniform distribution

umami-hep / umami-preprocessing

UPP: Umami PreProcessing


Feature: resampling to uniform distribution #32

Open bbullard opened 1 year ago

bbullard commented 1 year ago

There are two resampling methods already implemented (see documentation) that allow different populations of jets to be resampled to match a reference population distribution.

For the task of jet pt/mass regression, it is important to resample the pt/mass distribution to be uniform. To implement this, a new target option named "uniform" can be supported that uses the same counter/pdf methods, but constructs a flat histogram on the fly based on the specified binning. Some additional protections should be imposed to prevent sampling in empty bins and to limit the number of times a given jet can be resampled.
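To make the proposal concrete, here is a minimal NumPy sketch of what resampling towards a flat target could look like. It is not UPP code: the function name `uniform_resample_indices`, the `max_repeats` argument, and the choice of the mean occupancy of the populated bins as the flat target are all assumptions made for illustration.

```python
import numpy as np


def uniform_resample_indices(values, bin_edges, max_repeats=5, rng=None):
    """Return indices into `values` resampled towards a flat histogram.

    Hypothetical helper, not part of UPP. Empty bins are skipped and no
    jet is selected more than `max_repeats` times.
    """
    rng = np.random.default_rng(rng)
    values = np.asarray(values)
    n_bins = len(bin_edges) - 1

    # Bin index of each jet; jets outside the binning are dropped.
    bin_idx = np.digitize(values, bin_edges) - 1
    in_range = (bin_idx >= 0) & (bin_idx < n_bins)
    counts = np.bincount(bin_idx[in_range], minlength=n_bins)

    populated = counts > 0
    if not populated.any():
        return np.empty(0, dtype=int)

    # Assumed flat target: the mean occupancy of the populated bins.
    target = int(round(counts[populated].mean()))

    selected = []
    for b in np.flatnonzero(populated):
        members = np.flatnonzero(in_range & (bin_idx == b))
        # Cap the request so that no jet needs more than max_repeats copies.
        n_draw = min(target, max_repeats * len(members))
        if n_draw <= len(members):
            # Downsample without replacement.
            chosen = rng.choice(members, size=n_draw, replace=False)
        else:
            # Upsample: copy every jet `reps` times, then top up with
            # `extra` distinct jets, so each jet appears at most reps + 1 times.
            reps, extra = divmod(n_draw, len(members))
            top_up = rng.choice(members, size=extra, replace=False)
            chosen = np.concatenate([np.tile(members, reps), top_up])
        selected.append(chosen)
    return np.concatenate(selected)
```

With a cap in place the output is only flat in bins that have enough jets to reach the target; sparsely populated bins stay below it, which is the intended protection.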

nikitapond commented 1 year ago

I have a feeling that any uniform technique will have to be incredibly aggressive, either upsampling a huge amount or only using a tiny fraction of the available jets, unless you're starting with samples with relatively flat distributions, or only running UPP over a select region of phase space. If this is still okay, then I think this is quite an easy implementation, and I can try to get it done over the weekend.
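As a rough way to put numbers on this, the cost of flattening can be estimated from the per-bin counts alone. The sketch below is only illustrative: `flattening_cost` is a hypothetical helper and the counts are made up to mimic a steeply falling spectrum.

```python
import numpy as np


def flattening_cost(counts):
    """Estimate the cost of flattening a histogram (hypothetical helper).

    Returns the fraction of jets kept if every populated bin is
    downsampled to the least populated one, and the number of copies per
    jet needed in the least populated bin if it is instead upsampled to
    match the most populated one.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]  # ignore empty bins
    kept_fraction = counts.size * counts.min() / counts.sum()
    max_repeat = counts.max() / counts.min()
    return kept_fraction, max_repeat


# Made-up counts falling by a factor of 10 per bin.
kept, repeat = flattening_cost([100_000, 10_000, 1_000, 100])
print(f"kept fraction: {kept:.4f}, max copies per jet: {repeat:.0f}")
```

For these made-up counts, pure downsampling keeps well under 1% of the jets, while pure upsampling needs 1000 copies of each jet in the last bin.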

samvanstroud commented 1 year ago

Is it really the case that the target distributions need to be uniform for regressions? Is this in the literature somewhere? I'd be interested to read more about it. I do agree with @nikitapond that there will for sure be a trade-off here with loss of statistics.

bbullard commented 1 year ago

Hi @nikitapond, thanks for volunteering. Functionality that allows the user to control how aggressively to resample towards uniform would be the most complete. That being said, I realize now that if one just wants an increase in statistics in a certain region, this can already be done manually by defining separate regions for the parts of a distribution that should be upsampled. This is probably enough to do the performance studies I have in mind, which could better motivate this feature. @samvanstroud I am not aware of anything yet in the literature about this, but naively the regression should perform worse on data that is underrepresented in the training set.
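One possible way to expose "how aggressively" as a single knob is to blend the original binned distribution with a flat one. This is a sketch under assumptions, not an existing UPP option: `blended_target` and `alpha` are hypothetical names, and linear interpolation is just one of several reasonable choices.

```python
import numpy as np


def blended_target(counts, alpha):
    """Blend the original binned pdf with a flat pdf (hypothetical helper).

    alpha = 0 keeps the original shape, alpha = 1 is fully flat over the
    populated bins. Empty bins stay at zero so they are never sampled.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    counts = np.asarray(counts, dtype=float)
    pdf = counts / counts.sum()
    populated = counts > 0
    flat = populated / populated.sum()
    return (1.0 - alpha) * pdf + alpha * flat


# Halfway between the original shape and flat, for made-up counts.
target_pdf = blended_target([100_000, 10_000, 1_000, 100, 0], alpha=0.5)
```

The resulting pdf could then serve as the reference distribution for whichever resampling method is in use.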

nikitapond commented 1 year ago

Okay, let me know if changing the binning doesn't work.

samvanstroud commented 1 year ago

It's an interesting question. I would have thought the performance was dictated more by the absolute statistics than the relative ones. In other words, if you flattened a distribution with tails, I would expect the performance to get worse for sure in the regions with reduced statistics, and I'm also not 100% confident that you would improve much in the tails (where you have the same absolute number of training examples, but they now make up a bigger fraction of the overall training set).

Another suggestion would be to apply loss weights to effectively flatten the distribution without losing any statistics. I recently added this to the classification task in salt, and it would be pretty straightforward to add it for the regression as well.
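For illustration, per-jet loss weights that flatten the binned target distribution could be computed roughly as follows. This is a NumPy sketch only and does not reflect how salt implements its weights: the names `flattening_weights` and `weighted_mse` are hypothetical, and jets outside the binning are assumed to be merged into the edge bins.

```python
import numpy as np


def flattening_weights(values, bin_edges):
    """Per-jet weights proportional to 1 / (count in the jet's bin).

    Hypothetical helper. Jets outside the binning are assigned to the
    first or last bin. Weights are normalised to a mean of 1 so that the
    overall scale of the loss is unchanged.
    """
    values = np.asarray(values)
    n_bins = len(bin_edges) - 1
    bin_idx = np.clip(np.digitize(values, bin_edges) - 1, 0, n_bins - 1)
    counts = np.bincount(bin_idx, minlength=n_bins)
    # Every jet sits in a bin with at least one entry, so no division by zero.
    weights = 1.0 / counts[bin_idx]
    return weights / weights.mean()


def weighted_mse(pred, target, weights):
    """Weighted mean squared error for a regression target."""
    pred, target, weights = map(np.asarray, (pred, target, weights))
    return np.sum(weights * (pred - target) ** 2) / np.sum(weights)
```

Every jet still contributes to the training, but each populated bin carries the same total weight. The usual caveat is that a few jets in sparse bins can end up with very large weights, so clipping them may be worth considering.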