Implement KDE bootstrapped error estimates and more

atrettin commented 2 years ago

This PR adds the option to estimate errors on KDE'd histograms using the bootstrap_kde class from the KDE module that we already used. It's of course very slow and not recommended to be used in a fit, but it is very useful for fitting hyper surfaces with statistically sound errors. Other changes are:

To the aeff.weight stage:
- Scale errors along with weights in the aeff.weight stage
- This makes it possible to do the weighting after making histograms in a pipeline. In this way, KDE histograms have to be calculated only once before a fit starts, and the histograms are simply scaled with the efficiency.
to the utils.kde stage:
- Add option to stash histograms in the KDE stage. Ties into the point above: the (expensive) KDEs only need to be done once before a fit.
- Calculate KDEs in log-space for variables that are binned logarithmically (i.e. energy). Significantly decreases distortions in the energy dimension.
to the utils.hist stage:
- Add option to calculate unweighted histograms. Makes it possible to look at raw MC counts, which is useful for diagnostics. Can be used to exclude points with a very small number of MC events from Hypersurface fits, where the Gaussian assumption would be badly broken.
to hyper surface code:
- Add option to exclude data points with too little MC or weighted count from the fits.
- Support Pipelines that use KDEs to produce histograms as long as they use bootstrapping.
- Plot bin-wise fits on fixed range. Should probably be an option rather than being hard-coded...
Ran the KDE code through black to make it nicer.

philippeller commented 2 years ago

I think it looks mostly good! The only thing I am a little hesitant with is the stashing mechanism....just wondering if this will bite us at some point.... Do the parameters in stages before actually not change, or do they still change and the changes are being ignored?

Back in the day we made for example Param sets hasable....so one can quite easily detect if any changes were made. So maybe there we could think of a more global mechanism of stashing....maybe?

atrettin commented 2 years ago

@philippeller Yes, the stashing is a bit of a use-at-your-own-risk thing. All changes in previous stages after the first call to apply are ignored. I'm not aware of a way that a stage could know that the weights have changed and require re-running of the KDE... It's tricky because it would need to know just if any parameter in a stage before has changed. If there is a way to do it I would gladly implement it, otherwise we can give out a warning when the setting is enabled about the behavior.

atrettin commented 2 years ago

Or, better, we run a check on the pipeline after instantiation to look for configuration errors and produce an error if the stage before the KDE stage has a parameter that is free while stashing is enabled. Errors are better than warnings, we still have tons of warnings in PISA that just get ignored... (actually an issue on its own, too many warnings)

philippeller commented 2 years ago

I think cleanest would be to implement stashing in general at the pipeline level.

So you would stay: stash at stage X, and then the pipeline will not run anything up to stage X, except when a parameter has changed. What do you think?

It probably should then keep an internal copy of the weights and apply those, just to avoid side effects, but that is low cost

atrettin commented 2 years ago

That sounds like a good idea! Though, I'm wondering how exactly we'd go about it to make it safe in general. The weights are often manipulated in place in a pipeline, so as you said it would have to store an internal copy of the weights and re-apply them. But what if there is more than just "weights" that need to be applied? I guess we can say that we store everything that's in apply_mode... I need to think about this.

philippeller commented 2 years ago

we could make the user specify what keys to be cached! So this way would be explicit.

E.g. in the cfg specify: At stage X cache keys x, and y

philippeller commented 2 years ago

We could do that in the [pipeline] section of the cfg most naturally. Add a optional: stash = {5 : ['weights', 'blah', ...]} or so

atrettin commented 2 years ago

Hey, @philippeller!

Sorry I was too busy with other stuff, now coming back to this. Could we please find a compromise here? I understand that it would be nice to implement caching globally, but right now it's only used for the KDE stage and for a very specific reason. I see the point that it could lead to unexpected behavior if used inadvertently, and I'm prepared to make changes that make it harder to mis-use such as calling the caching flag ignore_free_parameters_in_earlier_stages and also issuing a warning. But I just don't have time to implement caching consistently throughout PISA right now, in a way that is generally applicable, just to end up using it in this one stage.

Could we please just pass the salt this one time?

philippeller commented 2 years ago

sure, i do not want to be an unreasonable pain. Can you create an issue about our discussion, and that we should fix it soon? Then we will merge the PR as is for now

philippeller commented 2 years ago

Ok, just did that. Will merge the PR now

atrettin commented 2 years ago

Thank you!

icecube / pisa

Implement KDE bootstrapped error estimates and more #689