scikit-hep / pyBumpHunter

Python implementation of the BumpHunter algorithm used by HEP community.
BSD 3-Clause "New" or "Revised" License
17 stars 9 forks source link

Update #14

Open lovaslin opened 1 year ago

lovaslin commented 1 year ago

Hello !

I would like to propose this new version for pyBumpHunter. It includes new features, such as a fit of the test statistic distribution and 2D signal injection. I also propose several improvements concerning the side-band normalization procedure and the multi-channel combination.

The API as been also revisited. The objective here is to separate the histogramming part and the Bump Hunt part. So I made a new DataHandler class that will handle histograms. It makes thing much clearer/cleaner in the code and easier to understand for users.

For the next release, I propose to move to version 1.0.0. Sine the API is completely different and not retro-compatible, it might appropriate.

Any suggestions or questions concerning the new features and API are of course welcome. I suppose I should remove Python 3.6 and add 3.10 at some point ...

Edit : I removed Python 3.6 since it seems to cause some random bugs.

henryiii commented 1 year ago

FYI, setup-cfg-format can help you keep the classifiers up (though it needs configuration to do so, see the Scikit-HEP developer pages).

jonas-eschle commented 1 year ago

Hi, thanks a lot for the update and it's great that you're moving things forward, also I see that you incorporate a lot of the improvements discussed last time.

There is still a large objection to a lot of the code though: it seems as if you're re-inventing all the wheels that exist. An essential part is to rely on other packages and do as little as possible for yourself. The problem: the more code we write, the more we have to maintain. And if there is very similar code in multiple packages, it means that we need to multiply our efforts without gain.

For example:

Does that make sense? And don't get me wrong: maybe there are good reasons not to use any of the packages. Maybe it's too simple. Maybe this or that, can be, I am not an expert on the code, you are ;)

This also simplifies things in another way: you don't need to provide so much functionality to the user such as plotting a histogram. Because if you're taking and using a UHI histogram, it's too simple for users to plot (i.e. mplhep.histplot(h)). There is no point in writing a wrapper around that (apologies if I misread the part with the histplot, I was looking through the code and maybe I missed a point and it's actually needed).

But to emphasize: there is a lot of good work in it! And there are definitely useful methods. I am just thinking also on the broader picture: some functions can maybe be put to different places (see e.g. for plotting of statistics https://github.com/scikit-hep/hepstats). The more functions you can move to another package and then be used from bump_hunter (convince the author that it fits well into it) or directly use from another package, the less burden you will have to fix and adjust things in the future, but it's the problem of another package more dedicated to the topics. And that's the "goal" of creating a stable package ecosystem.

Or what do you think?

eduardo-rodrigues commented 1 year ago

FWIW I think there are lots of great points made above by Henry and Jonas. You should really try and engage more with the other org admins as it can only help you ... ;-)

These updates are of course most welcome. You should proceed ... Personally I would have much prefered to go with scoped PRs, which are so much easier to (self)read and (self)review. My 2 cents.

lovaslin commented 1 year ago

Hello !

Thank you for your comments and advice. I will try to answer all the points that were made.

Concerning the histograms, maybe moving to hist package would be better. However, I still need to make sure that using such objects is compatible with what I want to do with them. In particular when I build the pseudo-data histograms and compute the local p-values with numpy/scipy.

About the Data_Handler class, the objective is to have a single container with all the data needed to run BumpHunter. It contains the different kinds of histograms for multiple channels. It also ensure that histograms of the same channel have compatible binning/range, which is mandatory in this case. Once the DataHandler is built, it becomes the only argument of the BumpHunter methods, which is also quite convenient. I also like the idea of giving to the user the choice of how they build histograms before giving them to pyBumpHunter. So this DataHandler class can be used as some kind of interface that can convert different formats into the one chosen for internal use. Though this last point might imply a lot of additional work ...

For the plotting methods, I admit that they are not necessarily useful. So maybe they could be removed. I just added them because I found them convenient at use (like some kind of shortcut). Though if I move to hist or boos-histogram, that would probably change.

Now about the fit and global p-values, I tried to use iminuit for both binned and unbienned fit. In the end, for this particular problem, I found that scipy gives more stable results. I dont kwow why, but iminuit seems to have trouble converging in this case, giving a very bad result. I also thought about using zfit, but since I am doing a simple fit and since scipy is doing a good job, it didn't seem necessary. Now, if I move to hist, it could be better.

However, I could use a more precise/efficient way to compute the global p-value out of the fit function. This numerical integration with scipy is probably not the best way. I am open to any suggestion on that point.

About setup-cfg-format, I didn't know about it. I will look into it.

So there are still things to improve. This is why I made this PR to see what people could propose to help me improve things. In any case, I will request a few code reviews before proceeding to the merge.

Thank you again for your advice, and have a nice day !