scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

Clarify project scope, target users, and key features in first sections of docs #2064

Open ltalirz opened 1 year ago

ltalirz commented 1 year ago

Summary

Hi,

I just came across pyhf as a reviewer of your NumFOCUS affiliated project application and wanted to share some thoughts that came to mind as someone new to the project.

While it is evident that you have put a lot of thought and effort into the documentation, it still contains lots of language that will be familiar only to high-energy physicists [1]. I think your NumFOCUS application would be a good point in time to stop and think about your target user base. Do you see applications of pyhf outside high-energy physics? If so, what would be examples of those? If not, it is still worth pointing out where you see its strengths. What are the features that distinguish pyhf from other general-purpose fitting tools?

Answering these questions early in both the README and documentation of pyhf could have a significant effect on the uptake of pyhf in the future.

Cheers, Leopold

[1] For example, the first sentence in the documentation currently reads

> The HistFactory p.d.f. template [CERN-OPEN-2012-016] is per-se independent of its implementation in ROOT and sometimes, it’s useful to be able to run statistical analysis outside of ROOT, RooFit, RooStats framework.

This is obviously gibberish to anyone outside the HEP community. The SciPy talk on pyhf instead begins with

> pyhf is a pure Python statistical fitting library from the high-energy physics community that leverages both tensor library backends as well as automatic differentiation

which seems to be a much more approachable description of what the package does.

Documentation Page Link

https://pyhf.readthedocs.io/en/v0.7.0/


matthewfeickert commented 1 year ago

Thanks for the feedback @ltalirz. We truly appreciate the time and thoughtfulness of your feedback! :+1:

> While it is evident that you have put a lot of thought and effort into the documentation, it still contains lots of language that will be familiar only to high-energy physicists.

Yes. Like many NumFOCUS affiliated and sponsored projects, pyhf is primarily domain specific to particle physics, so we use terminology that is common to the target user audience. Every field has jargon: while I understand basically none of the text on NiBabel's NumFOCUS page, it is clearly meant for the neuroimaging community and so isn't gibberish, just domain jargon.

Regardless, we can look at revising information to be more approachable for people who aren't familiar with common statistical procedures and tools in high energy physics (HEP).

> I think your NumFOCUS application would be a good point in time to stop and think about your target user base. Do you see applications of pyhf outside high-energy physics? If so, what would be examples of those? If not, it is still worth pointing out where you see its strengths.

Are these questions part of your review of our application or general curiosity? It of course doesn't matter, but it will help to understand at what depth to answer them.

Without knowing this, I'll just say for now that pyhf is meant to be a domain-specific tool, and our primary focus is to support experimental and theoretical high energy physics. However, we are currently trying to help physicists outside of high energy physics understand how a pyhf-focused statistical analysis workflow would work for their research questions, and whether it would be a good fit for their goals and toolchains.

In terms of what distinguishes pyhf from general purpose tooling for the needs of the particle physics community, we tried to touch on this a bit in our response to the

> Describe how your project furthers the NumFOCUS mission: https://numfocus.org/community/mission

application question:

https://github.com/pyhf/numfocus-affiliate-project-application/blob/3f3024b3e13e12354763f751a8976f8c48cb421d/README.md?plain=1#L56-L57

in addition to covering the specific details of the statistical models that are common to HEP and not many other domains in the docs and the user guide (and so would be well outside the scope of any general purpose statistical modeling libraries).

Though I think you're asking what the high-level, one-sentence summary would be that a university student or a technical person could read, and still get a useful takeaway from, when just coming across the PyPI page. There I very much agree with you that

https://github.com/scikit-hep/pyhf/blob/d9355e23ffd4aceb24041c51c697a55fa40a3d94/README.rst?plain=1#L26-L33

could be expanded and given more context. :+1:

These sorts of reminders are helpful for us to make sure that we're not making it unnecessarily difficult for scientists to get going (we can't do anything to simplify the complexity of research questions or the statistical concepts needed, but we can at least make it easier for a new user to orient themselves).

ltalirz commented 1 year ago

Hi @matthewfeickert, thanks for your very quick reply!

> Are these questions part of your review of our application or general curiosity?

My questions are just meant as food for thought, and not a specific part of the application review process.

Being a domain-specific tool is of course completely valid and compatible with NumFOCUS's mission. Yet, in my mind, the "be open" and "be kind" principles ask us to keep an open mind about where future users may come from, and I am happy to read in your reply that you are already thinking in this direction.

Since statistical analysis is used in so many scientific fields, I was just curious what makes the methods in pyhf specific to high-energy physics, and whether they could be useful in other fields as well. I assume that other first-time visitors of your documentation or github repo will have the same question, and preparing an answer early in the documentation might be a low-effort high-impact activity.

kratsg commented 1 year ago

> Since statistical analysis is used in so many scientific fields, I was just curious what makes the methods in pyhf specific to high-energy physics, and whether they could be useful in other fields as well. I assume that other first-time visitors of your documentation or github repo will have the same question, and preparing an answer early in the documentation might be a low-effort high-impact activity.

This is a valid point, and one that is not really underscored in this project. Since the original implementation of the mathematical model(s) used here was only ever in ROOT (a large, monolithic C++ project with Python bindings), the initial goal was to at least extract this into pure Python and determine what is missing from the scientific Python ecosystem. There is an overarching goal to expand the capabilities for statistical analysis in Python by using ROOT as inspiration (with the open-world implementation there), but this project will teach us what to look for and inform how best to design things moving forward.