gradhep / center

The center for all things differentiable analysis!
Apache License 2.0

List of HEP primitives to make differentiable #7

Open phinate opened 4 years ago

phinate commented 4 years ago

It was pointed out in the IRIS-HEP analysis systems meeting today that it would be good to compile a list of operations in HEP workflows that are not normally differentiable, e.g. cutting and histogramming data.
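To make the two examples above concrete, here is a minimal sketch of how a cut and a histogram could be relaxed into smooth operations. It is only an illustration under stated assumptions: the names `soft_cut` and `soft_histogram`, the logistic kernel, and the `steepness` and `bandwidth` parameters are hypothetical choices, not something prescribed in this thread.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_cut(x, threshold, steepness=10.0):
    """Smooth stand-in for the hard cut `x > threshold`.

    Returns a weight in (0, 1) instead of a boolean, so the result has a
    non-zero derivative with respect to `threshold` (hypothetical helper).
    """
    return sigmoid(steepness * (x - threshold))


def soft_histogram(values, edges, bandwidth=0.1, weights=None):
    """Smooth stand-in for a histogram (hypothetical helper).

    Each event is smeared with a logistic kernel of width `bandwidth`; its
    contribution to a bin is the kernel CDF difference between the bin edges.
    """
    if weights is None:
        weights = [1.0] * len(values)
    counts = [0.0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        cdf = [sigmoid((e - v) / bandwidth) for e in edges]
        for i in range(len(counts)):
            counts[i] += w * (cdf[i + 1] - cdf[i])
    return counts


# Toy usage: weight events by a soft cut at 0.5, then fill a soft histogram.
pts = [0.2, 0.9, 1.4, 2.6]
cut_weights = [soft_cut(p, threshold=0.5) for p in pts]
hist = soft_histogram(pts, edges=[0.0, 1.0, 2.0, 3.0], weights=cut_weights)
```

As `steepness` grows and `bandwidth` shrinks, both functions approach their hard counterparts, at the price of gradients that vanish almost everywhere.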

The (evolving) list — please add below!

philippwindischhofer commented 4 years ago

Some more complementary points, clearly very biased from my point of view of systematics-driven Higgs analyses:

And another point which is not really a "primitive" in terms of the analysis flow, but is a "primitive" in terms of the operational tasks we perform in practice:

Clearly, the last point is more of a long-term thing compared to the first two, but maybe it's worth keeping it in the back of our minds regardless.

pablodecm commented 4 years ago

I agree with @philippwindischhofer on the need for abstractions that allow modelling nuisance parameter variations of the two types he mentioned. Most of them could be based on simple operations already available in most autodiff frameworks:

We used a combination of them in INFERNO. These two types of variations are discussed from a statistical standpoint from first principles in my thesis (e.g. see the Known Unknowns section).

Then there are systematics as modelled by template variations and an interpolation model, as done by HistFactory and pyhf. Using template interpolation is a kind of last-resort, one-size-fits-all approach in this context (and the issues with it are evident, such as the ones pointed out in the 3.1.3.4 Synthetic Likelihood section of my thesis). If you can implement the variation with either of the first two methods, and you make it differentiable, you will get more precise likelihoods and gradients, and thus better inference. Hence, the tools and abstractions to make an analysis differentiable could also improve current inference workflows as a side effect.
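For readers unfamiliar with template interpolation, the sketch below shows a piecewise-linear scheme, one of the simplest interpolation models of this kind. The function names and toy templates are hypothetical, and this is not meant to reproduce the exact HistFactory or pyhf implementation.

```python
def interpolate_template(nominal, down, up, alpha):
    """Piecewise-linear interpolation between per-bin templates.

    alpha = 0 returns the nominal template, alpha = +1 the 'up' variation
    and alpha = -1 the 'down' variation (illustrative helper).
    """
    result = []
    for n, d, u in zip(nominal, down, up):
        slope = (u - n) if alpha >= 0 else (n - d)
        result.append(n + alpha * slope)
    return result


def interpolate_template_grad(nominal, down, up, alpha):
    """Derivative of each bin with respect to alpha.

    The slope generally jumps at alpha = 0, where this returns the
    right-hand derivative.
    """
    return [(u - n) if alpha >= 0 else (n - d)
            for n, d, u in zip(nominal, down, up)]


# Toy templates with three bins (made-up numbers).
nominal = [10.0, 20.0, 5.0]
down = [9.0, 21.0, 4.0]
up = [12.0, 18.0, 6.5]
half_up = interpolate_template(nominal, down, up, 0.5)
```

The gradient with respect to `alpha` here only knows about the three templates, not about how individual events move under the variation.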

That leads me back to abstraction and primitive design, which ultimately depends on the programming and data model. I think there are abstractions that are important and have not been mentioned yet:

What do you think?

phinate commented 4 years ago

Thanks for the detailed reply as always @pablodecm :)

  • parameters (and parameter servers/management): we need an abstraction to keep track of and manage statistical and training parameters if we want to have a generic and composable framework. This includes parameters of interest, nuisance parameters, network parameters and other parameters that could be changed or optimised. In some cases, a prior/constraint could also be attached to the parameters.

  • simulated data samples, datasets and data generation procedures: again related to @philippwindischhofer's answer, if we are going to use analysis-level objectives such as those used by neos and INFERNO, we need a way to manage simulated data from different samples and include things like subsampling routines, control of event weights and sampling fractions, etc.

I think these are important concepts, but I’m not too sure exactly how you envisage this being done in practice. I’m keen to produce something that’s more of a library than it is a framework, in that we don’t have to anticipate every single use case, and can keep things as lightweight as possible. In that sense, I would think that data generation-related things can be handled on a per-case basis, or we provide another module that handles these kinds of routines (but is explicitly independent software-wise). Possibly a similar story for what you describe with parameter tracking, though I will admit that I’m not 100% sure exactly what you mean there :P

phinate commented 4 years ago

Also @philippwindischhofer I think @alexander-held asked a q about your last point on persisting gradients through storing intermediates and reading them back into memory (and I think @lukasheinrich also mentioned something in the iris-hep meeting about this), which I think boiled down to using custom gradient ops to load in the values of saved gradients? Feel free to share your insights, not sure exactly what conclusions were reached there :)

pablodecm commented 4 years ago

Hi @phinate,

I think that to make this future library generally useful we need to build some key abstractions related to high-energy physics analysis, at the same level of abstraction as, for example, nn.Module (in Flax or PyTorch) or the PyTorch DataLoader. They do not have to cover all the use cases, but they can be built upon and define a basic API of elements and functions. They do not need to be created from scratch: some abstractions could be heavily based on existing ones in our base dependencies and can be very lightweight.

I think that for a general differentiable framework in the context of high-energy physics, some utils to manage and sample mixtures of events and their weights will be of general use. I didn't mean to include data generation in the sense of doing actual simulation or covering all cases, which would be outside of the scope as you mentioned. I meant it more in the sense of having some basic abstraction for loading data in a way that is compatible with the methods that we want to include. I think these abstractions are key for a library usable in LHC analyses.
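As a rough idea of what such a utility could look like, here is a minimal sketch. The `Sample` class, the `sample_mixture` function and the weight-rescaling convention are all hypothetical and only meant to illustrate the kind of abstraction I have in mind, assuming positive event weights.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """A simulated sample: events plus per-event weights (hypothetical)."""
    name: str
    events: list
    weights: list


def sample_mixture(samples, fractions, n_events, seed=0):
    """Draw a batch mixing several samples with the given sampling fractions.

    Events are drawn with replacement. Weights are rescaled so that each
    sub-batch carries the full sum of weights of its parent sample, which
    keeps the expected yield independent of the sampling fraction.
    """
    rng = random.Random(seed)
    total = sum(fractions)
    batch = []
    for sample, fraction in zip(samples, fractions):
        n = round(n_events * fraction / total)
        if n == 0:
            continue
        idx = [rng.randrange(len(sample.events)) for _ in range(n)]
        drawn = sum(sample.weights[i] for i in idx)
        scale = sum(sample.weights) / drawn
        batch.extend((sample.name, sample.events[i], sample.weights[i] * scale)
                     for i in idx)
    return batch


# Toy usage with made-up events: 25% signal, 75% background in the batch.
signal = Sample("signal", [1.1, 1.3, 0.9, 1.0], [0.5] * 4)
background = Sample("background", [0.1, 0.4, 0.2, 0.7, 0.3, 0.6], [2.0] * 6)
batch = sample_mixture([signal, background], [0.25, 0.75], n_events=8)
```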

The same applies to parameters: there will be many types of parameters in our differentiable analysis. The first distinction is between trainable parameters (neural network weights and other analysis handles) and statistical model parameters. Statistical model parameters could in some cases have a prior or constraint, be of interest or nuisance, etc. Again, I do not see how we can build a general differentiable analysis library without abstractions and examples for some of this.
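A very lightweight version of such a parameter abstraction might look like the sketch below. The `Parameter` and `ParameterSet` names, the `kind` labels and the Gaussian-only constraint are assumptions made for illustration, not a proposed API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Parameter:
    """One analysis parameter (hypothetical abstraction)."""
    name: str
    value: float
    kind: str  # "trainable", "interest" or "nuisance"
    constraint: Optional[Tuple[float, float]] = None  # Gaussian (mean, sigma)


class ParameterSet:
    """Keeps track of parameters and lets callers select them by kind."""

    def __init__(self, parameters):
        self._params = {p.name: p for p in parameters}

    def select(self, kind):
        return [p for p in self._params.values() if p.kind == kind]

    def values(self, kind):
        return {p.name: p.value for p in self.select(kind)}

    def constraint_penalty(self):
        """Negative log of the Gaussian constraint terms, up to a constant."""
        penalty = 0.0
        for p in self._params.values():
            if p.constraint is not None:
                mean, sigma = p.constraint
                penalty += 0.5 * ((p.value - mean) / sigma) ** 2
        return penalty


# Toy usage with made-up parameters.
params = ParameterSet([
    Parameter("mu", 1.0, "interest"),
    Parameter("jes", 1.0, "nuisance", constraint=(0.0, 1.0)),
    Parameter("cut_threshold", 0.5, "trainable"),
])
penalty = params.constraint_penalty()
```

An optimiser would then update only the `"trainable"` values, while the statistical model profiles or marginalises the `"nuisance"` ones.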

I believe API and primitive design are very hard but very important for a library. Of course, what we come up with can change and evolve in the future, so it should not stop us from prototyping and advancing, but it is important to keep it in mind.

alexander-held commented 4 years ago

Regarding the question of storing intermediate results on disk, at least conceptually this should work just fine. Besides saving the values needed for the forward pass, all the relevant gradients could be saved to the file in the same fashion. Then it's a matter of finding an API to handle the saving/loading part. For more complex distributed analyses this is crucial: I can't imagine keeping everything in memory for cluster-scale workflows. This seems like a topic where a lot of experience probably already exists from people training extremely large-scale models?
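Conceptually, something like the following toy sketch is what I have in mind. The two stages, the JSON file layout and the function names are hypothetical, and the gradient is written by hand here rather than obtained from an autodiff framework.

```python
import json
import os
import tempfile


def stage_one(events, theta):
    """Early analysis step: yield s(theta) = sum(theta * x^2) and ds/dtheta."""
    value = sum(theta * x * x for x in events)
    grad = sum(x * x for x in events)
    return {"theta": theta, "value": value, "dvalue_dtheta": grad}


def stage_two(record, target):
    """Late analysis step: loss (s - target)^2.

    The gradient with respect to theta is obtained via the chain rule from
    the Jacobian that stage one stored alongside its output.
    """
    s = record["value"]
    loss = (s - target) ** 2
    dloss_ds = 2.0 * (s - target)
    return loss, dloss_ds * record["dvalue_dtheta"]


# Stage one writes its output and gradient to disk ...
path = os.path.join(tempfile.mkdtemp(), "stage_one.json")
with open(path, "w") as f:
    json.dump(stage_one([1.0, 2.0, 3.0], theta=0.5), f)

# ... and stage two, possibly a separate job, reads them back.
with open(path) as f:
    loss, dloss_dtheta = stage_two(json.load(f), target=10.0)
```

Note that the stored gradient is only valid at the stored value of `theta`, so updating the parameter means re-running stage one.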

signorgelato commented 4 years ago

In a memory-constrained regime, it may make sense to discard the intermediate results and recompute them when computing the gradients, at the cost of more computation.
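A toy illustration of that trade-off, assuming a simple chain of scalar functions (the `LAYERS` list and both function names are made up for this sketch):

```python
import math

# Each layer is a pair (function, derivative), both evaluated at the layer input.
LAYERS = [
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.sin, math.cos),
    (lambda x: 3.0 * x, lambda x: 3.0),
]


def grad_store_all(x):
    """Keep every layer input in memory, then backpropagate."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= df(inp)
    return grad


def grad_recompute(x0):
    """Keep only the original input and replay the forward pass as needed."""
    grad = 1.0
    for k in range(len(LAYERS) - 1, -1, -1):
        inp = x0
        for f, _ in LAYERS[:k]:  # recompute the input of layer k from scratch
            inp = f(inp)
        grad *= LAYERS[k][1](inp)
    return grad
```

Both functions return the same derivative. The first needs memory proportional to the number of layers, while the second needs constant memory but a number of function evaluations that grows quadratically with depth. Practical schemes sit in between by storing a few checkpoints.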

pablodecm commented 4 years ago

In my opinion, there is not much use in storing and loading gradients through disk, other than for persisting final results in the cases where gradients are relevant.

In all use cases of differentiable analyses I can think of, and certainly in any optimisation procedure, we want to be able to change the parameters and run the forward pass again. You want to do that as fast and seamlessly as possible, so storing and loading from disk should be avoided as much as possible. I therefore see splitting the analysis into several parts and propagating gradients from the early steps to the last one through files as very cumbersome, also to coordinate from a computational standpoint, but maybe I am missing something.

For inter-framework gradient communication, which ideally should also be avoided for the same reason, I could imagine some exchange format for the gradients (could still be in memory) or some wrapping of a function in one framework in the other with custom gradients (see for example thinc.ai framework).

For out-of-memory scalability, I would think that it is almost always better to divide the execution by observations/events (what is often called data parallel) and accumulate gradients when the loss function is reduced. If the computation graph needs intermediate accumulations/reductions, then it gets a bit more involved, but it could be done with distributed extensions of the computational frameworks.
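A minimal sketch of the data-parallel idea, assuming a loss that is a plain sum over events (the toy loss and the function names are hypothetical, and the chunks are processed sequentially here instead of on separate workers):

```python
def chunk_loss_and_grad(chunk, mu):
    """Sum of squared residuals for one chunk of events and its derivative in mu."""
    loss = sum((x - mu) ** 2 for x in chunk)
    grad = sum(-2.0 * (x - mu) for x in chunk)
    return loss, grad


def accumulate(events, mu, chunk_size):
    """Process events chunk by chunk, accumulating the loss and its gradient.

    Only one chunk needs to be in memory at a time, and each chunk could just
    as well be handled by a different worker before the final reduction.
    """
    total_loss, total_grad = 0.0, 0.0
    for start in range(0, len(events), chunk_size):
        loss, grad = chunk_loss_and_grad(events[start:start + chunk_size], mu)
        total_loss += loss
        total_grad += grad
    return total_loss, total_grad


# Toy check: chunked accumulation agrees with processing everything at once.
events = [0.5, 1.5, 2.0, 3.0, 4.5]
chunked = accumulate(events, mu=2.0, chunk_size=2)
full = chunk_loss_and_grad(events, mu=2.0)
```

This works because both the loss and its gradient are sums over events, so the reduction commutes with differentiation.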

phinate commented 4 years ago

Thanks @pablodecm @signorgelato @alexander-held for your thoughts!!

Regarding the contents of a library, I think that what @pablodecm proposes makes sense in practice. What I do think would also make sense is providing these differentiable HEP primitives in a module that is not tied to a framework, such that it can be used in a flexible way. I've made a very rough start on that in this repo.

We can perhaps then come up with abstractions to wrap this and other routines of interest (e.g. parameter management and sampling routines as you suggest) in a different library, which targets more of a framework-type approach with ease of use for physicists. What do you think? :)

Also, I don't have much experience with analysis at scale, so I'm learning a lot from this discussion :D

alexander-held commented 4 years ago

Quoting @pablodecm: "storing and loading gradients through disk"

For practical applications that start early in the analysis chain, I think it might not be possible to keep everything in memory. Starting early with event selection and calibration procedures, the input can quickly be hundreds of millions of events and file sizes in the TB to PB range. Processing at this scale can only happen in a distributed way, I think, and it might not always be possible to go directly from that size down to something small enough to keep in memory for the final inference step without any intermediate steps. Intermediate steps might be good if they somehow allow parameters to be updated in an intelligent way that minimizes the need to re-run expensive calculation steps.

pablodecm commented 4 years ago

Your repo looks smooth @phinate! :wink:

I think starting with your library idea is probably best and then designing and building a more framework-like library is reasonable given the differences in scope.

Regarding scalability in practical applications @alexander-held, in some cases it will indeed not be possible to keep everything in memory, and distributed approaches might be needed. But there is much lower-hanging fruit that can be handled in memory or with simple data-parallel implementations, built in the native autodiff frameworks or using, for example, a library like dask.

EiffL commented 4 years ago

Just randomly found out about this project through GitHub and wanted to share my reaction when I looked through the issues :-D

I just had one thing to contribute to that discussion regarding making things large scale. We have had a good experience with the Mesh TensorFlow project https://github.com/tensorflow/mesh that we have used to build distributed differentiable cosmological simulations on GPU clusters. I think it's currently the best option for complex model-parallelism (i.e. more than just reduce_sum), but it is still very much a work in progress, i.e. it could use some help to further develop GPU support, as it's mainly geared towards TPUs right now. I'm having a hard time finding people in Cosmology interested in further developing Mesh TF for our kind of applications, so let me know if you want to check it out :-D Would be happy to help code up a demo for instance.

GilesStrong commented 4 years ago

@EiffL Thanks for the link, I'd not heard of Mesh before. As you will have seen we're still in the early stages of getting started and figuring out specifications and goals, so it might be a while before we finalise our selection of tools. From a quick look, Mesh sounds like something which might be quite useful, given that we deal with large volumes of data necessitating parallelism.

Out of interest, do you have any links or reviews for differentiable simulations in cosmology? It sounds similar to an ongoing effort in high-energy physics, e.g. https://cranmer.github.io/madminer-tutorial/intro

EiffL commented 4 years ago

Yep :-) Here is a blogpost and code repo.

I realize you guys are still in the early stage, but thought I would mention it as this might be something to think about in the early design stage if eventually you want to build differentiable computations at the thousands of GB scale ;-) You might need to pick a backend framework that supports distributed tensors.

signorgelato commented 4 years ago

There is Google's experimental project JAX (autograd+jit), which combines automatic differentiation (grad) with just-in-time (jit) compilation through Google's XLA to run NumPy programs on accelerators (GPUs/TPUs). Automatic vectorization (vmap) and single-program multiple-data (SPMD) programming of multiple accelerators (pmap) can also be done easily using JAX.