icecube / pisa

Monte Carlo-based data analysis
http://icecube.github.io/pisa/
Apache License 2.0
19 stars 47 forks source link

Let each `Stage` transparently declare expected and new container keys upon initialisation #809

Open thehrh opened 4 weeks ago

thehrh commented 4 weeks ago

Currently, it is not exactly obvious to the user which sequence of stage.services can be combined into one working pipeline configuration without digging through a lot of code.

While each Stage has an expected_params attribute in order to communicate which params need to be defined, there is no analogous mechanism for the data fields/container keys based on which a given service makes its computations, and, e.g., modifies weights. These are generally not even documented, in contrast to expected_params. Without all of expected_params for each stage.service, it won't initialise, and neither will a pipeline containing it.

As an example (just one of many), take osc.prob3, which requires each container to contain the keys "true_coszen" and "true_energy". As of now, this is completely undocumented, and it is up to the service itself to deal with missing keys. However, shouldn't checking for missing ones be generically performed by the base Stage instead (compare expected params check)? A simple example is utils.fix_error, which requires the presence of "errors" during compute (undocumented, no exception handling beyond that in Container itself).

Of course, while typically a given data loader, during setup, populates (most) keys required by any possible later stage.service, other services may also add keys on which some subsequent service depends, and not necessarily during setup (for instance, utils.hist adds the previously mentioned "errors" key upon apply).

If a given service declared any added/new keys during initialisation, however, on the one hand it would become apparent that utils.fix_error could follow utils.hist, for example, and on the other we should be able to detect an inconsistent pipeline config upon pipeline initialisation already.

Through the above, it should also in principle become possible to create a default unit test with no additional work for anyone implementing a service, by producing dummy inputs with the required keys (perhaps combined with some rudimentary detection of the domain of definition for each required key, e.g., energy >= 0 GeV, or -1 <= coszen <= 1).

Status of development in https://github.com/thehrh/pisa-1/tree/stage_declare_expected_keys.

Would also close #811