madminer-tool / madminer-workflow

Madminer complete cloud-based analysis
MIT License

Create MLProject file for Madminer Workflow #22

Closed BenGalewsky closed 4 years ago

BenGalewsky commented 4 years ago

As an analyzer I want to run the MadMiner steps out of an MLflow Project so I can take advantage of the service's features

Acceptance Criteria

  1. MLflow runs its steps via an MLProject YAML file (https://mlflow.org/docs/latest/projects.html#specifying-projects).

Assumptions

  1. Use the Docker Container Environment specification to avoid having to recode everything into Conda
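
For illustration, a minimal sketch of what an MLproject file using the Docker container environment could look like, wrapped in a small Python helper that writes it to disk. The project name, image name, entry point, parameter, and script path are all placeholders assumed for this example; they are not taken from the repository.

```python
from pathlib import Path
import tempfile

# Hypothetical MLproject contents: the project name, image name, entry point,
# parameter, and script path are placeholders, not taken from the repository.
MLPROJECT_TEXT = """\
name: madminer-workflow

docker_env:
  image: madminer-workflow-image

entry_points:
  main:
    parameters:
      input_file: {type: path, default: input.yml}
    command: "python scripts/run_step.py --input {input_file}"
"""


def write_mlproject(project_dir: Path) -> Path:
    """Write the sketch above to <project_dir>/MLproject and return its path."""
    target = project_dir / "MLproject"
    target.write_text(MLPROJECT_TEXT, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Write the file into a throwaway directory and print it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_mlproject(Path(tmp))
        print(path.read_text(encoding="utf-8"))
```

With such a file at the project root, `mlflow run .` would build on the named image and execute the `main` entry point inside the container, assuming MLflow and Docker are installed locally.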

Sinclert commented 4 years ago

From my initial understanding of what we want to accomplish, and how the MLproject file is defined (reference), it seems that only one Docker environment can be defined per MLproject.

Therefore, I can think of two options for using MLflow in this repository:


Option A: All the steps

This approach would encapsulate all the workflow steps (those using Physics capabilities and those using ML capabilities) under the same MLproject. The only Docker image that has the necessary dependencies to run both Physics and ML steps is the docker-madminer-all image*.

This approach drifts away from the original purpose of MLflow, as we would be including an enormous Physics overhead (very heavy dependencies, non-trackable outputs made up of thousands of simulated events...).

* Clarification: the original purpose of this image was to run this Jupyter notebook, where the code was externally provided, so its Dockerfile would need to change in order to copy all Madminer workflow code and scripts.


Option B: Just ML steps

This approach would only encapsulate the ML part of the workflow. The Docker image to define within the MLproject file would be the docker-madminer-ml image.

This approach fits the purpose of MLflow, which is, as we all know, ML.

The question this raises is: where does REANA fit in all of this? From my understanding, there are two ways of making them coexist in the same workflow:


Personal opinion:

I think option B makes more sense, as it applies MLflow to a smaller and more specific domain (the domain it was originally designed for). That said, I don't yet have a clear idea of which REANA integration alternative is better.

Any thoughts and feedback on how to tackle this integration are welcome 😄

BenGalewsky commented 4 years ago

There is also option A.5 - use two MLprojects. Also, I would like to see how this works in MLflow and then see what role REANA plays. It will become more clear as we take the next few steps.
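
As a rough sketch of the two-MLproject idea, the snippet below builds one `mlflow run` command per project, so that each project can declare its own Docker environment. The `./physics` and `./ml` directories, the `main` entry points, and the parameter names are hypothetical; only the `mlflow run` command with its `-e` and `-P` options comes from the MLflow CLI.

```python
import shlex
from typing import Dict, List


def mlflow_run_command(project_uri: str, entry_point: str, params: Dict[str, str]) -> List[str]:
    """Build an `mlflow run` invocation as an argument list (not executed here)."""
    command = ["mlflow", "run", project_uri, "-e", entry_point]
    for name, value in params.items():
        # Each project parameter is passed as -P NAME=VALUE.
        command += ["-P", f"{name}={value}"]
    return command


# Hypothetical layout: one directory per MLproject, each with its own docker_env.
STEPS = [
    ("./physics", "main", {"output_dir": "data/events"}),
    ("./ml", "main", {"input_dir": "data/events"}),
]

if __name__ == "__main__":
    # Print the commands in order; the ML step consumes what the physics step wrote.
    for uri, entry, params in STEPS:
        print(shlex.join(mlflow_run_command(uri, entry, params)))
```

Passing each list to `subprocess.run(..., check=True)` would execute the steps in sequence, assuming MLflow and Docker are installed. Whatever coordinates the steps (a script like this one, or REANA) only needs to hand the output location of the first project to the second.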

Sinclert commented 4 years ago

Some news on this issue:

We recently decided to split this repository into 3, as its contents were organized in 3 separate folders with very few interactions and references between them (see issue https://github.com/scailfin/madminer-workflow/issues/26 for the full explanation).

With the new division of responsibilities, I would argue that Option B, Alternative 1 is the way to go:

Use REANA to coordinate the whole workflow, with MLflow present in the ML part of it.


I will resume this issue once we have both madminer-workflow-ph and madminer-workflow-ml ready for Yadage execution.

Sinclert commented 4 years ago

The main PR has been merged.

Feel free to close this issue (I cannot).