BenGalewsky commented 4 years ago

As an analyzer I want to run the MadMiner steps out of an MLflow Project so I can take advantage of the service's features

Acceptance Criteria

MLflow runs its steps via an [MLproject YAML file](https://mlflow.org/docs/latest/projects.html#specifying-projects).

Assumptions

Use the Docker Container Environment specification to avoid having to recode everything into Conda

Sinclert commented 4 years ago

From my initial understanding of what we want to accomplish, and how the MLproject file is defined (reference), it seems that only one Docker environment can be defined per MLproject.

Therefore, I can think of two options for using MLflow in this repository:

Option A: All the steps

This approach would encapsulate all the workflow steps (those using Physics and those using ML capabilities) under the same MLproject. The only Docker image with the dependencies needed to run both Physics and ML steps is the docker-madminer-all image*.

This approach strays from the original purpose of MLflow, as it brings in an enormous Physics overhead (very heavy dependencies, non-trackable outputs made up of thousands of simulated events...).

* Clarification: the original purpose of this image was to run this Jupyter notebook, where the code was externally provided, so its Dockerfile would need to change in order to copy all Madminer workflow code and scripts.

Option B: Just ML steps

This approach would only encapsulate the ML part of the workflow. The Docker image to define within the MLproject file would be the docker-madminer-ml image.

This approach fits the purpose of MLFlow, which is, as we all know, ML.
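To make Option B more concrete, here is a minimal sketch of what the MLproject file could contain, rendered from Python so it is easy to tweak. The `docker_env` / `image` and `entry_points` / `command` keys follow the MLproject format; the project name `madminer-ml` and the `python train.py` command are placeholders I made up, not the real workflow scripts.

```python
from textwrap import dedent


def render_mlproject(name: str, image: str, command: str) -> str:
    """Return the text of a minimal MLproject file with a single Docker environment."""
    return dedent(
        f"""\
        name: {name}

        docker_env:
          image: {image}

        entry_points:
          main:
            command: "{command}"
        """
    )


# The image name comes from this thread; the project name and the command
# are hypothetical placeholders.
mlproject_text = render_mlproject(
    name="madminer-ml",
    image="docker-madminer-ml",
    command="python train.py",
)
print(mlproject_text)
```

Note that there is only one `docker_env` block per file, which is the constraint that forces the choice between Option A and Option B.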

The question this raises is: where does REANA fit in all of this? From my understanding, there are two ways to make them co-exist in the same workflow:

Alternative 1: Use REANA to coordinate the whole workflow, with MLflow present only in the ML part of it.
Alternative 2: Split REANA and MLflow usage, so that the Physics part runs on REANA and the ML part runs on MLflow. This is interesting, but it introduces one issue: how to coordinate Input <--> Output between the last step of REANA and the first step of MLflow 🤔
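One possible way to handle the Input <--> Output handoff in Alternative 2 is to pass the location of REANA's final outputs to the first MLflow step as a project parameter. The sketch below only builds the `mlflow run` command line (`-e` selects the entry point, `-P` passes a `name=value` parameter). The paths and the `input_dir` parameter name are hypothetical, and it assumes the MLproject entry point declares such a parameter.

```python
import shlex
from typing import Dict, List


def build_mlflow_run_command(
    project_uri: str, entry_point: str, params: Dict[str, str]
) -> List[str]:
    """Build an `mlflow run` invocation that passes each parameter with -P."""
    command = ["mlflow", "run", project_uri, "-e", entry_point]
    for key, value in params.items():
        command += ["-P", f"{key}={value}"]
    return command


# Hypothetical locations: where the last REANA step left its results,
# and where the ML project (the folder holding the MLproject file) lives.
reana_output_dir = "/data/reana/physics-output"
command = build_mlflow_run_command(
    project_uri="./ml-project",
    entry_point="main",
    params={"input_dir": reana_output_dir},
)
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run` once the Physics part has finished. This does not solve who triggers it, which is the open coordination question.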

Personal opinion:

I think option B makes more sense, as it applies MLflow to a smaller and more specific domain (the one it was originally designed for). That said, I don't yet have a clear idea of which REANA integration alternative is better.

Any thoughts and feedback on how to tackle this integration are welcome 😄

BenGalewsky commented 4 years ago

There is also option A.5: use two MLprojects. Also, I would like to see how this works in MLflow first and then see what role REANA plays. It will become clearer as we take the next few steps.

Sinclert commented 4 years ago

Some news on this issue:

We recently decided to split this repository into 3, as its contents were organized in 3 separate folders with very few interactions and references across them (see issue https://github.com/scailfin/madminer-workflow/issues/26 for the full explanation).

With the new division of responsibilities, I would argue that Option B with Alternative 1 is the way to go:

Use REANA to coordinate the whole workflow, with MLflow present only in the ML part of it.

I will resume this issue once we have both madminer-workflow-ph and madminer-workflow-ml ready for Yadage execution.

Sinclert commented 4 years ago

The main PR has been merged.

Feel free to close this issue (I cannot).

madminer-tool / madminer-workflow

Create MLProject file for Madminer Workflow #22
