Split train and run pipeline

matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity

Apache License 2.0

53 stars 24 forks source link

Split train and run pipeline #132

Closed niekdejonge closed 1 year ago

niekdejonge commented 1 year ago

Since RDKit is needed for training models this should not be a default dependency. We can split the package in two based on if dependencies and code are needed for training or only for running.

Split code on train and run (training can depend on run, but run not on training)
Split tests. Change github actions to run the tests for training from the tests for running. Install only the dependencies needed for this.
Split dependencies. Separate the dependencies needed for training from running.

niekdejonge commented 1 year ago

I realize that this would make it impossible to pip install MS2Deepscore for training, however I think this doe not have to be a problem, since we can just add explanation of cloning an pip install -e .[train] to the readme. Which I think is the correct approach anyway if you want to train models yourself.

florian-huber commented 1 year ago

I think that splitting dependencies (as you did Niek) is a good and not too complicated solution. I would -for now- not split the code itself because there is little gain in it (unless I miss something) but quite some extra work.

niekdejonge commented 1 year ago

Yes I agree. The main benefit would be making it easier to build into other tools. Having a clearly separated code can really help understand quickly what the code is doing and how it can be integrated, since those tools will most of the time only need the run ability and a default model.

There are some plans to build MS2Query (and MS2Deepscore) into MZMine in about a year from now. Since that would mean rewriting in Java it would make sense to only implement the running part and leave the training part as a separate part.

However, clearly separating the code can also be done at that point.

florian-huber commented 1 year ago

There are some plans to build MS2Query (and MS2Deepscore) into MZMine in about a year from now. Since that would mean rewriting in Java it would make sense to only implement the running part and leave the training part as a separate part.

That would be a very good reason to divide code, if necessary. I suggest that we sit together once those plans become more concrete! We can then choose an option that fits best to all needs, e.g. having two separate packages, but maybe also something as ms2deepscore (full version) and ms2deepscore_light (only inference).

niekdejonge commented 1 year ago

With the current developments going on it might indeed also be best to do this once the newer versions of MS2Deepscore have finalized more.

florian-huber commented 1 year ago

I would actually close this (for now). Mostly because matchms now has rdkit as default dependency. In case we face issues when working with other tools (MZMine etc.), we can reopen this or create a new issue.