Closed: florian-huber closed this issue 1 year ago
I would suggest using a dictionary like so:
additional_input = [{"feature_name": "precursor_mz", "scaling": 0.001}]
spectrum_binner = SpectrumBinner(bins, mz_min=10, mz_max=1000, peak_scaling=0.5, additional_metadata=additional_input)
With the SpectrumBinner being part of a saved model, we can save and load this setting together with the SpectrumBinner, and the scaling is then also applied automatically when predicting.
Additionally, the DataGenerator should be adjusted so that:
data_generator = DataGeneratorAllInchikeys(binned_spectrums, ..., additional_input=additional_input)
is still possible.
This looks like it would work.
It would use a list of dictionaries, which would do the job. However, this pattern is not very common, probably because repeating the dictionary keys introduces a lot of redundancy:
additional_input = [{"feature_name": "precursor_mz", "scaling": 0.001}, {"feature_name": "retention_time", "scaling": 0.01}]
An alternative could be to store it as
additional_input = {"precursor_mz": 0.001, "retention_time": 0.01}
That is much more compact, but of course it requires users to learn that field-name/scaling pairs are expected (fine with me).
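For illustration, here is a small sketch (hypothetical helper functions, not part of the ms2deepscore API) showing how the compact field-name/scaling mapping could be expanded into the more explicit list-of-dicts form, and how the scaling would then be applied to a spectrum's metadata values:

```python
def expand_additional_input(compact):
    """Turn the compact {field_name: scaling} mapping into the explicit
    list-of-dicts form discussed above."""
    return [{"feature_name": name, "scaling": scale}
            for name, scale in compact.items()]


def scale_metadata(metadata, additional_input):
    """Return the scaled metadata values, in the order of additional_input."""
    return [metadata[entry["feature_name"]] * entry["scaling"]
            for entry in additional_input]


compact = {"precursor_mz": 0.001, "retention_time": 0.01}
expanded = expand_additional_input(compact)
# expanded == [{"feature_name": "precursor_mz", "scaling": 0.001},
#              {"feature_name": "retention_time", "scaling": 0.01}]

metadata = {"precursor_mz": 512.3, "retention_time": 125.0}
print(scale_metadata(metadata, expanded))  # [0.5123, 1.25]
```

Either representation carries the same information, so supporting the compact form externally while normalizing to the explicit form internally would be straightforward.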
I have pushed an update to the additional_input_parameters branch according to your approach. It seems that matchms had a change in the BaseSimilarity.matrix parameters, so pylint is throwing an error.
This has been included in #124
Discussed with @djoas:
Solution: Include additional_inputs (or additional_features) as a model parameter that contains both the metadata field names and a scaling factor. Could be a list of tuples or a dictionary.