matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity

Training crashes when too few spectra are used. #155

Open niekdejonge opened 8 months ago

niekdejonge commented 8 months ago

Training an ms2deepscore model fails in an unexpected way if there are not enough spectra in the training_generator or validation_generator.

When the training_generator does not have enough spectra, the error is: ValueError: Unexpected result of train_function (Empty logs). This could be due to issues in input pipeline that resulted in an empty dataset.

When the validation_generator does not have enough spectra, the error is: FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 'ms2deepscore_model.hdf5'), probably because the checkpointer did not work.

These errors are uninformative and likely to be common, since many people will first test model creation on a low number of spectra, and for the test suite I try to use as few spectra as possible.

I want to catch the problem before training and raise an informative error, but I have not yet been able to debug it. TensorFlow also seems to interfere with the debugger while running, which makes it hard to find the issue. It is unclear to me what the minimal requirement is (number of inchikeys? number of spectra? number of Tanimoto scores in each bin?).

To reproduce: go to test_train_wrapper_ms2ds_model in test_train_ms2deepscore.py and add [:10] to one of the input spectrum lists.

justinjjvanderhooft commented 8 months ago

Hmm, until we find out the real minimum input needed (or the actual culprit), could we alert users who attempt to provide too few input spectra?

niekdejonge commented 8 months ago

Yes, that is what I want to do. I could just set the threshold to 100 spectra or 100 unique inchikeys, but since I am not fully sure what the requirement is, it is hard to pick a number. I hope @florian-huber can help out and maybe knows the minimum required number of spectra.

Normally something like this is easy to debug, because you get clear errors that can be traced back, but I think TensorFlow does some kind of run optimization that makes it difficult to find the exact issue.
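
For reference, counting the unique inchikeys in a spectrum list before training could look like the sketch below. This is only an illustration, not ms2deepscore code; it assumes matchms-style spectra whose metadata is accessed via spectrum.get("inchikey"), and it compares the first 14 characters (the 2D-structure block of an InChIKey).

```python
def count_unique_inchikeys(spectra):
    """Count distinct 14-character InChIKey prefixes in a list of spectra.

    Illustrative sketch only; assumes matchms-style spectra where metadata
    is accessed via spectrum.get("inchikey").
    """
    inchikeys = set()
    for spectrum in spectra:
        inchikey = spectrum.get("inchikey")
        if inchikey:
            # The first 14 characters encode the 2D structure, which is what
            # "unique inchikeys" refers to here.
            inchikeys.add(inchikey[:14])
    return len(inchikeys)
```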

niekdejonge commented 8 months ago

In MS2Query I solved this by simply using large test sets of 2000 spectra, so the issue never occurs, but as a result the tests run very long (minutes), which is impractical during development. It is probably worth finding the exact cause, so we can design test sets that are as small as possible without breaking the training.

justinjjvanderhooft commented 8 months ago

Okay, let's see what @florian-huber knows and otherwise we could see if there is a forum where we could post such a question?

niekdejonge commented 8 months ago

I think I figured it out now. It seems like the number of unique inchikeys has to be larger than the batch size. So for the tests we can use smaller batch sizes during training to allow for a low number of test spectra, and I will add a check in the data generators that verifies there are more unique inchikeys than the batch size.
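
If the data generator derives its number of batches from the count of unique inchikeys, that would explain the "Empty logs" error: with fewer unique inchikeys than the batch size, an epoch contains zero batches. The snippet below only illustrates that arithmetic; it is an assumption about the cause, not code taken from the ms2deepscore generators.

```python
import numpy as np

# Suspected cause (assumption, not verified against the ms2deepscore source):
# if batches per epoch are computed as floor(n_unique_inchikeys / batch_size),
# a small test set yields zero batches, which Keras then reports as
# "Unexpected result of train_function (Empty logs)".
n_unique_inchikeys = 10   # e.g. a test set truncated with [:10]
batch_size = 32           # a typical training batch size
batches_per_epoch = int(np.floor(n_unique_inchikeys / batch_size))
print(batches_per_epoch)  # 0 -> the training loop never receives any data
```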

justinjjvanderhooft commented 8 months ago

Ah, great, could we include a check that this is the case when users start a (test) training, and return a useful error if not?

niekdejonge commented 8 months ago

Yes, I will add a check in the data generators that verifies there are more unique inchikeys than the batch size. This should indeed return a clear error message.
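
A minimal sketch of what such a check could look like, reusing the InChIKey-counting idea from the earlier sketch. The function name and error wording are illustrative, not the actual ms2deepscore implementation; it again assumes matchms-style spectra with a .get("inchikey") metadata accessor.

```python
def assert_enough_unique_inchikeys(spectra, batch_size):
    """Raise an informative error when there are too few unique InChIKeys.

    Illustrative sketch only; not the actual ms2deepscore check.
    """
    unique_inchikeys = {
        spectrum.get("inchikey")[:14]
        for spectrum in spectra
        if spectrum.get("inchikey")
    }
    if len(unique_inchikeys) <= batch_size:
        raise ValueError(
            f"Only {len(unique_inchikeys)} unique InChIKeys were found, but the "
            f"batch size is {batch_size}. Training requires more unique InChIKeys "
            "than the batch size; provide more spectra or lower the batch size."
        )
```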