matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity
Apache License 2.0
53 stars 24 forks source link

Duplicate inchikeys in tanimoto scores triggers opaque error during training #97

Closed zdk123 closed 2 years ago

zdk123 commented 2 years ago

Thanks for this cool package!

I just wanted to report this bug I just tripped over while attempting to retrain the Siamese network. It's obvious in hindsight that the user shouldn't be trying to pass duplicate inchikeys in the tanimoto score DataFrame. But right now this triggers are rather uninformative error:

   inchikey2 = self._find_match_in_range(inchikey1, target_score_range)
  File "/home/zkurtz/.conda/envs/ms2deepscore/lib/python3.8/site-packages/ms2deepscore/data_generators.py", line 203, in _find_match_in_range
    matching_inchikeys = self.reference_scores_df.index[
  File "/home/zkurtz/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4616, in __getitem__
    result = getitem(key)
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

And even after de-duplicated the full keys, if there are still duplicated inchikey14s, the bug that gets triggered is:

  File "/home/zkurtz/.conda/envs/ms2deepscore/lib/python3.8/site-packages/ms2deepscore/data_generators.py", line 224, in __getitem__
    X, y = self.__data_generation(spectrum_pairs)
  File "/home/zkurtz/.conda/envs/ms2deepscore/lib/python3.8/site-packages/ms2deepscore/data_generators.py", line 291, in __data_generation
    y[i_pair] = self.reference_scores_df[pair[0].get("inchikey")[:14]][pair[1].get("inchikey")[:14]]
ValueError: setting an array element with a sequence.

I can provide an example if needed - you could also consider deduplicating (with a warning?) during the data_generator construction.

svenvanderburg commented 2 years ago

@zdk123 nice that you can make good use of the package! Thanks for spotting this!

I think we should just raise a more informative exception if duplicate inchikeys are passed.

If you want @zdk123 you could try to implement this yourself? See https://github.com/matchms/ms2deepscore/blob/main/CONTRIBUTING.md on more information for how to contribute.

zdk123 commented 2 years ago

sure, I can take a stab at that!

svenvanderburg commented 2 years ago

Cool!

svenvanderburg commented 2 years ago

@zdk123 thank you for your awesome contribution! Let me know if you are interested in becoming a collaborator :)

zdk123 commented 2 years ago

@svenvanderburg sure, I would be happy to!

florian-huber commented 2 years ago

Hi @zdk123 , thanks for making this issue and welcome on-board 😃 . I just invited you to be collaborator on this project and gave you writing rights (e.g. to start pull requests). Let me know if you need any help.

zdk123 commented 2 years ago

Thanks!

On Thu, Mar 10, 2022 at 3:12 AM Florian Huber @.***> wrote:

Hi @zdk123 https://github.com/zdk123 , thanks for making this issue and welcome on-board 😃 . I just invited you to be collaborator on this project and gave you writing rights (e.g. to start pull requests). Let me know if you need any help.

— Reply to this email directly, view it on GitHub https://github.com/matchms/ms2deepscore/issues/97#issuecomment-1063778522, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAUD2RCEKXTIRFYCOEAJNSDU7GVHRANCNFSM5PJEOIBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>