kusterlab / prosit

Prosit offers high quality MS2 predicted spectra for any organism and protease as well as iRT prediction. When using Prosit is helpful for your research, please cite "Gessulat, Schmidt et al. 2019" DOI 10.1038/s41592-019-0426-7
https://www.proteomicsdb.org/prosit/
Apache License 2.0
85 stars 45 forks source link

Unsupported amino acids should ideally yield warnings, not crash the program #41

Open fburic opened 4 years ago

fburic commented 4 years ago

From a usability perspective, I would suggest that sequences with unsupported amino acids be skipped and warnings be issued. The current behavior simply has the program crash with a KeyError. The note on the main page about these amino acids is clear but having the user sanitize their input seems like much more effort than handling this through Prosit.

Here is my quick and dirty patch (just to get the tool running for me). Probably not the best way to handle this but it does rely on utils.peptide_parser() to not duplicate logic.

diff --git a/prosit/tensorize.py b/prosit/tensorize.py
index d4eefdb..d94127f 100644
--- a/prosit/tensorize.py
+++ b/prosit/tensorize.py
@@ -1,4 +1,5 @@
 import collections
+import logging
 import numpy as np

 from . import constants
@@ -45,7 +46,7 @@ def get_precursor_charge_onehot(charges):
     return array

-def get_sequence_integer(sequences):
+def get_sequence_integer(sequences):
     array = np.zeros([len(sequences), MAX_SEQUENCE], dtype=int)
     for i, sequence in enumerate(sequences):
         for j, s in enumerate(utils.peptide_parser(sequence)):
@@ -89,6 +90,14 @@ def csv(df):
     assert "modified_sequence" in df.columns
     assert "collision_energy" in df.columns
     assert "precursor_charge" in df.columns
+
+    n_all = df.shape[0]
+    df = df[df.modified_sequence.apply(is_supported_sequence)]
+    n_supported = df.shape[0]
+    if n_supported < n_all:
+        logging.warning(' {0} / {1} sequences were skipped as they '
+                        'contained unsupported amino acids'.format(n_all - n_supported, n_all))
+
     data = {
         "collision_energy_aligned_normed": get_numbers(df.collision_energy) / 100.0,
         "sequence_integer": get_sequence_integer(df.modified_sequence),
@@ -107,3 +116,12 @@ def csv(df):
     data["masses_pred"] = masses_pred

     return data
+
+
+def is_supported_sequence(seq: str) -> bool:
+    try:
+        _ = [ALPHABET[s] for s in utils.peptide_parser(seq)]
+        return True
+    except KeyError as e:
+        return False
+

PS Good work on developing Prosit! :smile:

tkschmidt commented 4 years ago

If you provide a PR, we are happy to merge it. Bonuspoints if you have a working test case :)

gessulat commented 4 years ago

Sorry to chip in. Originally, we designed the Prosit pipeline, so that it never changes any user input. We on purpose didn't skip sequences, because input and ouput files would have a different length. I believe that providing malformed input data should throw an error and not a warning. But maybe this premise changed in the Prosit pipeline.

@tkschmidt @LLautenbacher @fburic

fburic commented 4 years ago

I see, that makes sense. If the premise hasn't changed, then I'd only suggest making the error more explicit. In my case, it took me quite some time to figure out what was wrong (and I'm familiar with Python). Then I just decided it was more efficient (and acceptable) for me to fix it like above :)