openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

mhcflurry-predict errors if input CSV contains bad allele name #113

Closed julia326 closed 6 years ago

julia326 commented 7 years ago

This should be more error-tolerant, or have a more informative error message:

(smoking) julia@demeter-csmaz11-19:/data/mhcflurry$ mhcflurry-predict --models /data/mhcflurry/models_fold_0 --out /data/mhcflurry/predictions.fold_0.csv /data/mhcflurry/fold_0.test.csv
Using Theano backend.
Read input CSV with 48387 rows, columns are: allele, peptide, measurement_value, measurement_type, measurement_source, original_allele
Traceback (most recent call last):
  File "/home/julia/Envs/smoking/bin/mhcflurry-predict", line 11, in <module>
    load_entry_point('mhcflurry', 'console_scripts', 'mhcflurry-predict')()
  File "/home/julia/code/mhcflurry/mhcflurry/predict_command.py", line 209, in run
    throw=not args.no_throw)
  File "/home/julia/code/mhcflurry/mhcflurry/class1_affinity_prediction/class1_affinity_predictor.py", line 531, in predict_to_dataframe
    mhcnames.normalize_allele_name)
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/pandas/core/series.py", line 2313, in map
    new_values = map_f(values, arg)
  File "pandas/_libs/src/inference.pyx", line 1521, in pandas._libs.lib.map_infer
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/normalization.py", line 72, in normalize_allele_name
    parsed_alleles = parse_classi_or_classii_allele_name(raw_allele)
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/class2.py", line 51, in parse_classi_or_classii_allele_name
    parsed = parse_allele_name(name, species)
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/allele_name.py", line 73, in parse_allele_name
    "species in the name itself: %s, %s, %s" % (species_prefix, species_from_name, original))
ValueError: If a species is passed in, we better not have another species in the name itself

(In this case, the offending allele in the input turned out to be "HLA-SLA01"):

>>> import mhcnames
>>> mhcnames.normalize_allele_name("HLA-SLA01")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/normalization.py", line 72, in normalize_allele_name
    parsed_alleles = parse_classi_or_classii_allele_name(raw_allele)
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/class2.py", line 51, in parse_classi_or_classii_allele_name
    parsed = parse_allele_name(name, species)
  File "/home/julia/Envs/smoking/local/lib/python2.7/site-packages/mhcnames/allele_name.py", line 73, in parse_allele_name
    "species in the name itself: %s, %s, %s" % (species_prefix, species_from_name, original))
ValueError: If a species is passed in, we better not have another species in the name itself
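
One way to track down offending names like this before running a full prediction is to try the normalizer on each distinct value in the allele column and collect the failures. The following is a minimal sketch, not part of mhcflurry: `find_bad_alleles` and `toy_normalize` are hypothetical names, and `toy_normalize` is only a stand-in so the example runs without mhcnames installed. In practice one would presumably pass `mhcnames.normalize_allele_name` instead.

```python
import csv
import io


def find_bad_alleles(rows, normalize, column="allele"):
    """Return {allele: error text} for each distinct allele that fails to normalize."""
    bad = {}
    seen = set()
    for row in rows:
        name = row[column]
        if name in seen:
            continue  # each distinct allele only needs to be checked once
        seen.add(name)
        try:
            normalize(name)
        except Exception as exc:  # this thread shows ValueError and AlleleParseError
            bad[name] = "%s: %s" % (type(exc).__name__, exc)
    return bad


def toy_normalize(name):
    """Hypothetical stand-in for mhcnames.normalize_allele_name (not the real parser)."""
    if name.startswith("HLA-SLA"):
        raise ValueError("two species prefixes in %r" % name)
    return name


# Small in-memory CSV with the same "allele" column name as the input above.
sample_csv = io.StringIO(
    "allele,peptide\n"
    "HLA-A*02:01,SIINFEKL\n"
    "HLA-SLA01,SIINFEKL\n"
    "HLA-SLA01,AAAAAAAAA\n"
)
bad_alleles = find_bad_alleles(csv.DictReader(sample_csv), toy_normalize)
for allele, error in sorted(bad_alleles.items()):
    print("%s -> %s" % (allele, error))
```

Checking distinct values rather than every row keeps the scan cheap on inputs like the 48387-row file above, where the same allele repeats many times.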

timodonnell commented 7 years ago

I think the issue was that mhcnames 0.1.0 (version used to generate the training data a while ago) had this erroneous behavior:

In [2]: mhcnames.normalize_allele_name("SLA01")
Out[2]: 'HLA-SLA*01:01'

When this output is run through the current version of mhcnames we get the error you see.

It's not clear this error was really fixed though. The current master version of mhcnames gives an error instead of parsing it:

In [4]: mhcnames.normalize_allele_name("SLA-01")
---------------------------------------------------------------------------
AlleleParseError                          Traceback (most recent call last)
<ipython-input-4-39557fbce5ea> in <module>()
----> 1 mhcnames.normalize_allele_name("SLA-01")

/Users/tim/sinai/git/mhcnames/mhcnames/normalization.py in normalize_allele_name(raw_allele, omit_dra1)
     70     if raw_allele in _normalized_allele_cache[omit_dra1]:
     71         return _normalized_allele_cache[omit_dra1][raw_allele]
---> 72     parsed_alleles = parse_classi_or_classii_allele_name(raw_allele)
     73     species = parsed_alleles[0].species
     74     normalized_list = [species]

/Users/tim/sinai/git/mhcnames/mhcnames/class2.py in parse_classi_or_classii_allele_name(name)
     49             "Allele has too many parts: %s" % name)
     50     if len(parts) == 1:
---> 51         parsed = parse_allele_name(name, species)
     52         if parsed.species == "HLA" and parsed.gene.startswith("DRB"):
     53             alpha = AlleleName(

/Users/tim/sinai/git/mhcnames/mhcnames/allele_name.py in parse_allele_name(name, species_prefix)
    118         raise AlleleParseError("No MHC gene name given in %s" % original)
    119     if len(name) == 0:
--> 120         raise AlleleParseError("Malformed MHC type %s" % original)
    121
    122     gene = gene.upper()

AlleleParseError: Malformed MHC type 01

@iskandr is that desired behavior?

If not, we should probably open an mhcnames issue.

timodonnell commented 6 years ago

Closing this. SLA-01 is apparently a locus (like HLA-A), not really an allele, so I think it's not obviously incorrect for mhcnames to raise an error in this case.
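
If the parser is left to raise, the "more informative error message" half of the original request can still be handled by the caller. Below is a minimal sketch, assuming the caller wraps the normalizer and re-raises with the row number and raw value. `normalize_with_context` and `toy_normalize` are hypothetical names that exist in neither mhcflurry nor mhcnames; `toy_normalize` is a stand-in that rejects any name without a `*`, which is not how the real parser decides.

```python
def normalize_with_context(rows, normalize, column="allele"):
    """Normalize each row's allele; on failure, raise an error naming the row and value."""
    normalized = []
    for row_number, row in enumerate(rows, start=1):
        raw = row[column]
        try:
            normalized.append(normalize(raw))
        except Exception as exc:
            # Keep the parser's own message, but say which input caused it.
            raise ValueError(
                "Row %d: could not parse allele %r (%s: %s)"
                % (row_number, raw, type(exc).__name__, exc)
            )
    return normalized


def toy_normalize(name):
    """Hypothetical stand-in for mhcnames.normalize_allele_name (not the real parser)."""
    if "*" not in name:
        raise ValueError("Malformed MHC type %s" % name)
    return name


rows = [{"allele": "HLA-A*02:01"}, {"allele": "SLA-01"}]
try:
    normalize_with_context(rows, toy_normalize)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

The point of the wrapper is only that the user sees the row and the literal string from their CSV, rather than a message from deep inside the parser.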