piercelab / tcrmodel2

Templates containing X don't work with ANARCI #3

pwl closed this issue 1 year ago

pwl commented 1 year ago

When generating the tcr_seqs.json file, I ran into the following error:

Traceback (most recent call last):
  File "run_tcrmodel2.py", line 334, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_tcrmodel2.py", line 299, in main
    cdr3, seq=parse_tcr_seq.parse_anarci(anarci_out)
  File "/tcrmodel2/scripts/parse_tcr_seq.py", line 23, in parse_anarci
    num=int(fields[1])
ValueError: invalid literal for int() with base 10: 'Unknown'

This error was also mentioned in #2. It seems to be caused by one of the templates containing an X amino acid, which leads to ANARCI raising:

Error:  Unknown amino acid letter found in sequence: X

in which case parse_anarci apparently tries to read a residue number from that message and fails on the word Unknown.

In my case this was the 5xot_D template, but there are many more templates with X in data/databases/pdb_seqres.txt.
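To get a rough idea of which templates are affected, a scan along these lines might help. It is only a sketch: it assumes pdb_seqres.txt is FASTA-formatted, with header lines such as `>5xot_D mol:protein ...` followed by the sequence, and `find_entries_with_x` is a hypothetical helper name, not part of tcrmodel2.

```python
from typing import Iterable, List


def find_entries_with_x(lines: Iterable[str]) -> List[str]:
    """Return the IDs of FASTA entries whose sequence contains an 'X'.

    Assumes each record is a '>' header line followed by one or more
    sequence lines (an assumption about pdb_seqres.txt, not verified here).
    """
    hits: List[str] = []
    current_id = None
    has_x = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new record starts: flush the previous one if it had an X.
            if current_id is not None and has_x:
                hits.append(current_id)
            header = line[1:].split()
            # Keep only the first token of the header as the entry ID.
            current_id = header[0] if header else ""
            has_x = False
        elif line and current_id is not None:
            if "X" in line.upper():
                has_x = True
    # Flush the final record.
    if current_id is not None and has_x:
        hits.append(current_id)
    return hits
```

Passing an open file handle for data/databases/pdb_seqres.txt as `lines` should list the affected entry IDs. Note that this does not distinguish protein from nucleic-acid records, so the result may over-count.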

I'm not sure what to do about this. The error does not seem critical, as the structures have already been generated by this point. It could probably be handled by adding a special case in parse_anarci so that it returns an empty list when ANARCI reports this error.
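A minimal sketch of such a special case follows. The body of parse_anarci is not shown in this issue, so everything here is an assumption: `parse_numbering_lines` is a hypothetical stand-in for its parsing loop, and the field layout (position number in `fields[1]`, as the traceback suggests, and residue letter in the last field) is a guess.

```python
from typing import Iterable, List, Tuple


def parse_numbering_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Collect (position, residue) pairs, or [] if ANARCI reported an error.

    Hypothetical stand-in for the loop in parse_anarci; the field layout
    is an assumption, not taken from the tcrmodel2 source.
    """
    numbered: List[Tuple[int, str]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # ANARCI wrote an error message instead of a numbering: give up.
        if fields[0].startswith("Error"):
            return []
        # Skip comment lines and the '//' record terminator.
        if fields[0].startswith("#") or fields[0] == "//":
            continue
        if len(fields) < 3:
            continue
        try:
            num = int(fields[1])
        except ValueError:
            # Not a numbering line; skip it rather than crash.
            continue
        numbered.append((num, fields[-1]))
    return numbered
```

Since run_tcrmodel2.py unpacks two values (`cdr3, seq`) from parse_anarci, a real fix would also need to return placeholders of that shape when the list comes back empty.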

pwl commented 1 year ago

This seems to be an open issue with ANARCI.