griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

arriba output containing ? in peptide sequence will trigger error in manufacturability calculations #1001

Closed boyangzhao closed 1 year ago

boyangzhao commented 1 year ago

Installation Type

Standalone

pVACtools Version / Docker Image

4.0.1

Python Version

3.8

Operating System

No response

Describe the bug

Arriba output may contain `?` in the `peptide_sequence` and `fusion_transcript` columns. When pVACfuse runs the manufacturability calculations on such a sequence, they fail with `KeyError: '?'` because `?` is not an amino acid in the hydropathy dict. See the error message in the log output below.
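The failure mode can be sketched in a few lines. This is a minimal illustration, not vaxrank's code: `HYDROPATHY` is a stand-in Kyte-Doolittle table, and `gravy_score` and `unsupported_residues` are hypothetical helpers written to mirror the dictionary lookup seen in the traceback.

```python
# Stand-in Kyte-Doolittle hydropathy table (assumption: written out here
# for illustration, not copied from vaxrank).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_score(amino_acids):
    """Mean hydropathy of a peptide (hypothetical helper).

    Raises KeyError for any character that is not in the table,
    which is how a '?' residue surfaces as an error.
    """
    return sum(HYDROPATHY[aa] for aa in amino_acids) / len(amino_acids)


def unsupported_residues(amino_acids):
    """Return the sorted characters that the table cannot score."""
    return sorted(set(amino_acids) - set(HYDROPATHY))


if __name__ == "__main__":
    print(unsupported_residues("PP?P"))  # characters that would fail
    try:
        gravy_score("PP?P")
    except KeyError as err:
        print("KeyError:", err)
```

Checking a sequence with something like `unsupported_residues` before scoring is one possible guard. Whether the 4.0.2 fix works this way is not stated in this thread.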

How to reproduce this bug

I've provided an example below. I modified the content slightly, but the issue is the same: any sequence containing `?` will trigger the error.

Input files

#gene1  gene2   strand1(gene/fusion)    strand2(gene/fusion)    breakpoint1 breakpoint2 site1   site2   type    split_reads1    split_reads2    discordant_mates    coverage1   coverage2   confidence  reading_frame   tags    retained_protein_domains    closest_genomic_breakpoint1 closest_genomic_breakpoint2 gene_id1    gene_id2    transcript_id1  transcript_id2  direction1  direction2  filters fusion_transcript   peptide_sequence    read_identifiers
PTEN    ENSG00000200891(21548),MED6P1(31892)    +/+ ./+ chr10:87952259  chr10:88016243  CDS/splice-site intergenic  deletion/read-through   163 146 6   3691    3002    high    out-of-frame    .   C2_domain_of_PTEN_tumour-suppressor_protein(100%),Dual_specificity_phosphatase__catalytic_domain(100%)| .   .   ENSG00000171862 .   ENST00000371953 .   downstream  upstream    duplicates(384),low_entropy(4),mismappers(15),mismatches(6),multimappers(2) CCTCACCTCCATGCAGATGCAGCTGTACCTGCAGCAGCTGCAGAAGGTGCAGCCCCCTACGCCGCTACTCCCTTCCGTGAAGGTGCAGTCCCAGCCCCCcCCCCCCCccCCcCCCCCcCCCCcCCCC|CCC??CCCCC?CCCCCCCTGCCGCCCCCACCCCACCCCTCTGTGCAGCAGCAGCTGCAGCAGCAGCCGCCACCACCCCCACCACCCCAGCCCCAGCCTCCACCCCAGCAGCAGCATCAGCCCCCTCCACGGCCCGTGCACTTGCAGCCCATGCAGTTTTCCACCCA  LTSMQMQLYLQQLQKVQPPTPLLPSVKVQSQPPPPpPPPPpP|p?p?ppcrphptplcssscsssrhhphhpspslhpsssisplhgpctcspcsfpp  K00193:38:H3MYFBBXX:4:1101:18904:31572,K00193:38:H3MYFBBXX:4:1101:20386:3197
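As a possible workaround on affected versions, rows like this one could be flagged before the file is passed to pVACfuse. The sketch below assumes the Arriba file is tab-separated with a `peptide_sequence` column, as in the header above; `split_rows_by_unknown_residue` is a hypothetical helper and not part of pVACtools or Arriba.

```python
import csv
import io


def split_rows_by_unknown_residue(tsv_text, column="peptide_sequence"):
    """Split Arriba rows into (clean, flagged) lists.

    A row is flagged when the given column contains '?'.
    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    clean, flagged = [], []
    for row in reader:
        value = row.get(column) or ""
        (flagged if "?" in value else clean).append(row)
    return clean, flagged


if __name__ == "__main__":
    # Tiny made-up table with only three of the Arriba columns.
    example = (
        "#gene1\tgene2\tpeptide_sequence\n"
        "PTEN\tMED6P1\tLTSMQ|p?p\n"
        "GENEA\tGENEB\tMKV|lls\n"
    )
    clean, flagged = split_rows_by_unknown_residue(example)
    print(len(clean), "clean row(s);", len(flagged), "flagged row(s)")
```

Whether to drop flagged rows or handle them some other way is a judgment call that depends on the analysis; the sketch only separates them.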

Log output

Calculating Manufacturability Metrics
Traceback (most recent call last):
  File "/Users/boyangzhao/anaconda/envs/bio38/bin/pvacfuse", line 8, in <module>
    sys.exit(main())
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/pvactools/tools/pvacfuse/main.py", line 108, in main
    args[0].func.main(args[1])
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/pvactools/tools/pvacfuse/run.py", line 212, in main
    (input_file, per_epitope_output_dir) = generate_fasta(args, output_dir, epitope_length)
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/pvactools/tools/pvacfuse/run.py", line 82, in generate_fasta
    pvactools.tools.pvacfuse.generate_protein_fasta.main(params, save_tsv_file=True, starfusion_file=args.starfusion_file)
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/pvactools/tools/pvacfuse/generate_protein_fasta.py", line 132, in main
    CalculateManufacturability(args.output_file, manufacturability_file, 'fasta').execute()
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/pvactools/lib/calculate_manufacturability.py", line 47, in execute
    scores = ManufacturabilityScores.from_amino_acids(sequence)
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 148, in from_amino_acids
    return cls(*[fn(amino_acids) for fn in scoring_functions])
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 148, in <listcomp>
    return cls(*[fn(amino_acids) for fn in scoring_functions])
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 73, in max_7mer_gravy_score
    return max_kmer_gravy_score(amino_acids, 7)
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 67, in max_kmer_gravy_score
    return max(
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 68, in <genexpr>
    gravy_score(amino_acids[i:i + k])
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 56, in gravy_score
    total = sum(
  File "/Users/boyangzhao/anaconda/envs/bio38/lib/python3.8/site-packages/vaxrank/manufacturability.py", line 57, in <genexpr>
    hydropathy_dict[amino_acid] for amino_acid in amino_acids)
KeyError: '?'

Output files

No response

susannasiebert commented 1 year ago

Thank you for reporting this error. It should be fixed in version 4.0.2. I'm closing this issue but please feel free to reopen it, should you still run into problems.