pVACfuse failing - raise InputError("Invalid Fasta format: No '>' found.")

yang-yangfeng commented 6 years ago

I'm noticing that the temp fasta file is empty - some issue producing that?

Command: bsub pvacfuse run --net-chop-method cterm --netmhc-stab --iedb-install-directory /gscmnt/gc2502/griffithlab/yafeng -e 8,9,10,11 /gscmnt/gc2142/griffithlab/yafeng/cwl_toil_runs/unstranded/paper_fusions/fusion_antigens_out/fusions.bedpe.annot sample HLA-A*29:02,HLA-A*29:02,HLA-B*08:01,HLA-B*45:01,HLA-C*07:01,HLA-C*06:02 NNalign NetMHC NetMHCIIpan NetMHCcons NetMHCpan PickPocket SMM SMMPMBEC SMMalign hcc1395_fuse_paper

Output Directory: /gscmnt/gc2142/griffithlab/yafeng/cwl_toil_runs/pvacseq_output/hcc1395_fuse_paper/

Error:


Converting .bedpe to TSV
Completed
Splitting TSV into smaller chunks
Splitting TSV into smaller chunks - Entries 1-11
Completed
Generating Variant Peptide FASTA and Key Files
Generating Variant Peptide FASTA and Key Files - Entries 1-22
Wildtype sequence length is shorter than desired peptide sequence length at position (1, -1, 167413114). Using wildtype sequence length (1) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (3, -1, 186787584). Using wildtype sequence length (7) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (3, -1, 186787584). Using wildtype sequence length (7) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (1, -1, 155717128). Using wildtype sequence length (8) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (20, -1, 45897040). Using wildtype sequence length (5) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (20, -1, 45897040). Using wildtype sequence length (5) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (16, -1, 67865229). Using wildtype sequence length (20) instead.
Wildtype sequence length is shorter than desired peptide sequence length at position (5, -1, 134910466). Using wildtype sequence length (9) instead.
Completed
Processing entries for Allele HLA-A*29:02 and Epitope Length 8 - Entries 1-22
Running IEDB on Allele HLA-A*29:02 and Epitope Length 8 with Method NetMHC - Entries 1-22
Traceback (most recent call last):
  File "/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/predict_binding.py", line 383, in <module>
    Prediction().main()
  File "/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/predict_binding.py", line 375, in main
    elif (len(args)  == 4):                            self.commandline_input(args)  # args=[method, mhc, length, fname]
  File "/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/predict_binding.py", line 96, in commandline_input
    proteins = self.read_protein(fname)
  File "/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/predict_binding.py", line 57, in read_protein
    protein.extract_fasta(f.read())
  File "/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/util.py", line 58, in extract_fasta
    raise InputError("Invalid Fasta format: No '>' found.")
util.InputError: Invalid Fasta format: No '>' found.
Traceback (most recent call last):
  File "/gscuser/yafeng/miniconda2/envs/python3/bin/pvacfuse", line 11, in <module>
    load_entry_point('pvacseq', 'console_scripts', 'pvacfuse')()
  File "/gscmnt/gc2142/griffithlab/yafeng/pVACtools/tools/pvacfuse/main.py", line 49, in main
    args[0].func.main(args[1])
  File "/gscmnt/gc2142/griffithlab/yafeng/pVACtools/tools/pvacfuse/run.py", line 99, in main
    pipeline.execute()
  File "/gscmnt/gc2142/griffithlab/yafeng/pVACtools/lib/pipeline.py", line 338, in execute
    split_parsed_output_files = self.call_iedb_and_parse_outputs(chunks)
  File "/gscmnt/gc2142/griffithlab/yafeng/pVACtools/lib/pipeline.py", line 468, in call_iedb_and_parse_outputs
    '-e', self.iedb_executable,
  File "/gscmnt/gc2142/griffithlab/yafeng/pVACtools/lib/call_iedb.py", line 56, in main
    response = run(prediction_class_object.iedb_executable_params(args), stdout=PIPE, check=True)
  File "/gscuser/yafeng/miniconda2/envs/python3/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['python2.7', '/gscmnt/gc2502/griffithlab/yafeng/mhc_i/src/predict_binding.py', 'ann', 'HLA-A*29:02', '8', '/gscmnt/gc2142/griffithlab/yafeng/cwl_toil_runs/pvacseq_output/hcc1395_fuse_paper/hcc1395_fuse_paper/MHC_Class_I/tmp/sample_21.fa.split_1-22']' returned non-zero exit status 1```

susannasiebert commented 6 years ago

In your example input, all of the peptide sequences end in X, which results in them getting filtered out (we filter them because IEDB fails on this amino acid). We probably want to just delete the X (stop codon) amino acid from the sequence (if it's the last amino acid). @chrisamiller @jhundal can you confirm that this is the correct course of action?

We probably also want to handle cases where the fasta is empty (e.g., none of the fusions have a predicted peptide sequence).

chrisamiller commented 6 years ago

Yeah, if X is the stop, it should just be stripped and the remaining peptide fed forward. If stripping the X results in a peptide that is too short (less than the minimum epitope size to check), then it should be dropped.

susannasiebert commented 6 years ago

Another thing that is weird with the input file is that the fusion position sometimes is past the length of the predicted peptide sequence. I'm not sure what is going on with that. We should check with Jin why that is. @yang-yangfeng can you ask about that?

yang-yangfeng commented 6 years ago

I think the thing is that integrate tries to give the coordinates of the genomic fusion event but then integrate-neo tries to figure out the resulting peptide sequences that actually could be translated, which may not extend to the end of the 3p transcript?

susannasiebert commented 6 years ago

It was my understanding, that the position is supposed to be in reference to the fusion peptide sequence, not the 3p transcript

susannasiebert commented 6 years ago

Edit: I think I understand what you are trying to say. That seems weird to me still, but plausible. Right now those entries would end up getting skipped, which seems correct, right?

yang-yangfeng commented 6 years ago

Sorry, I think I explained that poorly/misunderstood what you were referring to. So I had thought, perhaps foolishly, that the fusion position was telling us something about where in the fusion junction the novel peptide starts? That doesn't totally make sense to me though - not sure exactly where neo starts counting - and also wouldn't only the 3p portion (plus maybe the last amino acid in the 5p portion) be novel? But it also doesn't totally make sense to me how it would work if the fusion position is describing a position within the fusion peptide - like where in the peptide the fusion event occurs. If this is the case, how is integrate-neo choosing where to start the peptide sequence it annotates with?

Here's some text from the integrate-neo paper that seems relevant, but also doesn't make it clear to me exactly what his happening. @chrisamiller might have a better understanding of how integrate-neo actually determines the fusion peptide.

This step annotates the gene fusion predictions with information such as gene fusion exonic boundaries, open reading frames (ORF), and the predicted peptide at the fusion junction. Each codon within the 5′ gene partner is inferred according to the starting position of the 5′ ORF. The amino acids spanning the fusion junction are determined by the codons that result from merging the sequences of both the 5′ and 3′ gene partners. The 3′ reading frames, which may have shifted, are then calculated for the remaining portion of the gene fusion transcript downstream of the fusion junction until a stop codon is encountered. These annotations are appended as user defined columns to the BEDPE file. Gene fusions that do not produce a predicted fusion peptide are subsequently filtered.

yang-yangfeng commented 6 years ago

Resolved the stop codon issue with #51.

The fusion position being past the end of the peptide was an issue with the reference files I used to run integrate-neo/bedpeAnnotator and not pVACfuse, so closing.

griffithlab / pVACtools

pVACfuse failing - raise InputError("Invalid Fasta format: No '>' found.") #49