griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

How does pVACfuse derive the START/STOP of the fusion transcript that "resulted" in a peptide? #1033

Closed min-codes closed 7 months ago

min-codes commented 10 months ago

Hi there,

As stated in the title: does pVACfuse derive the START/STOP of the fusions from its own build of the reference genome, or is that information taken from the AGFusion output? I am asking because there is some discrepancy between the breakpoints output by pVACfuse and those output by fusion transcript detection tools (Arriba/FusionCatcher).

Example:

ACTB__NEMF
FusionCatcher: chr7:5527660:- chr14:49844131:-
pVACfuse: 7/14 5527660/49782083 5530709/49840866

susannasiebert commented 10 months ago

We parse this information from the AGFusion output. Specifically, the fusion 5' and 3' partners' start and stop positions are determined by parsing the AGFusion exons file and finding the first and last exons of each partner and then picking the exon_start and exon_end of those exons, respectively.
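The logic described above can be sketched roughly as follows. This is a minimal illustration, not the pVACfuse source: only `exon_start` and `exon_end` are named in this thread, so the `partner` column, the function name `partner_bounds`, and the toy coordinates are all hypothetical. The sketch also assumes exons are listed in file order for each partner.

```python
import csv
import io


def partner_bounds(exons_csv_text):
    """Return {partner: (start, stop)} from an exons table.

    start = exon_start of the first exon listed for the partner,
    stop  = exon_end of the last exon listed for the partner.
    The "partner" column is a hypothetical stand-in for however the
    real file distinguishes the 5' and 3' genes.
    """
    first, last = {}, {}
    for row in csv.DictReader(io.StringIO(exons_csv_text)):
        partner = row["partner"]
        first.setdefault(partner, row)  # keep the first exon seen
        last[partner] = row             # overwrite until the last exon
    return {
        p: (int(first[p]["exon_start"]), int(last[p]["exon_end"]))
        for p in first
    }


# Toy coordinates for illustration only (not taken from any real fusion).
toy_exons = """partner,exon_start,exon_end
5prime,100,200
5prime,300,400
3prime,1000,1100
3prime,1200,1300
"""

bounds = partner_bounds(toy_exons)
print(bounds)
```

With the toy table, the 5' partner spans 100 to 400 and the 3' partner spans 1000 to 1300.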

Would you be able to share your AGFusion output for the particular fusion with us? Particularly the exons file? I would like to have a look and see if there might be a bug in our logic.

min-codes commented 9 months ago

ACTB_NEMF.exons.csv ACTB_NEMF.fusion_transcripts.csv ACTB_NEMF.protein_domains.csv

Apologies for the late reply. Here are the files - pls let me know if there's anything odd.

Thank you very much!

min-codes commented 9 months ago

fusion

I made a diagram to check whether I understood your explanation of the breakpoints. Could you help me confirm that it is correct?

Also, since the maximum peptide length is 30-mer (equivalent to 90 bp), does pVACfuse only extract peptides that sit across the predicted fusion breakpoint, or would it output any peptide that sits anywhere on the entire fusion transcript?

susannasiebert commented 8 months ago

Thank you for getting those files to me. I apologize for not replying sooner.

After investigating the positions in the AGFusion exons file, I don't see position 49844131 as reported by FusionCatcher, so I'm not sure where that number comes from. The boundaries reported by pVACfuse seem to match what I would expect given the data in the exons file.

Your visual of how we determine the start and end positions is basically correct, with caveats. It all gets a bit more complicated because you have to take into account strands as well. For pVACfuse we consider the smallest genomic position as the start, even if, for example, the 5' partner is on the reverse strand and the smallest genomic position is the one at the breakpoint.
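As a small illustration of the strand point, here is a sketch using the chr7 coordinates from the ACTB example earlier in the thread (reverse strand, breakpoint at 5527660, other boundary at 5530709). The function name `genomic_start_stop` is hypothetical, and the sketch only shows the "smallest position is the start" convention, not the actual pVACfuse code.

```python
def genomic_start_stop(transcript_first_pos, transcript_last_pos):
    """Order two boundary positions by genomic coordinate.

    The inputs are given in transcript order (where the partner's
    sequence begins and ends in the fusion). On the reverse strand the
    transcript begins at the larger coordinate, so the reported start
    can be the breakpoint itself.
    """
    start = min(transcript_first_pos, transcript_last_pos)
    stop = max(transcript_first_pos, transcript_last_pos)
    return start, stop


# 5' partner on the reverse strand: the transcript runs from 5530709
# down to the breakpoint at 5527660.
start, stop = genomic_start_stop(5530709, 5527660)
print(start, stop)
```

Here the reported start (5527660) is the breakpoint position even though it is where the 5' partner's contribution ends in transcript terms.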

When determining neoantigen candidates, we only consider peptides overlapping the fusion position. A fusion may be either in-frame or a frameshift. If the fusion is a frameshift, we also consider all neoantigen candidates from the downstream (3') fusion partner since they could be novel.
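The selection rule above might look something like the following sketch. It assumes the fusion peptide is already available as a string and that `fusion_pos` is the 0-based index of the first residue contributed by the 3' partner; the function name `candidate_windows`, its parameters, and the toy sequence are hypothetical and not taken from pVACfuse.

```python
def candidate_windows(peptide, fusion_pos, length, frameshift=False):
    """Collect fixed-length windows that qualify as candidates.

    A window qualifies if it spans the junction (it contains residues
    from both sides of fusion_pos). For a frameshift fusion, windows
    lying entirely downstream of the junction qualify as well.
    """
    windows = []
    for start in range(len(peptide) - length + 1):
        end = start + length  # exclusive end index
        spans_junction = start < fusion_pos < end
        fully_downstream = start >= fusion_pos
        if spans_junction or (frameshift and fully_downstream):
            windows.append(peptide[start:end])
    return windows


# Toy sequence: five residues from the 5' partner, five from the 3' partner.
toy = "AAAAABBBBB"
in_frame = candidate_windows(toy, fusion_pos=5, length=4)
shifted = candidate_windows(toy, fusion_pos=5, length=4, frameshift=True)
print(in_frame)
print(shifted)
```

For the in-frame case only the three windows that straddle the junction are kept; with `frameshift=True` the two fully downstream windows are added.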

susannasiebert commented 7 months ago

Closing this issue due to inactivity.

min-codes commented 5 months ago

Thanks for the explanation, @susannasiebert. [image]

I have another question about how peptide sequences are derived. I've made another diagram above for easy reference. May I check whether I have the right understanding of peptide sequence prediction: is the fusion transcript CDS (from AGFusion) 'translated' by pVACfuse in the +1, +2, and +3 reading frames, with any peptide that crosses the breakpoint considered "one neoantigen" if it passes the <500 nM affinity threshold?

I am asking because I was trying to 'reproduce' the peptide sequence generated by pVACfuse by manually combining the CDS of the 5' transcript before the breakpoint with the CDS of the 3' transcript after the breakpoint, then feeding this sequence into a 6-frame translation tool like this one here. Surprisingly, I could not find the predicted peptide sequence this way. Could you help me understand this better?

Here is the pVACfuse output filtered.tsv for this fusion detected in my sample, as well as its raw Arriba output file, for your reference. pvf_sample_2_arr.filtered.xls arr_2.xls

susannasiebert commented 5 months ago

pVACtools doesn't translate the cDNA sequences into peptide sequences itself. It uses the ones calculated by the annotation tools (Arriba/AGFusion) and extracts neoantigen candidate windows overlapping the breakpoint. If the fusion is a frameshift fusion, we also make predictions for any of the downstream windows since they are potentially novel as well. I do see that the predicted peptide YLWENSWEM doesn't seem to match the Arriba peptide sequence VDNLQGDSGRGYYLEMLIGTPPQK|lqsp. Was that pVACfuse output produced from that input file?
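A quick way to run this kind of sanity check is sketched below. It assumes, based only on the sequence quoted above, that `|` marks the breakpoint in the annotated peptide and that letter case can be ignored for matching; the function name `locate_candidate` is hypothetical.

```python
def locate_candidate(candidate, annotated_peptide):
    """Find a candidate peptide inside an annotated fusion peptide.

    Returns None if the candidate does not occur, otherwise a tuple
    (start_index, overlaps_breakpoint). The "|" character is treated
    as the breakpoint marker and removed before matching.
    """
    junction = annotated_peptide.find("|")  # residues before the marker
    flat = annotated_peptide.replace("|", "").upper()
    start = flat.find(candidate.upper())
    if start == -1:
        return None
    end = start + len(candidate)
    overlaps = junction != -1 and start < junction < end
    return start, overlaps


annotated = "VDNLQGDSGRGYYLEMLIGTPPQK|lqsp"
missing = locate_candidate("YLWENSWEM", annotated)
present = locate_candidate("TPPQKLQSP", annotated)
print(missing)
print(present)
```

The predicted peptide from the thread is not found in the quoted sequence, whereas a 9-mer taken across the `|` marker is found and flagged as overlapping the breakpoint.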