Closed malachig closed 1 year ago
We're planning on testing https://github.com/murphycj/AGFusion for annotating output from fusion callers.
Fusion callers to consider: STAR-Fusion, Kallisto/Pizzly, FusionCatcher (https://github.com/ndaniel/fusioncatcher), Arriba (https://github.com/suhrig/arriba), nfcore/rnafusion (and ensemble approach, https://github.com/nf-core/rnafusion).
I wanted to have a try and using some novel splicing events identified stringtie that correlate with somatic mutations near splice-sites and have a preliminarily look to see if that there are possible neoantigens (using WES + RNA-seq). So was going to try ant "hot-wire" pVACfuse Putting the splice junction into bedpe format is not a problem.
Can you give any advice of using INTEGRATE-neo I have the HLA type so I could make up a a HLA_allelels.tzv myself and try command integrate-neo.py -t HLA_alleles.tsv -f fusions.bedpe -r ref.fa -g ref.genePhred -k Does this seem resonable? I don know the form of ref.genePred if there an example of that? Thanks in advance Paul
@pjleo I would suggest to post your question to the integrate NEO GitHub.
@zkidmore ran AGfusion on a test data set. Looking at the results (/gscmnt/gc2697/sclc/rna/agfusion/)
AGfusion creates a subdirectory for each fusion that each contain the following files:
<fusion>_cdna.fa
: a fasta file with one sequence for this fusion<fusion>.exons.txt
: a list of exons in the fusion<fusion>.fusion_transcripts.txt
: some more information on the two transcripts involved in the fusion<fusion>.protein_domains.txt
: not sureNone of these files can be used directly in pVACfuse. The <fusion>_cdna.fa
and <fusion>.fusion_transcripts.txt
files I think contain most of the information we need although I don't see any protein position information to show where in the protein sequence the fusion occurs. Maybe we can infer this somehow from the other information?
If someone with better understanding of fusions could chime in here, that would be great
@susannasiebert I think what you would actually care about is :
<fusion>.protein.fa
also theres an option in agfusion i didn't use but would perhaps be better if i did
-ms, --middlestar (Optional) Insert a * at the junction position for the cdna, cds, and protein sequences (default False).
You're right. <fusion>.protein.fa
is what we want. It doesn't get created for all fusion, I guess because some are not protein coding.
Yeah, the * is what we need to determine what the fusion point is.
<fusion>.fusion_transcripts.txt
has some additional information we'd want, I think.
~The only other thing I'm trying to figure out is how to tell if a fusion is inframe or frameshift.~ This is also in the protein.fa file
something to be aware of from the agFusion docs:
Annotation is by default done only for canonical gene isoforms, but there is the option to annotate all gene non-canonical isoform combinations.
I'm not sure if this is something we would want? @chrisamiller @malachig any thoughts
Yeah, that's the crux of what makes the problem difficult - multiple transcripts, possibly with different reading frames, different isoforms that are tissue (or even cancer!) specific, etc.
That all said, I think using the canonical isoform is fine for now. It'll be right a large percentage of the time and that's better than not having predictions at all!
It would be great if we could run it this way, just to see what the output would be like. I assume it would just output a bunch more directories for all of the other possibilities but it would be great to double check.
I'm about 90% toward support for AGfusion outputs. The last sticking point is how to parse the information for chr start stop from the AGfusion output files. The only file that appears to have information like this is the .exon.txt
file but I'm unsure how to correctly interpret the start and stop. Take for example this file:
5'_gene 3'_gene 5'_transcript 3'_transcript exon_gene_source exon_number exon_chr exon_start exon_end
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '5 gene 1 17 71871687 71871750
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '5 gene 2 17 71870553 71870664
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '5 gene 3 17 71865946 71866033
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '5 gene 4 17 71858621 71858688
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '3 gene 2 17 71185596 71185655
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '3 gene 3 17 71132390 71132470
RP11-465C12.1 CASC17 ENST00000637627 ENST00000569074 '3 gene 4 17 71097775 71099618
The positions decrease from exon to exon. This occurs in some cases but not all, depending, I presume, on how the strands exactly got fused. However, I'm not sure which start and stop we would want to use as the 5' and 3' fusion overall fusion start and stop. Would the 5' start and stop be 71871687 and 71858688? What about the 3'?
Hrmm - their docs are pretty sparse. That doesn't look like a fusion from those entries alone. They are probably spitting out all exons for plotting purposes. Do you have an example that includes both genes involved in the fusion?
@chrisamiller I'm attaching a zip archive of all the exon files from Zach's test run. exons.zip
I talked this situation through with @ahwagner and @zlskidmore yesterday. Here is an example of a fusion with two transcripts that occur on the - strand. The table in the bottom reflects how we're suggesting parsing the chr-start-stop information.
Is there any way to use arriba with pVACfuse? The AGfusion suggested hasn't been updated since 2019 and does not support arriba, to annotate it to be then compatible with pVACfuse.
Unfortunately, we will only be able to support inputs that have already been annotated. So far, AGfusion has been the best tool we've found for this but as you mentioned, it doesn't appear to be supported any longer. If you know of any other fusion annotation tools, we'd be happy to investigated them further.
Relevant to annotating arriba fusions (annoFuse): https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03922-7
is there any future plans to support annotations based on annoFuse for pVACfuse? The developer for agfusion just posted 15 days ago that it's no longer actively maintained.
Thanks for bringing this to our attention. We will further investigate annoFuse as a replacement for AGFusion.
@malachig according to https://arriba.readthedocs.io/en/latest/output-files/ Arriba already outputs the fusion peptide sequence so I don't know whether we really need AnnoFuse. Is there any additional functionality that AnnoFuse provides that we might want?
Discussing with Susanna, I made a transcript fusion analysis with HCC1395 by using Arriba. This analysis was done with STAR 2.7.10a and Arriba v2.3.0, recommended in their tutorial at https://arriba.readthedocs.io/en/latest/workflow/#demo-script.
For information, I summarized in detail at https://github.com/griffithlab/immuno-planning/issues/14.
Our current reliance on INTEGRATE (for fusion detection) and INTEGRATE-NEO for annotation of the fusion junction peptide is problematic.
It would be desirable if we could support fusion calls from other RNA fusion detection tools (e.g. STAR-fusion, pizzly, fusioncatcher, arriba, rnafusion (https://github.com/nf-core/rnafusion).
For some of these, support conversion to BEDPE?
Create or modify a tool that annotates a BEDPE to determine the relevant protein sequence.