Add Arriba/annoFuse support

malachig commented 5 years ago

Our current reliance on INTEGRATE (for fusion detection) and INTEGRATE-NEO for annotation of the fusion junction peptide is problematic.

It would be desirable if we could support fusion calls from other RNA fusion detection tools (e.g. STAR-fusion, pizzly, fusioncatcher, arriba, rnafusion (https://github.com/nf-core/rnafusion).

For some of these, support conversion to BEDPE?

Create or modify a tool that annotates a BEDPE to determine the relevant protein sequence.

susannasiebert commented 5 years ago

We're planning on testing https://github.com/murphycj/AGFusion for annotating output from fusion callers.

malachig commented 5 years ago

Fusion callers to consider: STAR-Fusion, Kallisto/Pizzly, FusionCatcher (https://github.com/ndaniel/fusioncatcher), Arriba (https://github.com/suhrig/arriba), nfcore/rnafusion (and ensemble approach, https://github.com/nf-core/rnafusion).

pjleo commented 5 years ago

I wanted to have a try and using some novel splicing events identified stringtie that correlate with somatic mutations near splice-sites and have a preliminarily look to see if that there are possible neoantigens (using WES + RNA-seq). So was going to try ant "hot-wire" pVACfuse Putting the splice junction into bedpe format is not a problem.

Can you give any advice of using INTEGRATE-neo I have the HLA type so I could make up a a HLA_allelels.tzv myself and try command integrate-neo.py -t HLA_alleles.tsv -f fusions.bedpe -r ref.fa -g ref.genePhred -k Does this seem resonable? I don know the form of ref.genePred if there an example of that? Thanks in advance Paul

susannasiebert commented 5 years ago

@pjleo I would suggest to post your question to the integrate NEO GitHub.

susannasiebert commented 5 years ago

@zkidmore ran AGfusion on a test data set. Looking at the results (/gscmnt/gc2697/sclc/rna/agfusion/)

AGfusion creates a subdirectory for each fusion that each contain the following files:

<fusion>_cdna.fa: a fasta file with one sequence for this fusion
<fusion>.exons.txt: a list of exons in the fusion
<fusion>.fusion_transcripts.txt: some more information on the two transcripts involved in the fusion
<fusion>.protein_domains.txt: not sure

None of these files can be used directly in pVACfuse. The <fusion>_cdna.fa and <fusion>.fusion_transcripts.txt files I think contain most of the information we need although I don't see any protein position information to show where in the protein sequence the fusion occurs. Maybe we can infer this somehow from the other information?

If someone with better understanding of fusions could chime in here, that would be great

zlskidmore commented 5 years ago

@susannasiebert I think what you would actually care about is :

<fusion>.protein.fa

also theres an option in agfusion i didn't use but would perhaps be better if i did

-ms, --middlestar (Optional) Insert a * at the junction position for the cdna, cds, and protein sequences (default False).

susannasiebert commented 5 years ago

You're right. <fusion>.protein.fa is what we want. It doesn't get created for all fusion, I guess because some are not protein coding.

Yeah, the * is what we need to determine what the fusion point is.

<fusion>.fusion_transcripts.txt has some additional information we'd want, I think.

~The only other thing I'm trying to figure out is how to tell if a fusion is inframe or frameshift.~ This is also in the protein.fa file

zlskidmore commented 5 years ago

something to be aware of from the agFusion docs:

Annotation is by default done only for canonical gene isoforms, but there is the option to annotate all gene non-canonical isoform combinations.

I'm not sure if this is something we would want? @chrisamiller @malachig any thoughts

chrisamiller commented 5 years ago

Yeah, that's the crux of what makes the problem difficult - multiple transcripts, possibly with different reading frames, different isoforms that are tissue (or even cancer!) specific, etc.

That all said, I think using the canonical isoform is fine for now. It'll be right a large percentage of the time and that's better than not having predictions at all!

susannasiebert commented 5 years ago

It would be great if we could run it this way, just to see what the output would be like. I assume it would just output a bunch more directories for all of the other possibilities but it would be great to double check.

susannasiebert commented 5 years ago

I'm about 90% toward support for AGfusion outputs. The last sticking point is how to parse the information for chr start stop from the AGfusion output files. The only file that appears to have information like this is the .exon.txt file but I'm unsure how to correctly interpret the start and stop. Take for example this file:

5'_gene 3'_gene 5'_transcript   3'_transcript   exon_gene_source    exon_number exon_chr    exon_start  exon_end
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '5 gene 1   17  71871687    71871750
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '5 gene 2   17  71870553    71870664
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '5 gene 3   17  71865946    71866033
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '5 gene 4   17  71858621    71858688
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '3 gene 2   17  71185596    71185655
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '3 gene 3   17  71132390    71132470
RP11-465C12.1   CASC17  ENST00000637627 ENST00000569074 '3 gene 4   17  71097775    71099618

The positions decrease from exon to exon. This occurs in some cases but not all, depending, I presume, on how the strands exactly got fused. However, I'm not sure which start and stop we would want to use as the 5' and 3' fusion overall fusion start and stop. Would the 5' start and stop be 71871687 and 71858688? What about the 3'?

chrisamiller commented 5 years ago

Hrmm - their docs are pretty sparse. That doesn't look like a fusion from those entries alone. They are probably spitting out all exons for plotting purposes. Do you have an example that includes both genes involved in the fusion?

susannasiebert commented 5 years ago

@chrisamiller I'm attaching a zip archive of all the exon files from Zach's test run. exons.zip

susannasiebert commented 5 years ago

I talked this situation through with @ahwagner and @zlskidmore yesterday. Here is an example of a fusion with two transcripts that occur on the - strand. The table in the bottom reflects how we're suggesting parsing the chr-start-stop information.

boyangzhao commented 2 years ago

Is there any way to use arriba with pVACfuse? The AGfusion suggested hasn't been updated since 2019 and does not support arriba, to annotate it to be then compatible with pVACfuse.

susannasiebert commented 2 years ago

Unfortunately, we will only be able to support inputs that have already been annotated. So far, AGfusion has been the best tool we've found for this but as you mentioned, it doesn't appear to be supported any longer. If you know of any other fusion annotation tools, we'd be happy to investigated them further.

malachig commented 2 years ago

Relevant to annotating arriba fusions (annoFuse): https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03922-7

boyangzhao commented 2 years ago

is there any future plans to support annotations based on annoFuse for pVACfuse? The developer for agfusion just posted 15 days ago that it's no longer actively maintained.

susannasiebert commented 2 years ago

Thanks for bringing this to our attention. We will further investigate annoFuse as a replacement for AGFusion.

susannasiebert commented 2 years ago

@malachig according to https://arriba.readthedocs.io/en/latest/output-files/ Arriba already outputs the fusion peptide sequence so I don't know whether we really need AnnoFuse. Is there any additional functionality that AnnoFuse provides that we might want?

gschang commented 2 years ago

Discussing with Susanna, I made a transcript fusion analysis with HCC1395 by using Arriba. This analysis was done with STAR 2.7.10a and Arriba v2.3.0, recommended in their tutorial at https://arriba.readthedocs.io/en/latest/workflow/#demo-script.

For information, I summarized in detail at https://github.com/griffithlab/immuno-planning/issues/14.

griffithlab / pVACtools

Add Arriba/annoFuse support #306