suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
228 stars 50 forks source link

fusion strand #26

Closed kgaonkar6 closed 4 years ago

kgaonkar6 commented 5 years ago

Hello,

I am curious about the strand1 and strand2 columns in arriba output. I read in the documentation that the strand before "/" is the strand of the gene from the gtf and the second strand (that is after "/") is from the assembled fusion.

However I'm not sure how arriba assembles the fusion supporting reads to get the strand information from the STAR BAM file. Can you please explain?

Thanks!

suhrig commented 5 years ago

Assembly is not used to predict the transcribed strand (given after the slash). The transcribed strand is derived from either the strand of the supporting reads (if a stranded library was used) or from splice patterns of the supporting reads (if a non-stranded library was used).

Assembly is used to determine the transcript sequence around the breakpoints. Basically, Arriba performs a pileup of all the reads supporting a fusion and takes the most frequent base for each position.

Does this answer your question?

kgaonkar6 commented 5 years ago

Thank you that does help.

So to clarify in the draw_fusions.R the script reads the "after the slash" strand information to gather/re-arrange the gene structure (and overlapping domains) to plot the fusion? That visualization looks amazing so I'm trying to understand it better.

suhrig commented 5 years ago

draw_fusions.R could in theory use the strand information after the slash, but it does not need to, because Arriba arranges gene1 and gene2 to reflect the 5' and 3' end of the fusion. Transcription always starts in gene1 and ends in gene2 (with few exceptions where it's completely unclear). draw_fusions.R orients the fusion gene such that transcription goes from left to right. It does not need to evaluate the transcribed strand, because this information is implicit in the order of gene1 and gene2.

That visualization looks amazing so I'm trying to understand it better.

Thanks. Feel free to ask if you would like to customize some aspects. What may take an hour for you may be a matter of minutes for me. I can either send you a custom patch or - if the request is likely useful to many users - add a new feature.

kgaonkar6 commented 5 years ago

Oh great! I'm actually trying to visualize STAR-fusion calls the same way arriba visualizes them to have similar plots from both callers which is why I had specific questions as to how the draw_fusion.R script works. https://gist.github.com/kgaonkar6/d2efc3660d31c20217a9071c6244e70a is what I have now to format the files correctly which seems to work to create the similar plots. But if you can send it as a patch to the draw_fusion.R script itself that would be really appreciated. I guess others might find it useful as well..?

Ephedria commented 4 years ago

Hello @kgaonkar6

I'm trying to use your script to draw Fusions from StarFusion. It seem to work for some fusions but not all. Did you encounter this kind of behavior ?

Best,

kgaonkar6 commented 4 years ago

Hi @Ephedria , I was able to plot all the fusions from my StarFusion file. However, I did notice if the gene was not found in the gtf file provided it would print out "Error XYZ exon coordinate not found in gtf" on the pdf page. Also you might want to check the format/colnames of Starfusion, mine was output from STAR-Fusion 1.5.0

Did you get any errors?

Ephedria commented 4 years ago

Hello Krutika, I didn't get an error like you, mine stopped the process right away :

Warning message: In colnames(fusions)[colnames(fusions) %in% c("X.gene1", "strand1.gene.fusion.", : number of items to replace is not a multiple of replacement length Loading annotation Loading protein domains Drawing fusion #1: CALR:TANC2 Drawing fusion #2: SENP6:MYO6 Drawing fusion #3: IKZF2:ERBB4 Error in if ((codingLength1 == 0 || antisenseTranscription1) && (codingLength2 == : missing value where TRUE/FALSE needed Calls: drawProteinDomains Execution halted

I tried to look at the line that cause the bug, I didn't found anything relevant... My output is from a FusionInspector with the --examine_coding_effect which is the same as StarFusion.

I didn't thought of this possibility since your script worked fine. Could you send me an exemple of your StarFusion output so I may try to understand what cause the bug ?

Best,

kgaonkar6 commented 4 years ago

Sorry for the late reply, here you go: testStarFusion.txt

Ephedria commented 4 years ago

Thanks for the link, I will try to found out the problem.

suhrig commented 4 years ago

I also apologize for the lack of responses from my side. I have been really busy lately.

In colnames(fusions)[colnames(fusions) %in% c("X.gene1", "strand1.gene.fusion.", :

This is likely the source of the error. I suspect, the column names of your file are not correct. You should check that they are identical to the ones that are generated by Arriba.

I plan on making the draw_fusions.R script compatible with STAR-Fusion and could send you the script. But I probably will not come around to doing this in the next 1-2 weeks.

Ephedria commented 4 years ago

Hello Sorry I forgot to answer ... The header doesn't exactly look like Arriba's but it work on some fusions and not on others fusions. I don't find out why ... I will love to see draw_fusions.R working with StarFusion. I hope this is not too much work for you ...

Best,

suhrig commented 4 years ago

I already have a prototype of draw_fusions.R that is compatible with STAR-Fusion. I'll push it to GitHub next week probably. Still need to do some testing. Stay tuned.

Ephedria commented 4 years ago

Ho cool I will be eager to test it !

suhrig commented 4 years ago

Hi Krutika,

I just uploaded an enhanced version of draw_fusions.R to the development branch of the repository that should also be compatible with STAR-Fusion output (both with and without FusionInspector extra columns). You can download it from here:

https://raw.githubusercontent.com/suhrig/arriba/develop/draw_fusions.R

Feel free to give it a try if you find the time.

Regards, Sebastian

Ephedria commented 4 years ago

Thanks for the hard work.

I was testing it with an input of FusionInspector but I didn't seem to work ... I had this error :

Loading annotation Warning message: In parseGtfAttribute("exon_number", exons) : Failed to parse 'exon_number' attribute of 558 GTF record(s). Loading protein domains

Can you send me the command used to create your output of StarFusion ? I will try to stick to it.

Best, Ephe

suhrig commented 4 years ago

Warning message: In parseGtfAttribute("exon_number", exons) :

This error is related to the GTF file and has nothing to do with the fusion file. What GTF are you using? Can you send me a link or upload it somewhere?

Moreover, this is only a warning and in view of the fact that only 558 records are affected (of probably tens of thousands), this should not even be a concern. Is a PDF file generated as output?

Ephedria commented 4 years ago

Hum quite strange, because I try with an ouput from Arriba and it worked well. I didn't change the gtf file. I use the : GRCh38_gencode_v31_CTAT_lib_Oct012019.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf

It does create a PDF file, but it's empty ...

suhrig commented 4 years ago

The error about parsing the GTF file should appear regardless of whether you use Arriba or STAR-Fusion. Can you confirm? Please send me the full call to draw_fusions.R for both Arriba and STAR-Fusion.

When I use CTAT annotation, the GTF parsing error is shown with output from both Arriba and STAR-Fusion. So at least the error is definitely not related to the fusions file. And in both cases a non-empty PDF file is produced. I'm suspecting an issue with the content of your STAR-Fusion file that the script cannot handle. Can you share the content? Or at least the header line and one fusion prediction? Thanks.

Ephedria commented 4 years ago

I did verified, you were right, I do have the same error when trying with an Arriba output...

Here is the output with Arriba output : Loading annotation Warning message: In parseGtfAttribute("exon_number", exons) : Failed to parse 'exon_number' attribute of 558 GTF record(s). Loading protein domains Drawing fusion #1: RNASEH2B-AS1:RNASEH2B Drawing fusion #2: AC004943.2:AC116667.2 ....

And here is the output with a FusionInspector output :

Loading annotation Warning message: In parseGtfAttribute("exon_number", exons) : Failed to parse 'exon_number' attribute of 558 GTF record(s). Loading protein domains /mnt/go-docker/cmd.sh: line 8: 36 Killed /scratch2/tmp/mneou/Scripts/draw_fusions.R --fusions="/scratch2/tmp/mneou/FusionInspector/MAP634_WTS/MAP634_WTS.FusionInspector.fusions.abridged.tsv.coding_effect" --alignments="/scratch2/tmp/mneou/Arriba/MAP634_WTS/STAR/MAP634_WTSAligned.sortedByCoord.out.bam" --output="/scratch2/tmp/mneou/FusionInspector/MAP634_WTS/MAP634_WTS_fusions_CR.pdf" --annotation=/data/annotations/Human/GRCh38/index/star-fusion/v1.8/GRCh38_gencode_v31_CTAT_lib_Oct012019.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf --cytobands=/database/cytobands_hg38_GRCh38_2018-02-23.tsv --proteinDomains=/database/protein_domains_hg38_GRCh38_2019-07-05.gff3

What annotation are you using ? I will try to use the same one, if it can prevent this warning ?

Sure here it is : https://github.com/Ephedria/FusionInspector/blob/master/Sample1_FusionInspector.tsv

Thanks again for the help.

Best

suhrig commented 4 years ago

/mnt/go-docker/cmd.sh: line 8: 36 Killed

This is the real issue. The script got killed by an external process. The most likely explanation is that it ran out of memory. Typically, the script needs < 2GB. Does the script run inside a job with an artificial memory limit? Another explanation is that your job has a time limit and was killed after reaching that limit. Either way, this is not a bug in the script, but an operating system issue.

What annotation are you using ? I will try to use the same one, if it can prevent this warning ?

I use GENCODE >= v28. The warning is harmless, though. You can continue using your GTF file if you want. All this means is that about 1% of the genes will lack exon labels when they are drawn, i.e., there is no number inside the exon. So it's only cosmetic.

Ephedria commented 4 years ago

Hoo my bad, you were right I did ran out of memory ... I did again by expending the memory allocatated it works fine.

I will try to test on different samples, but I pretty sure it will turn out well. Thanks again for this tool and the help.

Best,

suhrig commented 4 years ago

Hi Krutika,

Arriba version 2.0.0 has just been released. It natively supports creating visualizations of fusion predictions made by STAR-Fusion. So this issue is finally resolved. Let know if you still have questions. I am closing this thread for now.

Regards, Sebastian