suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
226 stars 50 forks source link

Arriba for plate-based scRNA-seq #206

Open Silvia-Bio opened 1 year ago

Silvia-Bio commented 1 year ago

Hiya,

Can Arriba be used to detect fusions in single-cell RNAseq data (e.g. Smart-seq2). I expect to see a greater number of false positives compared to bulk RNAseq, and with a lower number of reads supporting the fusions. I'm wondering what parameters should be optimized for scRNAseq data, and how the list of candidate fusions should be further filtered.

Thank you,

S.

suhrig commented 1 year ago

Hi Silvia,

I haven't run Arriba run on SMART-Seq2 data. I will give this a try and report back. Arriba automatically adjusts to the sequencing depth and has quite sensitive filters in general, so you might be able to run it with the default parameters.

One important note upfront: If your library has UMIs, you should use the flag -u (see manual) and perform duplicate marking before running Arriba, because Arriba is not aware of UMIs.

Regards, Sebastian

Silvia-Bio commented 1 year ago

Hi Sebastian,

Thanks so much for getting back to me!

I went ahead and tried Arriba on my Smart-seq2 data (no UMIs), using the default settings. As suggested in the Arriba manual, I undertook further filtering to refine the results. Specifically, I:

I also analysed matched bulk RNA-seq data using Arriba. I was quite encouraged to see significant overlap in the fusion findings between the single-cell and bulk RNA-seq datasets. However, two observations caught my attention:

a) There were quite a lot of novel transcripts forming fusion genes in the single-cell data. b) This one gene, TMSB4X (located on chrX), kept showing up fused with other genes.

Interestingly, these particular fusions involving novel transcripts or TMSB4X were absent from the bulk RNA-seq data. What's more, the TMSB4X fusions are shared across the three different experimental conditions I have. However, since they are absent in the bulk data, and these fusions haven't been reported in the literature or in fusion databases, I'm kind of scratching my head trying to figure out if these fusions are real or artifacts caused by the Smart-seq2 protocol. Do you have any idea why this might have happened? Any insights would be awesome!

If you have a chance to run Arriba on another Smart-seq2 dataset, I'd love to hear what you find.

Thanks again,

S.

Sogand65 commented 1 month ago

Hi Sebastian,

I am also interested in using Arriba to find fusions for my single cell dataset since I could not find any reliable option for single cell data right now and in some benchmarking articles (here and here) I have seen people used arriba on single cell data to benchmark with their method and it shows very faire and reliable outcomes. I am thinking of a workflow and am not sure if it makes sense to other people in the field as well, so I really appreciate your input and suggestions. so what I plan to do is: I have a single cell data sequenced with Illumina machine and aligned with cellranger algorithm, however, I am thinking of using arriba workflow by using STARsolo, extract the fusion gene transcript from the expected and high confidence fusion genes and add them to my reference genome and use it on cellranger and realign and build my expression matrix this time containing fusion genes as well. Does this workflow make sense to you? do you think the output of STARsolo can be feed into Arriba?

I appreciate your input and help, Sogi

Sogand65 commented 1 month ago

Hi Sebastian,

Thanks so much for getting back to me!

I went ahead and tried Arriba on my Smart-seq2 data (no UMIs), using the default settings. As suggested in the Arriba manual, I undertook further filtering to refine the results. Specifically, I:

  • Eliminated fusions with a low number of supporting reads, which were already classified as low confidence.
  • Discarded fusions involving identical genes (gene 1 and gene 2) and those containing ribosomal genes.
  • Removed fusions present in a small number of cells.

I also analysed matched bulk RNA-seq data using Arriba. I was quite encouraged to see significant overlap in the fusion findings between the single-cell and bulk RNA-seq datasets. However, two observations caught my attention:

a) There were quite a lot of novel transcripts forming fusion genes in the single-cell data. b) This one gene, TMSB4X (located on chrX), kept showing up fused with other genes.

Interestingly, these particular fusions involving novel transcripts or TMSB4X were absent from the bulk RNA-seq data. What's more, the TMSB4X fusions are shared across the three different experimental conditions I have. However, since they are absent in the bulk data, and these fusions haven't been reported in the literature or in fusion databases, I'm kind of scratching my head trying to figure out if these fusions are real or artifacts caused by the Smart-seq2 protocol. Do you have any idea why this might have happened? Any insights would be awesome!

If you have a chance to run Arriba on another Smart-seq2 dataset, I'd love to hear what you find.

Thanks again,

S.

Hi Silvia,

I have seen you used Arriba for the single cell data, I am also interested in using it since the single cell algorithms did not work previously like scFusion, ... can you please tell me about your workflow because I guess the STAR is meant to work with bulk RNA, did you used soloStar instead? can you help me a bit to understand the workflow a bit. I appreciate any input and help.

Best, Sogi

suhrig commented 1 month ago

Hi Sogi,

Do you have 10X Genomics scRNA-seq data or full-length transcripts? You're using Cell Ranger for alignment which suggests the former. The problem with 10X data is that the method only captures the 3' ends of transcripts. Most fusion breakpoints are closer to the 5' end, however. You will miss the vast majority of fusions with this technology, regardless of which tool you use for fusion calling. The evidence for fusions is simply not captured by the technology.

In contrast, if you're using full-length data, then you should be able to use the standard Arriba workflow as in the script run_arriba.sh. You may find a few more false positives, but it should work well enough.

Let me know if you need more detailed instructions.

Best regards, Sebastian

suhrig commented 1 month ago

If it's a 5' library, things are slight better, albeit not much I guess. While it is true that there are more fusion breakpoints near the 5' end, the main limitation of 10X remains: you only get coverage around the start/end of the transcript, but not the full transcript; unless you have long reads, which I'm guessing you haven't.

As a quick check whether you can detect fusions from the data at all, you can run the following command to search the FastQ files for evidence of BCR-ABL1 fusions directly:

zgrep -P 'CAGAGTTCAAAAGCCCTTCA|TGAAGGGCTTTTGAACTCTG' *.fastq.gz

If this returns nothing, then I doubt there is any way to detect fusions at all. Note that there are several variants of BCR-ABL1 with different breakpoints. The above command searches for only one of them (and its reverse complement). You may want to search for others, too.

Frankly, I do not know if the alignments from STARsolo or Cell Ranger can be used for fusion detection. I'm pretty sure Cell Ranger's cannot, because as far as I know Cell Ranger does not make use of the STAR parameters for chimeric alignments. Perhaps, STARsolo could be run manually with the parameters enabled. Specifically, you will need to set the parameters listed here:

https://arriba.readthedocs.io/en/latest/workflow/#demo-script

Sogand65 commented 1 month ago

Hi Sebastian,

Sorry I accidentally deleted my previous message! Thanks a lot for the useful explanations and the very handy quick test, yes my ph+ AML sample returned reads that had the sequence so I guess this is promising. Well since I do not need the alignment, gene expression, and demultiplexing part I will use usual STAR and will use the out.bam on cause I only need the transcripts of the fusion genes that Arriba returns, if that makes sense? I will use this transcript and add them on the ref in cellranger so this way I will have the fusions and cellranger will provide demultiplexed gene expression matrices. So I was thinking not to use STARsolo and just use the STAR + Arriba, but am not really sure if this approach is correct?

Thanks so much for your helps and valuable tool you provided.

Best, Sogi

suhrig commented 1 month ago

Yes, it should be fine to use the regular STAR for fusion calling. Just make sure NOT to pass the FastQ containing the UMIs+barcodes to STAR. Only pass the transcript read, which would effectively be a regular single-end alignment workflow then.

In a more sophisticated version of this workflow, you could use the UMIs+barcodes for more accurate duplicate marking, which should boost sensitivity.

Sogand65 commented 1 month ago

Thanks! in the second part (sophisticated version) you mean using STARsolo? since UMIs are involved?

suhrig commented 1 month ago

What I mean by that is that the alignments in the BAM file need to be marked as duplicates (flag BAM_FDUP set) in accordance with them having the same mapping coordinate, UMI, and cell barcode. I'm not sure if STARsolo does this, since I haven't used it yet.

When duplicates are marked this way, you can disable duplicate detection by Arriba using the parameter -u.

This should be more accurate than having Arriba mark the duplicates, because Arriba only inspects the mapping coordinates. It ignores UMIs/cell barcodes.

Sogand65 commented 1 month ago

Hi Sebastian,

Thanks a lot for your explanations. I already tried regular STAR + Arriba and it gave me some fusions as expected and some new (that can be either novel or false positive) using only fastqs containing no barcodes (R2) as you suggested. However STAR returned 0 bam files for many of my samples (30/46) and I could not find the problem. I am going to try STARsolo with the arguments required for Arriba and will report the results here whether the SRARsolo + Arriba will work.

Thanks, Sogi

suhrig commented 1 month ago

However STAR returned 0 bam files for many of my samples

Probably, the reads need trimming of 10X adapters before alignment. Otherwise, STAR may discard them, because only part of the reads can be aligned. STARsolo will probably deal with this automatically, but I don't know if it is compatible with the chimeric alignment parameters.

Sogand65 commented 1 month ago

Yes probably is the case! Thanks for your thoughts on this, it is very helpful! I’ll probably check STRsolo and will report back how it works! Unfortunately I did not get any responses from STAR github regarding my questions.

best Sogi

Sogand65 commented 2 weeks ago

Hi Sebastian,

Thanks again for providing such a nice tool. I could successfully run STAR(regular) and Arriba on my scRNA-seq (from 10X) data and I got nice fusion genes that actually many of them were in agreement with the cytogenetic information of my samples. In the next step as we discussed I am interested in adding those fusion genes from Arriba output to my reference using cellranger make reference function so I can use the fusion genes in my downstream analysis and navigate them on cells. For that I need to build .fa and .gtf file. For .fa I will use the fusion transcript (mostly in case of Novel genes). However my question is regarding building the .gtf file, I need to decide on the strand direction for the fusion gene. For example if I have : gene1 gene2 +/- -/+ Do you know how I can decide on choosing only one strand direction for the fusion gene of gene1_gene2? cause I only can have one direction in that file. According to Arriba documentation gene1 is the 5' end and gene2 is 3', so I was thinking to choose always + as the fusion strand in .gtf, but I do not know if this is a right approach?

I very appreciate your input and suggestion on this.

Thanks for your help. Sogand

suhrig commented 2 weeks ago

Is your plan to append the sequences of the fusion genes as separate contigs to the FastA file and then define GTF records for these new contigs.

Sogand65 commented 2 weeks ago

yes I wanna generate one .fa file that contains all selected fusions and create the .gtf file for them and will add them to my regular .fa and .gtf file that is being used to cellranger as described here: https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-mr#add-marker-gene

suhrig commented 2 weeks ago

In that case, it would be the easiest to run Arriba with -I. Arriba will then do it's best to report as much info about the fusion sequence as possible in the column fusion_transcript. You can copy and paste this sequence into the FastA file (more or less; you will need to remove special characters).

This approach should be much better than building the fusion sequence yourself from the breakpoints and strand info. Building the GTF file is also a no-brainer then: All you need to do is define the entire fusion transcript as one big exon and you can always use the + strand. The sequence represents the spliced transcript. So there is no need to annotate introns and you don't have to worry about which strand it is, because the sequence always represents the sense (+) strand.

Sogand65 commented 2 weeks ago

Hi Sebastian,

Thanks so much this was very helpful. I will rerun the Arriba with -I and use its transcript after some cleaning.

Just one last question, what did you mean by: "define the entire fusion transcript as one big exon" in the .gtf file? below is an example of a .gtf in cellranger, which col you suggest to add the fusion sequence/exon seq

chromosome/contig: I choose this as name of fusion (gene1_gene2) source:unknown Feature:exon start:1 end:922 source: . strand: + Frame: . attributes: gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";

Thanks a lot, Sogand

suhrig commented 2 weeks ago

what did you mean by: "define the entire fusion transcript as one big exon" in the .gtf file?

Let's say one of the fusion sequences is 1000 bases long. You first need to add the sequence as an additional contig to the FastA file. Then, you need to add a GTF record with start and end coordinates 1 and 1000. Actually, you need three: one for the gene, one for the transcript and one for the exon. All three will have the same coordinates (1-1000). Does that make sense?

Sogand65 commented 1 week ago

Dear Sebastian,

Yes I guess it makes a lot sense but in the cell ranger, it is noted that cell ranger count only uses rows with Features as exon and it won't read other lines, I should only specify exon. So according to what you explained and the cell ranger limitation as an example if I have: fasta as:

gene1_gene2 ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG... my .gtf would be: gene1_gene2 Arriba exon 1 1000 . + . gene_id "gene1_gene2"; transcript_id "gene1_gene2"; gene_name "gene1_gene2"; gene_biotype "protein_coding";

The 10X told me it is safe to choose gene_biotype "protein_coding" so it won't filter it or ... Thanks so much for your help.

Best, Sogand

suhrig commented 1 week ago

Yes, this looks good to me!

Sogand65 commented 1 week ago

Thanks so much Sebastian!

Best, Sogand