suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
213 stars 50 forks source link

Failure to detect confirmed fusion #97

Closed maximus3219 closed 7 months ago

maximus3219 commented 3 years ago

Hi,

I have the following cases that have fusions confirmed by FISH and RT-PCR. However these fusions cannot be picked up by Arriba, not even in the discarded list. However, these fusion can be called by the Illumina "RNA Amplicon workflow" pipeline. I wonder what is the problem involved and how I could get these fusions.

E6226R: EZR-ROS1 Fusion ROS204: SLC34A2-ROS1 Fusion E6093R, E6099R: KIF5B-RET fusion

Here is the link to the google drive containing the FASTQ files for you to test. https://drive.google.com/drive/folders/1EktZe2QMemJT-OiGlhlqDF_CHpjDBnYJ?usp=sharing

Thank you.

Best Regards, Maximus

suhrig commented 3 years ago

Hi Maximus,

Thanks for providing the FastQs. That was very helpful in pinning down the problem. The sequences in your samples contain long stretches of adapter sequences. You should remove them before alignment, because STAR fails to align them otherwise. If you have a look at the STAR log file, you will see that it says something like % of reads unmapped: too short | 99%. This is because 99% of the reads are largely made up of adapter sequence rather than human sequence. I used Trimmomatic to remove them:

java -jar trimmomatic-0.39.jar PE -threads 4 -trimlog trim.log -summary trim.summary \
    E6226R_S16_L001_R1_001.fastq.gz E6226R_S16_L001_R2_001.fastq.gz \
    E6226R_trimmed_paired_1.fastq E6226R_trimmed_unpaired_1.fastq \
    E6226R_trimmed_paired_2.fastq E6226R_trimmed_unpaired_2.fastq \
    ILLUMINACLIP:/path/to/adapters/NexteraPE-PE.fa:2:30:10:2:true \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Make sure to substitute /path/to/adapters/NexteraPE-PE.fa with the actual path to the adapter sequences. You can then feed the files E6226R_trimmed_paired_1.fastq and E6226R_trimmed_paired_1.fastq to run_arriba.sh (after gzip-ing them). In my tests, all four fusions were detected this way.

Regards, Sebastian

maximus3219 commented 3 years ago

Hi Sebastian,

Thanks for finding out the root problems!

However, after I performed trimmomatic, one (E6093R) can detect the intended fusion on final output, but the other three (E6226R, E6099R, ROS204) can only find the fusions in the discarded files. Is that the same result that you get?

Thank you.

Best Regards, Maximus

suhrig commented 3 years ago

Hi Maximus,

I could rediscover the fusions in all four cases. So something must be suboptimal in your environment. Did you use the same trimmomatic command that I gave to you? When I ran it, about 75% of the reads were left after trimming. Can you confirm this? Are you using the recommended alignment parameters? What version of STAR do you use? I used 2.7.6a. What is your assembly/annotation? Mine is hs37d5 + RefSeq viral genomes and GENCODEv19.

Regards, Sebastian

maximus3219 commented 3 years ago

Hi Sebastian,

I use exactly the same trimmomatic command as you did, with the intended fusion being discarded in 3 cases. However, if I only remove adaptor sequence but NOT the low quality base by removing "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36", I can rediscover the fusions in all four cases.

I am using STAR version 2.7.7a. Gencode version 36 assembly /annotation.

Thank you.

Best Regards, Maximus

On Wed, 27 Jan 2021 at 01:31, suhrig notifications@github.com wrote:

Hi Maximus,

I could rediscover the fusions in all four cases. So something must be suboptimal in your environment. Did you use the same trimmomatic command that I gave to you? When I ran it, about 75% of the reads were left after trimming. Can you confirm this? Are you using the recommended alignment parameters https://arriba.readthedocs.io/en/latest/workflow/#demo-script? What version of STAR do you use? I used 2.7.6a. What is your assembly/annotation? Mine is hs37d5 + RefSeq viral genomes and GENCODEv19.

Regards, Sebastian

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-767702920, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN5NDG6DLSDG77ES52TS3335LANCNFSM4WSXG5YQ .

suhrig commented 3 years ago

Interesting. I will try with your STAR/Gencode version and see if I can reproduce the discrepancy due to different trimmomatic filters. I should find the time for this within the next days.

suhrig commented 3 years ago

When I use STAR 2.7.7a and Gencode 36 assembly/annotation, the fusion is missed only in one sample (ROS204). So something is still different in your environment. However, I don't think this matters too much, because during my tests I found a bug in STAR that has much bigger impact on the results than the trimming parameters: STAR occasionally assigns the supplementary alignment to the wrong read of the two paired-end reads. This is only triggered when the mates fully overlap, because when mates overlap, STAR first merges them to a virtual single-end read, aligns this merged read, and disentangles the merged read back to two paired-end reads. When paired-end reads fully overlap, the sequence of the merged read is identical to the sequences of both paired-end reads, such that STAR cannot unambiguously undo the merging. As a result, the supplementary alignment gets assigned randomly to mate1 or to mate2. Only one assignment is correct, however. Arriba ignores malformed SAM records (note the warning that Arriba emits), such that about half of the chimeric alignments are lost, which means that half of the evidence is lost. Of course, this is detrimental to sensitivity. I will report this issue to the developer of STAR or submit a pull request to fix this.

Until then, I recommend you do the following: Run Trimmomatic with the parameter keepBothReads set to false (it's the sixth parameter to ILLUMINCLIP):

java -jar trimmomatic-0.39.jar PE -threads 4 -trimlog trim.log -summary trim.summary \
    E6226R_S16_L001_R1_001.fastq.gz E6226R_S16_L001_R2_001.fastq.gz \
    E6226R_trimmed_paired_1.fastq E6226R_trimmed_unpaired_1.fastq \
    E6226R_trimmed_paired_2.fastq E6226R_trimmed_unpaired_2.fastq \
    ILLUMINACLIP:/path/to/adapters/NexteraPE-PE.fa:2:30:10:2:false

This instructs Trimmomatic to discard one of the mates if they fully overlap. Instead of being stored as paired-end reads in the files E6226R_trimmed_paired_1.fastq and E6226R_trimmed_paired_2.fastq, only one of them is stored - either in E6226R_trimmed_unpaired_1.fastq or in E6226R_trimmed_unpaired_2.fastq (depending on which mate Trimmomatic decides to keep). In my tests, about 95% of the reads were affected by this and only 5% were written to the paired files. If you want to keep things simple, you can ignore the 5% of paired-end reads that are left, since the vast majority of the information is contained in the unpaired files. So you can simply concatenate the unpaired files into one (single-end) file and align them using STAR in single-end mode. You should notice a substantial increase in the number of supporting reads for the expected fusions. In my tests there were 2-3 times as many as before (see columns split_reads1 and split_reads2). Optionally, if you want to make use of all the data, you can additionally align the 5% of pairend-end reads contained in the paired files using STAR in paired-end mode. You should then merge the single-end and paired-end BAMs into a single BAM file to be passed on to Arriba. Arriba can deal with mixed single-end and paired-end data. You should not pass the single-end and paired-end BAMs separately to Arriba, because the paired-end BAM contains hardly any information and processing this data alone is hardly meaningful.

It doesn't matter too much if you use the quality trimming parameters or not. Since you have made better experiences without them, it is probably better to drop them.

BTW, the reason that your reads contain lots of adapter sequences is the same reason why the paired-end reads fully overlap: The insert size of your libraries is smaller than the read length. This is not good and should be fixed if possible. You should talk to the people who prepare the libraries and ask them to select for larger fragments. Otherwise, there is no point in doing paired-end sequencing. If mate1 and mate2 contain exactly the same information in 95% of the cases, then you might as well do single-end sequencing. If your libraries had a bigger insert size, you could eliminate both issues (adapter sequences and the STAR bug) at the same time. More importantly, in addition to that, you would improve the sensitivity of fusion detection, because at the moment half of the information in your samples is redundant (because mate1 equals mate2) and because having larger fragments means you have more coverage around the fusion junctions, which is a massive boost to sensitivity.

suhrig commented 3 years ago

Hi Maximus,

I had a look at the STAR code, but couldn't come up with a fix, because I don't understand the code well enough. The issue is also hard to reproduce. Depending on the order of the data that you feed into STAR and how some parameters are set and even how you compile the code, the same read may be aligned correctly or incorrectly. So I cannot just send the developer of STAR an example read that he can use for testing. You need a couple of thousand reads to find some that trigger the bug and it's not predictable which ones will trigger the bug. Some may trigger it in my environment, but that does not mean that the same ones will trigger the bug in the environment of the developer of STAR. I could come up with a method to auto-generate test data, but it would be much less work if I could just forward your existing data to the developer of STAR. Would you be okay with me forwarding him the link to your samples?

Regards, Sebastian

suhrig commented 3 years ago

Nevermind. I found an easy way to generate test data myself.

maximus3219 commented 3 years ago

Hi Sebastian,

Thanks for the detailed investigation and finding the bug in STAR that causes the problem.

Feel free to share my fastq files for further testings.

Best Regards, Maximus


From: suhrig notifications@github.com Sent: Sunday, January 31, 2021 11:44:27 PM To: suhrig/arriba arriba@noreply.github.com Cc: maximus3219 maximus3219@gmail.com; Author author@noreply.github.com Subject: Re: [suhrig/arriba] Failure to detect confirmed fusion (#97)

Nevermind. I found an easy way to generate test data myself.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/suhrig/arriba/issues/97#issuecomment-770401983, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ASNFRN6AOIJSNSFDBUZB523S4V3FXANCNFSM4WSXG5YQ.

suhrig commented 3 years ago

Okay, thanks. I'm closing this issue, since all that can be done from my side has been said. You can follow the bug report in the STAR issue tracker here if you want: https://github.com/alexdobin/STAR/issues/1135

suhrig commented 2 years ago

Hi Maximus,

I just want to let you know that I have submitted a fix for this issue to the STAR code repository at end of last year. As of STAR version 2.7.10a, which has just been released, fully overlapping reads no longer produce malformed alignments.

Trimming of reads with high adapter content is still recommended as explained. But the workaround described above it is no longer needed if you use the new STAR version. It is no longer necessary to align libraries with small insert size as a combination of single-end reads and paired-end reads.

Regards, Sebastian

maximus3219 commented 2 years ago

Hi Sebastian,

Thanks for the information.

I have been working with RNA-seq on clinical samples using Illumina Trusight Pan-cancer panel targeting 1385 genes. Recently I come across with a fusion call of MTTL10-GLUD1 with high confidence (similar situation occurs in STAR-Fusion) (see attached fusion call file) But both these genes are NOT among the 1385 targeted genes, so probably represent false positives. I am using STAR version 2.7.7a, Arriba 2.1.0 and STAR-fusion 1.9. Another colleague using a much older pipeline (STAR v 2.4.0d, STAR Fusion v.0.5.4) does not call this fusion.

I am not sure what the problem is. Is there an option to call only those fusion involving the targeted genes?

Thank you.

Best Regards, Maximus

On Sat, 15 Jan 2022 at 22:27, suhrig @.***> wrote:

Hi Maximus,

I just want to let you know that I have submitted a fix for this issue to the STAR code repository at end of last year. As of STAR version 2.7.10a, which has just been released, fully overlapping reads no longer produce malformed alignments.

Trimming of reads with high adapter content is still recommended as explained. But the workaround described above it is no longer needed if you use the new STAR version. It is no longer necessary to align libraries with small insert size as a combination of single-end reads and paired-end reads.

Regards, Sebastian

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1013691154, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN6SEIJR5DHCC2FIY5DUWF745ANCNFSM4WSXG5YQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 2 years ago

Hi Maximus,

I come across with a fusion call of MTTL10-GLUD1

I guess you meant to write MLLT10, which is part of the Illumina Trusight panel. Is this a leukemic sample? MLLT10 fusions are common alterations in leukemia. Nevertheless, the fusion partner GLUD1 is not common, so it could still be a false positive. If it's not a leukemic sample, then it is most likely a false positive, because MLLT10 fusions do not occur in other types of cancer as far as I know.

Is there an option to call only those fusion involving the targeted genes?

There is no such option, but you can use the following awk one-liner for this purpose, assuming the file panel.txt contains the list of genes covered by the panel and fusions.tsv are the fusion calls by Arriba. Make sure the panel.txt file contains gene names recognized by whatever GTF file you used to run Arriba (parameter -g).

awk 'FILENAME~/panel.txt/{list[$0]} FILENAME!~/panel.txt/ && ($1 in list || $2 in list)' panel.txt fusions.tsv

I am not sure what the problem is.

Arriba was developed for unbiased RNA-Seq. Targeted sequencing produces very high coverage in a small set of regions, whereas Arriba's statistical model assumes coverage of all expressed genes. Thus, there is a mismatch between the number of fusion-supporting reads produced by panels and the number that Arriba expects and considers high enough to make a call. I recently wrote a section on pitfalls when using Arriba on panel data. You may find some useful tips there: https://arriba.readthedocs.io/en/latest/current-limitations/#targeted-sequencing

I don't have a clear explanation why the fusion is not called by earlier versions of STAR-Fusion. It is probably due to tweaking of filters. This does not mean, however, that older versions of STAR-Fusion are better suited for panels.

Regards, Sebastian

maximus3219 commented 2 years ago

Hi Sebastian,

I am wondering why the protein domains of GLI1 gene are not labelled in this case as attached, as well as other samples.

Thank you.

Best Regards, Maximus Yeung

On Fri, 28 Jan 2022 at 14:41, suhrig @.***> wrote:

Hi Maximus,

I come across with a fusion call of MTTL10-GLUD1

I guess you meant to write MLLT10, which is part of the Illumina Trusight panel. Is this a leukemic sample? MLLT10 fusions are common alterations in leukemia. Nevertheless, the fusion partner GLUD1 is not common, so it could still be a false positive. If it's not a leukemic sample, then it is most likely a false positive, because MLLT10 fusions do not occur in other types of cancer as far as I know.

Is there an option to call only those fusion involving the targeted genes?

There is no such option, but you can use the following awk one-liner for this purpose, assuming the file panel.txt contains the list of genes covered by the panel and fusions.tsv are the fusion calls by Arriba. Make sure the panel.txt file contains gene names recognized by whatever GTF file you used to run Arriba (parameter -g).

awk 'FILENAME~/panel.txt/{list[$0]} FILENAME!~/panel.txt/ && ($1 in list || $2 in list)' panel.txt fusions.tsv

I am not sure what the problem is.

Arriba was developed for unbiased RNA-Seq. Targeted sequencing produces very high coverage in a small set of regions, whereas Arriba's statistical model assumes coverage of all expressed genes. Thus, there is a mismatch between the number of fusion-supporting reads produced by panels and the number that Arriba expects and considers high enough to make a call. I recently wrote a section on pitfalls when using Arriba on panel data. You may find some useful tips there: https://arriba.readthedocs.io/en/latest/current-limitations/#targeted-sequencing

I don't have a clear explanation why the fusion is not called by earlier versions of STAR-Fusion. It is probably due to tweaking of filters. This does not mean, however, that older versions of STAR-Fusion are better suited for panels.

Regards, Sebastian

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1023930763, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN3QPFK6NWN4FCVVCJDUYI3AHANCNFSM4WSXG5YQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 2 years ago

Hi Maximus,

Arriba's protein domain annotation is based on Pfam, because Pfam is free to redistribute, in contrast to UniProt, for example. Apparently, Pfam does not annotate the GLI1 protein domains.

In order to contribute new annotation to the Pfam database, please take a look at the FAQ. Alternatively, you can build your own GFF3 file from a more comprehensive source of protein domain annotation and feed it to Arriba. But this is not going to be easy. As a quick fix, you can simply add your protein domain of interest to the GFF3 file provided with Arriba.

Regards, Sebastian

maximus3219 commented 9 months ago

Hi Sebestian,

I have got a case that is supposed to have GLI1 fusion.

When I perform RNA Trusight pan-cancer panel, both Arriba and STAR-fusion failed to detect any fusion related to GLI1. This is a hybrid-capture platform, so even if the fusion partner is not within the panel, the fusion transcripts should still be captured and detected.

Could you please help see what is the underlying cause?

I have attached fastq files for your reference.

Thank you.

Best Regards, Maximus 23BX16587_S3_L001_R1_001.fastq.gz https://drive.google.com/file/d/1yXt3U0TNc91g_WtnZUU6UA1hrjKcIUUd/view?usp=drive_web 23BX16587_S3_L001_R2_001.fastq.gz https://drive.google.com/file/d/1mYUEyY77-rbi4CejX4YFcQAc8aSZacM-/view?usp=drive_web

On Tue, 22 Mar 2022 at 01:10, suhrig @.***> wrote:

Hi Maximus,

Arriba's protein domain annotation is based on Pfam, because Pfam is free to redistribute, in contrast to UniProt, for example. Apparently, Pfam does not annotate the GLI1 protein domains.

In order to contribute new annotation to the Pfam database, please take a look at the FAQ http://pfam.xfam.org/help#tabview=tab4. Alternatively, you can build your own GFF3 file from a more comprehensive source of protein domain annotation and feed it to Arriba. But this is not going to be easy. As a quick fix, you can simply add your protein domain of interest to the GFF3 file provided with Arriba.

Regards, Sebastian

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1074173055, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRNYQCJDEHINELURUSG3VBCUPXANCNFSM4WSXG5YQ . You are receiving this because you authored the thread.Message ID: @.***>

maximus3219 commented 9 months ago

Hello Sebestian,

I have granted access for you to the fastq files.

Apart from this case, I have actually encountered case with well-known COL1A1::PDGFB fusion in DFSP, which is discovered by STAR-fusion but missed by Arriba, not sure the reason.

Thank you.

Best Regards, Maximus

On Thu, 28 Sept 2023 at 18:47, Maximus Yeung @.***> wrote:

Hi Sebestian,

I have got a case that is supposed to have GLI1 fusion.

When I perform RNA Trusight pan-cancer panel, both Arriba and STAR-fusion failed to detect any fusion related to GLI1. This is a hybrid-capture platform, so even if the fusion partner is not within the panel, the fusion transcripts should still be captured and detected.

Could you please help see what is the underlying cause?

I have attached fastq files for your reference.

Thank you.

Best Regards, Maximus 23BX16587_S3_L001_R1_001.fastq.gz https://drive.google.com/file/d/1yXt3U0TNc91g_WtnZUU6UA1hrjKcIUUd/view?usp=drive_web 23BX16587_S3_L001_R2_001.fastq.gz https://drive.google.com/file/d/1mYUEyY77-rbi4CejX4YFcQAc8aSZacM-/view?usp=drive_web

On Tue, 22 Mar 2022 at 01:10, suhrig @.***> wrote:

Hi Maximus,

Arriba's protein domain annotation is based on Pfam, because Pfam is free to redistribute, in contrast to UniProt, for example. Apparently, Pfam does not annotate the GLI1 protein domains.

In order to contribute new annotation to the Pfam database, please take a look at the FAQ http://pfam.xfam.org/help#tabview=tab4. Alternatively, you can build your own GFF3 file from a more comprehensive source of protein domain annotation and feed it to Arriba. But this is not going to be easy. As a quick fix, you can simply add your protein domain of interest to the GFF3 file provided with Arriba.

Regards, Sebastian

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1074173055, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRNYQCJDEHINELURUSG3VBCUPXANCNFSM4WSXG5YQ . You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 9 months ago

I don't see strong evidence for a fusion involving GLI1 in the sample you sent me. Do you have any more info about the fusion, such as the fusion partner or (ideally) the breakpoint coordinates?

Regarding the COL1A1 fusion: Arriba filters fusions involving this gene quite stringently, because the gene is often expressed at a high level and gives rise to many artifacts. My guess is that's the reason. Do you see it in the discarded file? The filters column should tell you the reason. It makes sense to look at the discarded candidate with the highest number of supported reads.

maximus3219 commented 9 months ago

This sample shows GLI1 breakapart and amplification of 3' end by FISH breakapart probe. If we view the bam files in the IGV, we see there are reads in 3' portion but not the 5' portion, suggestive with translocation involving GLI1 gene. RNA-seq is done to detect the fusion partner, apparently it is not detected.... [image: image.png]

And yes, the COL1A1::PDGFB fusion is found in discarded file. But this is less of a problem, because it can be detected by STAR-fusion.

Thank you.

On Thu, 5 Oct 2023 at 02:12, suhrig @.***> wrote:

I don't see strong evidence for a fusion involving GLI1 in the sample you sent me. Do you have any more info about the fusion, such as the fusion partner or (ideally) the breakpoint coordinates?

Regarding the COL1A1 fusion: Arriba filters fusions involving this gene quite stringently, because the gene is often expressed at a high level and gives rise to many artifacts. My guess is that's the reason. Do you see it in the discarded file? The filters column should tell you the reason. It makes sense to look at the discarded candidate with the highest number of supported reads.

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1747404668, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN63S53KVOIU7B5RHWDX5WRJ7AVCNFSM4WSXG5Y2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZUG42DANBWGY4A . You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 8 months ago

Unfortunately, your image didn't make it to GitHub. When I look at the coverage of the data you sent me, I see coverage over the entire gene. IMG_20231005_221628

maximus3219 commented 8 months ago

I attach the image as an attachment here. Seems exon 1-3 has much reduced coverage compared with others (in both my images and yours). Anyways maybe this case has mainly GLI1 amplification and other genomic alterations, but not GLI1 translocation. Thanks.

Best Regards, Maximus

On Fri, 6 Oct 2023 at 04:58, suhrig @.***> wrote:

Unfortunately, your image didn't make it to GitHub. When I look at the coverage of the data you sent me, I see coverage over the entire gene. [image: IMG_20231005_221628] https://user-images.githubusercontent.com/10560654/273038713-97893a74-e101-4361-b828-a853cb70e445.jpg

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1749632734, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN7YS62QDAD5FALB6CTX54NPLAVCNFSM4WSXG5Y2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZUHE3DGMRXGM2A . You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 8 months ago

And yes, the COL1A1::PDGFB fusion is found in discarded file.

What does it say in the filters column for the candidate with the highest number of reads?

I attach the image as an attachment here. Seems exon 1-3 has much reduced coverage compared with others

It's not unusual to see fluctuating coverage in RNA-Seq data as in this example. It could be normal. Have you compared the coverage of this sample to other samples? Maybe it's common that the first few exons have less coverage. This could be caused by the library prep method and other factors.

maybe this case has mainly GLI1 amplification and other genomic alterations, but not GLI1 translocation.

There could be a enhancer hijacking translocation upstream of GLI1 placing the gene near another activating enhancer. This would explain the break apart results and would also result in upregulation of GLI1. You wouldn't see this as a fusion in RNA-Seq data, however, because the breakpoint is outside GLI1.

maximus3219 commented 7 months ago

Hi Sebestian,

I have got 2 cases of solitary fibrous tumour (SFT) confirmed by positive STAT6 by immunohistochemistry staining. They should have underlying NAB2::STAT6 fusion, which is not detectable by both Arriba and STAR-fusion. I cannot find this fusion in the discarded file also. Do you have any idea why this is the case?

Thank you.

Best Regards, Maximus

On Thu, 19 Oct 2023 at 06:16, suhrig @.***> wrote:

And yes, the COL1A1::PDGFB fusion is found in discarded file.

What does it say in the filters column for the candidate with the highest number of reads?

I attach the image as an attachment here. Seems exon 1-3 has much reduced coverage compared with others

It's not unusual to see fluctuating coverage in RNA-Seq data as in this example. It could be normal. Have you compared the coverage of this sample to other samples? Maybe it's common that the first few exons have less coverage. This could be caused by the library prep method and other factors.

maybe this case has mainly GLI1 amplification and other genomic alterations, but not GLI1 translocation.

There could be a enhancer hijacking translocation upstream of GLI1 placing the gene near another activating enhancer. This would explain the break apart results and would also result in upregulation of GLI1. You wouldn't see this as a fusion in RNA-Seq data, however, because the breakpoint is outside GLI1.

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1769482776, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN5VZK2OWP2UOVKC2CTYABILTAVCNFSM4WSXG5Y2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZWHE2DQMRXG43A . You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 7 months ago

NAB2-STAT6 fusions are usually very easy to detect. They tend to have a massive number of supporting reads. So it's strange they were picked up by neither tool. Possibly, there is something odd about the reads, which prevents proper alignment (maybe many reference mismatches or adapter sequences?).

  1. Have you checked the common breakpoints in IGV for evidence of fusions?

hg19 coordinates:

12:57486752 12:57502083
12:57487382 12:57492862
12:57487382 12:57493224
  1. In addition, you may want to grep the FastQ files for the junction sequences:
GAAACAAGAGGCAACCTCCA
CCTCTCGCAGCTGAACAGAT
CCTCTCGCAGGCTCTCCACA

If neither of those shows evidence, then have you ruled out the possibility of a sample swap (by checking germline SNPs or by checking if the expression profile of the tumor really matches that of a SFT)? Do you happen to have a matched DNA-Seq sample which you could inspect for structural variants near the typical fusion breakpoints above?

maximus3219 commented 7 months ago

The Uniquely mapped reads is 92%, so the reads should be mostly properly aligned. And I have tried preprocessing with trimming adaptor and the final results are the same.

The IGV image is as attached. May I know how I should check for evidence of fusion? The NAB2 and STAT6 are neighbouring genes.

Thanks.

On Sun, 3 Dec 2023 at 04:12, suhrig @.***> wrote:

NAB2-STAT6 fusions are usually very easy to detect. They tend to have a massive number of supporting reads. So it's strange they were picked up by neither tool. Possibly, there is something odd about the reads, which prevents proper alignment (maybe many reference mismatches or adapter sequences?).

  1. Have you checked the common breakpoints in IGV for evidence of fusions?

hg19 coordinates:

12:57486752 12:57502083 12:57487382 12:57492862 12:57487382 12:57493224

  1. In addition, you may want to grep the FastQ files for the junction sequences:

GAAACAAGAGGCAACCTCCA CCTCTCGCAGCTGAACAGAT CCTCTCGCAGGCTCTCCACA

If neither of those shows evidence, then have you ruled out the possibility of a sample swap (by checking germline SNPs or by checking if the expression profile of the tumor really matches that of a SFT)? Do you happen to have a matched DNA-Seq sample which you could inspect for structural variants near the typical fusion breakpoints above?

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/97#issuecomment-1837243310, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASNFRN633WCCKXVVI6K4CLTYHODRNAVCNFSM4WSXG5Y2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOBTG4ZDIMZTGEYA . You are receiving this because you authored the thread.Message ID: @.***>

suhrig commented 7 months ago

The screenshots didn't make it to GitHub, unfortunately. Can you post them directly to GitHub instead of replying by mail?

May I know how I should check for evidence of fusion?

There should be soft-clipped reads at the breakpoints.

maximus3219 commented 7 months ago

Attached here please find the screenshot of IGV....not sure if it has got any soft-clipped reads NAB2-STAT6

I can share the fastq files to have a look: https://drive.google.com/drive/folders/1bN3yUEdKNFji3G5VQvo-TgOEKxZHd-mW?usp=sharing

These 2 cases are from an older sample with poor RNA quality, not sure if that affects the results, but the alignment rates are quite good....

Thanks.

suhrig commented 7 months ago

I also cannot find evidence for the fusion. I agree the quality doesn't look too good. Lots of duplicates. The quality is probably the reason why the fusion is not detected.