suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
226 stars 50 forks source link

SAM records were malformed and ignored #133

Closed MazlinaIsmail closed 2 years ago

MazlinaIsmail commented 3 years ago

Hello,

I have been getting a couple of warnings when I run the 'STAR in conjunction with Arriba' method. I would like to understand a bit better how much they affect the fusions output file.

Below is an example:

WARNING: 14146239 SAM records were malformed and ignored WARNING: some fusions were subsampled, because they have more than 300 supporting reads

Not sure if helpful, but here are some additional info about the input file. There is ~125M reads and when I run STAR separately (for other rnaseq analysis), I get 85% uniquely mapped reads.

I guess the second warning is nothing to be too concerned about (?). With the first, I guess it is the same underlying issue in that my input file is quite big, and so the second warning is simply a case of it being too may alignment records? My log files are indicating that Arriba managed to process them (it gets to Done in the log file, and I get the fusions and discarded fusions output files) but I was just wondering if the warnings may indicate some other issues.

Thank you.

Best, Mazlina

suhrig commented 3 years ago

Hi Mazlina,

The second warning (subsampling of fusions) is harmless. It just means that there are more supporting reads for a fusion than necessary. Collecting more reads would not increase the confidence any further. Summarizing the reads for each fusion is very resource-consuming (especially in terms of memory), so subsampling is performed. You can increase the subsampling threshold using the parameter -U if you want. But there is no point in doing so other than the supporting read count becoming more accurate for highly expressed fusions. There is a risk that your machine runs out of memory under some conditions, though, in particular when you process multiple myeloma samples.

The first warning has nothing to do with the size of the sample. I have processed samples with 1 billion reads successfully. It is a bit of a concern. At the same time, it is normal. It depends on the severity whether you should be worried about this. 14 million reads looks serious. You have to put this number in relation to the number of chimeric reads that Arriba reports. Look at the line Reading chimeric alignments from /path/to/file.bam (total=???). The number of reads reported as malformed is the number of reads that could be used for fusion detection, but that are lost because they have an invalid SAM format. For example, if the number of malformed reads is 2 million and the number of successfully loaded chimeric reads (total) is 1 million, it means you are losing 2/3 of the information and thus 2/3 of the detection power. In practice, many of the malformed reads are actually useless (they are mostly clipped adapter sequences), so the problem is not actually that severe; you are not losing 2/3 of the useful reads. However, if a large fraction of the reads is affected by this problem, then fusion detection can suffer. The vast majority of the malformed alignments is caused by a bug in STAR, which affects paired-end alignments with small insert size such that the reads fully overlap. I suspect that your sample also has a small insert size.

I have opened a ticket with the developer a while ago. Hopefully, it will be fixed soon. You can have a look at https://github.com/alexdobin/STAR/issues/1135 to learn more about the nature of the bug & bug fixing progress. And issue #97 - notably this comment - tells you possible workarounds, namely: trimming adapters, and possibly aligning fully overlapping reads in single-end mode and non-overlapping reads in paired-end mode. I am not sure if you want to invest the effort of fully implementing this workaround, or if you simply want to wait for the bug fix to be released. Possibly, the fusion detection is not even impaired that much, despite quite a lot of reads going lost (because most of them are useless adapter sequences). I run Arriba regularly on samples with small insert size, and it still picks up the important fusions. If you notice that Arriba misses a fusion in your samples, you could try the workaround and see if the fusion is then detected.

Regards, Sebastian

MazlinaIsmail commented 3 years ago

Hi Sebastian,

thank you for the detailed explanation, much appreciated.

It is looking like a large fraction of the reads are affected. In the example above where I have 14M malformed, Arriba is reading in (from the 'Reading chimeric alignments' line) about 4.5M. And this is after adaptor trimming (I use Trimmomatic with keepBothReads set to true). I am wondering now if I have carried out this step correctly, will investigate.

I am seeing a fusion I expect to see, however I am also checking if that same fusion is missing in some of the samples. But with this issue I cannot say for sure that it is missing because it is really not there, or it is just not being picked up I think?

Due to space, I have not kept any of the aligned files so not 100% sure whether it is related to insert size. Will look into this to confirm if it is indeed the case. Hopefully the bug would have been fixed by then, otherwise I will try to implement the workaround.

Thank you so much.

Best, Mazlina

suhrig commented 3 years ago

But with this issue I cannot say for sure that it is missing because it is really not there, or it is just not being picked up I think?

True, without ground truth, you won't know for sure. The following may help: If the fusion has recurrent breakpoints, you can grep the FastQ files for the junction sequence (+/- 10bp around the breakpoint should suffice). Make sure to also grep for the reverse complement. This is a very sensitive and a very specific approach in my experience, and it only takes a few minutes per sample, since it requires no alignment. This approach only works if you already know what you are looking for, but it is one of the most reliable ones.

suhrig commented 2 years ago

Hi Mazlina, I'm closing this issue for now, since there is nothing more that I can do from my side. Feel free to reopen if you need help implementing the workaround or otherwise. Regards, Sebastian

MazlinaIsmail commented 2 years ago

Hi Sebastian,

Apologies for not following up, and thank you for doing so. Had temporarily parked the fusion investigation to the side, but am hoping the STAR bug to be fixed by the time I revisit it soon.

If I understood correctly what you meant by ‘recurring breakpoints’, I think it is a no in the case of the fusion I was looking for (the fusion may result from different combinations of exons). I have a subset of the data in which the fusion was not detected from RNAseq, but has stained positive for protein. And then there is another subset which I do not have staining data for, and it is these that I am trying to determine which category they would fall into.

Using the RNAseq data to strictly determine this may not be the best way because these are ’treated’ cases, and the genes involved would be affected by the treatment (range of down-regulation). For now I think I will ignore the group for which I do not have staining for.

Many thanks, Mazlina

On 12 Nov 2021, at 21:26, suhrig @.***> wrote:

Hi Mazlina, I'm closing this issue for now, since there is nothing more that I can do from my side. Feel free to reopen if you need help implementing the workaround or otherwise. Regards, Sebastian

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/133#issuecomment-967588611, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPLGDBYCVXDM2ISFTKD6H3ULWBAZANCNFSM5GN3NYHQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

suhrig commented 2 years ago

Hi Mazlina,

I just want to let you know that I have submitted a fix for this issue to the STAR code repository at end of last year. As of STAR version 2.7.10a, which has just been released, fully overlapping reads no longer produce malformed alignments.

Note that this fix does not solve all causes of malformed alignments. It only fixes malformed alignments arising from fully overlapping reads. Another very frequent cause of malformed alignments is not fixed by my patch: namely, when the insert size is smaller than the read length, such that paired-end mates extend past each other and contain adapter sequences. You can take a look at my pull request in the STAR repo for more details. However, these malformed alignments are of no concern. They are irrelevant to fusion detection. To my knowledge, STAR version 2.7.10a no longer produces any malformed alignments that impair fusion detection and Arriba's warnings about such alignments are harmless.

Regards, Sebastian

MazlinaIsmail commented 2 years ago

Hi Sebastian,

Thank you for this. I’ve re-ran the processing with updated Arriba and STAR, and did notice a slight improvement in the numbers. We are still a bit unsure for the samples which we do not have staining data for, so this is KIV-ed atm. I do not have anything useful to report back to you for now, but will update if any.

Really appreciate your time and effort with this, I am finding the tool easy to work with and manual easy to follow.

Many thanks, Mazlina

On 15 Jan 2022, at 14:08, suhrig @.***> wrote:

Hi Mazlina,

I just want to let you know that I have submitted a fix for this issue to the STAR code repository at end of last year. As of STAR version 2.7.10a, which has just been released, fully overlapping reads no longer produce malformed alignments.

Note that this fix does not solve all causes of malformed alignments. It only fixes malformed alignments arising from fully overlapping reads. Another very frequent cause of malformed alignments is not fixed by my patch: namely, when the insert size is smaller than the read length, such that paired-end mates extend past each other and contain adapter sequences. You can take a look at my pull request in the STAR repo https://github.com/alexdobin/STAR/pull/1425 for more details. However, these malformed alignments are of no concern. They are irrelevant to fusion detection. To my knowledge, STAR version 2.7.10a no longer produces any malformed alignments that impair fusion detection and Arriba's warnings about such alignments are harmless.

Regards, Sebastian

— Reply to this email directly, view it on GitHub https://github.com/suhrig/arriba/issues/133#issuecomment-1013688466, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPLGDEBDKJ4AZK2KIF66HTUWF5VTANCNFSM5GN3NYHQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.