Fusion Sites Overlapping RepeatMasker DNA Repeats

DarioS commented 3 years ago

Sometimes, one segment of a split read maps to a repeat such as a LINE or SINE. Is such a fusion plausible? SINE repeats are present in thousands of places in the reference genome and the best mapping might be quite similar to the second best and other ones.

The location of the other part of the read is far from any exons for KNTC1 and inside an Alu repeat.

None of the reads belonging to the Alu retrotransposon (pink reads) are spliced to any exons of KNTC1 (blue reads), so it's probably wrong that arriba joins all of the exons of KNTC1 in the PDF plot. The soft-clipped reads do map perfectly to CLIP1 though.

So, even with all the complex filters already in arriba, it seems there are more to add in future! This seems similar to the mechanism described in "Alu Elements: At The Crossroads Between Disease and Evolution".

suhrig commented 3 years ago

True, the fusion does not really involve KNTC1 (apart from disrupting the gene body). There is no fusion between CLIP1 and KNTC1 per se - not only are all of the fusion reads intronic, they are on the antisense strand as well. But if the Alu sequence is not annotated in the gene model that you use, then Arriba assigns the next best hit, which is KNTC1 in this case. You can also see this from the fusion plot in that the genes are fused head-to-head.

It's not easy to tell whether CLIP1 is fused with this particular Alu copy or another one elsewhere in the genome, unless there is a uniquely identifying SNP that distinguishes this copy from others. And then again, it could be a false positive prediction altogether. You can search for other reads in KNTC1 that link it to CLIP1 (e.g., a reciprocal translocation) for confirmation. The best confirmation would be a matched whole-genome sequencing sample with a genomic breakpoint in the vicinity - if you happen to have that data.

Yes, filters can always be improved. The types of artifacts are unlimited, it seems.

DarioS commented 3 years ago

Now that you mention it, closest_genomic_breakpoint1 and closest_genomic_breakpoint2 are . for that variant and most others. Why do some variants have a dot value but others have values like chr2:186503623(1071) ? Also, what's the number inside the brackets?

suhrig commented 3 years ago

A dot indicates that there is no SV that matches the fusion. There is a higher chance that it is a false positive, but not necessarily. There are exceptions, for example: the SV caller may have missed the SV, or it's a complex rearrangement like pos1 -> pos2 -> pos3, which is reported by the SV caller as two separate SVs pos1 -> pos2 and pos2 -> pos3 and by Arriba as pos1 -> pos3 (and thus cannot be matched to the SVs), because the spliceosome skips the intermediate breakpoints.

The value in the parentheses is the distance to the breakpoint. Should've mentioned this in the manual ...
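For anyone post-processing the output, here is a minimal sketch of how such a value could be split into its parts. It only assumes what is stated above: a dot means no matching SV, and otherwise the value looks like chr2:186503623(1071), with the distance in the parentheses. The function name and the returned tuple layout are hypothetical, not part of Arriba.

```python
import re
from typing import Optional, Tuple

# Pattern for values such as "chr2:186503623(1071)".
# The greedy contig group tolerates contig names that contain a colon.
_BREAKPOINT_RE = re.compile(r"^(?P<contig>.+):(?P<pos>\d+)\((?P<dist>\d+)\)$")


def parse_closest_breakpoint(value: str) -> Optional[Tuple[str, int, int]]:
    """Parse a closest_genomic_breakpoint value (hypothetical helper).

    Returns None for "." (no matching SV), otherwise a tuple of
    (contig, position, distance to the breakpoint).
    """
    value = value.strip()
    if value == ".":
        return None
    match = _BREAKPOINT_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected breakpoint value: {value!r}")
    return match.group("contig"), int(match.group("pos")), int(match.group("dist"))


if __name__ == "__main__":
    print(parse_closest_breakpoint("chr2:186503623(1071)"))
    print(parse_closest_breakpoint("."))
```

Returning None for the dot keeps "no SV found" distinct from a distance of zero, which matters if you later filter calls by distance.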

suhrig commented 3 years ago

I used data from this study, which compares various polyA-selection kits and ribo-depletion kits. I don't see a bias towards intergenic breakpoints for any of the protocols (or towards any other breakpoint sites for that matter):

[Image: site_frequency]

The overall numbers of fusion predictions are also similar. Moreover, when I annotate the breakpoints with SINE/LINE/Alu repeats from the UCSC RepeatMasker track, I don't see an enrichment in the ribo-depleted libraries. So I'm having a bit of trouble replicating your observations.

How severe is this issue in your samples? What fraction of the calls is located in repeats? It could be that there really is a fusion with a repeat, just not with the one where STAR put the reads. This would explain why the breakpoints are not confirmed by structural variants. The breakpoints reported by Arriba may be wrong, but the fact that some gene is fused to a repeat would still be correct.
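To put a number on the fraction of calls located in repeats, something along these lines could work. It is only a rough sketch under several assumptions: the repeat intervals were exported from the RepeatMasker track as BED-like tuples (0-based, half-open), the breakpoint positions are 1-based, and the intervals on one contig do not overlap each other. All function names are made up for illustration, and this is not how Arriba or the annotation above was implemented.

```python
from bisect import bisect_right
from collections import defaultdict


def build_repeat_index(repeats):
    """Group (contig, start, end, repeat_class) tuples by contig and sort them.

    Assumption: coordinates are 0-based, half-open, as in BED files.
    """
    index = defaultdict(list)
    for contig, start, end, repeat_class in repeats:
        index[contig].append((start, end, repeat_class))
    for intervals in index.values():
        intervals.sort()
    return index


def repeat_at(index, contig, pos):
    """Return the class of the repeat covering a 1-based position, or None.

    Assumption: intervals on the same contig do not overlap, so only the
    last interval starting at or before the position needs to be checked.
    """
    intervals = index.get(contig, [])
    pos0 = pos - 1  # convert the 1-based position to 0-based
    i = bisect_right(intervals, (pos0, float("inf"), "")) - 1
    if i >= 0:
        start, end, repeat_class = intervals[i]
        if start <= pos0 < end:
            return repeat_class
    return None


def fraction_in_repeats(index, breakpoints):
    """Fraction of (contig, pos) breakpoints that fall inside any repeat."""
    breakpoints = list(breakpoints)
    if not breakpoints:
        return 0.0
    hits = sum(1 for contig, pos in breakpoints if repeat_at(index, contig, pos))
    return hits / len(breakpoints)


if __name__ == "__main__":
    # Toy intervals and breakpoints, invented purely for illustration.
    toy_repeats = [("chr12", 100, 400, "SINE"), ("chr12", 1000, 1200, "LINE")]
    toy_index = build_repeat_index(toy_repeats)
    toy_breakpoints = [("chr12", 150), ("chr12", 401), ("chr1", 5)]
    print(fraction_in_repeats(toy_index, toy_breakpoints))
```

Comparing this fraction between the tumour and the matched normal, or between library protocols, would show whether repeats are really over-represented among the calls.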

Regarding your question about a benchmark with other tools on ribo-depleted samples: I realized that such a benchmark would be pointless, because most fusion detection tools do not call breakpoints in intronic or intergenic regions.

DarioS commented 3 years ago

In about one-third of fusion calls per sample, one breakpoint is at a splice site and the other is in an intron, and the intronic one always appears to fall in an Alu element. It's similar to the bar chart displayed above. I have matched adjacent normal RNA-seq as well, so I could check whether these Alu elements are present in the normal sample and simply reflect natural variation between people that is not captured in the human reference genome.

suhrig / arriba

Fusion Sites Overlapping RepeatMasker DNA Repeats #92