suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
214 stars 50 forks source link

Can one read be used to support multiple fusions #203

Open acloop62hz opened 11 months ago

acloop62hz commented 11 months ago

Hi @suhrig , I was running Arriba on this simulation dataset : https://data.broadinstitute.org/Trinity/CTAT_FUSIONTRANS_BENCHMARKING/on_simulated_data/sim_101 (In the case I'm about to mention it's reads from sim3.) I used the default script of Arriba_v2.4.0 and found these two fusions in fusions.tsv: ISCA1 SLC30A1 -/- -/- 9:88886922 1:211749631 CDS/splice-site CDS/splice-site translocation 299 289 152 1732 4155 high in-frame . Iron-sulphur_cluster_biosynthesis(58%)|Cation_efflux_family(52%) . . ENSG00000135070.9 ENSG00000170385.9 ENST00000375991.4 ENST00000367001.4 upstream downstream duplicates(11),low_entropy(3),mismatches(209),multimappers(162) CTD-2561J22.1 SLC30A1 -/- -/- 19:21564171 1:211749631 exon CDS/splice-site translocation 3 247 1 10 4155 high . . |Cation_efflux_family(52%) . . ENSG00000268731.1 ENSG00000170385.9 ENST00000597319.1 ENST00000367001.4 upstream downstream duplicates(6),mismatches(100),multimappers(253)

Between these two fusions, there are 166 identical read identifiers, for example: ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10030480_1_76430_4501_261 ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:1010524_1_76430_4597_166 ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10424315_1_76430_4580_164 ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10462442_1_76430_4592_195 ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10479657_1_76430_4532_241

According to Arriba's multimappers filter (I assume these reads are multi-mapping reads), "multi-mapping reads are reduced to a single alignment", each read should be mapped to only one fusion. Thus I wonder if my understanding for this filter is correct or if there's something wrong with the output.

Thanks for your time!

suhrig commented 11 months ago

Your understanding is correct. While they are listed in the column read_identifiers for both fusions, they are counted for only one of the fusions in the columns split_reads1/2 and discordant_mates.

All read-level filters (including the multimappers filter) work like this: they just exclude reads from being counted in the columns split_reads1/2 and discordant_mates (those are relevant to Arriba's internal filters). They never remove the reads from the read_identifiers column, however. You can think of it like this: the read count columns contain all high-quality reads which passed all filters. The column read_identifiers contains all supporting reads, including ones that were removed by filters. The manual is not sufficiently clear about this, sorry.

BTW, not only multimappers may lead to read identifiers being listed for multiple fusions. Discordant mates which are in the vicinity of breakpoints close to each other also often do.

acloop62hz commented 11 months ago

Thanks for your explanation, @suhrig! Is there any way to know those read identifiers that passed all filters?

suhrig commented 11 months ago

This is not possible, unfortunately. However, I don't think this is meaningful either. If a fusion made it to the main output file of Arriba, you should consider all reads listed in the read_identifiers column as supporting evidence for the fusion; not just the ones that Arriba considers high quality. The fact that the fusion passed all filters suggests that none of the supporting reads should have been removed in the first place.

acloop62hz commented 11 months ago

I see, thanks! One more question, I saw there's another issue where Arriba just picks one of the fusions indicated by multi-mapping reads "KANSL1-ARL17A/KANSL1-ARL17B fusion detection #165". I wonder if the difference between the two cases is that in "KANSL1-ARL17A/KANSL1-ARL17B", the alignment corresponding to KANSL1-ARL17A fusion are all discarded, thus no reads support KANSL1-ARL17A (or reads supporting KANSL1-ARL17A are moved by other filters), and while in the case I mentioned above, there exist both reads reduced to alignment supporting ISCA1--SLC30A1 and CTD-2561J22.1--SLC30A1

suhrig commented 11 months ago

I wonder if the difference between the two cases is that in "KANSL1-ARL17A/KANSL1-ARL17B", the alignment corresponding to KANSL1-ARL17A fusion are all discarded, thus no reads support KANSL1-ARL17A (or reads supporting KANSL1-ARL17A are moved by other filters)

Yes, that's most likely what happened in this case.

while in the case I mentioned above, there exist both reads reduced to alignment supporting ISCA1--SLC30A1 and CTD-2561J22.1--SLC30A1

After assignment of the multimapping reads to one of these two fusions, both fusions still have a sufficient number of supporting reads for Arriba to consider them real. However, I think Arriba made a mistake here. Compare the supporting reads of the two fusions: ISCA1--SLC30A1 has many split reads mapping to both fusion partners (columns split_reads1 and split_reads2) as well as discordant mates, whereas CTD-2561J22.1--SLC30A1 only has many split_reads1 and not much else. Having only one of the columns split_reads1, split_reads2, or discordant_mates set to a non-zero value is a characteristic of a false positive. I think that even the 247 split_reads assigned to the second fusion really belong to the first one and have just been misaligned and Arriba failed to recognize this. The second fusion is probably an artifact caused by sequence similarly between ISCA1 and CTD-2561J22.1.

acloop62hz commented 11 months ago

Thanks for your reply! After looking into the code, I think the problem might be that after merging adjacent fusions of CTD-2561J22.1--SLC30A1: fusion filter split_read1 split_read2 discordant_mate SLC30A1--CTD-2561J22.1--211749631--21564171 merge_adjacent 245 3 1 SLC30A1--CTD-2561J22.1--211749630--21564170 496 7 1 SLC30A1--CTD-2561J22.1--211749457--21564170 mismatches 0 0 0 SLC30A1--CTD-2561J22.1--211749614--21564174 mismatches 0 0 0 SLC30A1--CTD-2561J22.1--211749629--21564169 merge_adjacent 1 0 1 SLC30A1--CTD-2561J22.1--211749627--21564167 mismatches 0 0 0 SLC30A1--CTD-2561J22.1--211749521--21564187 0 0 1 SLC30A1--CTD-2561J22.1--211749628--21564168 mismatches 0 0 0 SLC30A1--CTD-2561J22.1--211749632--21564172 merge_adjacent 1 0 1

only the fusion SLC30A1--CTD-2561J22.1--211749630--21564170 is kept, but its spilt_read_listdoes not include reads from other fusions, thus when going through the list and discarding the multimappers, alignments in other fusions that are also multimappers are not discarded.

Is there any reason not to include reads into the list for other merged fusions? Should we include them so that the supporting read count would be more accurate?