suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

Confidence determination #245

Open YinYangKarly opened 1 month ago

YinYangKarly commented 1 month ago

Dear Dr. Uhrig,

I have short-read samples, and for one of them we know that a CIC-DUX4 fusion is present. Alignment was done with STAR. After running Arriba on the generated BAM file, the expected fusion was found in fusions.discarded.tsv with low confidence. With the homologs filter disabled, the correct fusion appeared in fusions.tsv. The strange thing is that the fusions of CIC with DUX4 pseudogenes (two or three, DUX4L...) had a higher confidence (medium) than the actual fusion (low). I carefully studied your paper and the Arriba documentation (Confidence scoring in Interpretation of results), but could not figure out why the DUX4 pseudogenes had higher confidence than the actual fusion. How does Arriba determine the confidence level of a fusion?

suhrig commented 1 month ago

The DUX4... genes are almost identical in sequence. It is more or less arbitrary to which of these copies the STAR aligner maps a read. As such, it is also pretty arbitrary where Arriba makes a fusion call. The homologs filter simply makes sure (to some degree) that Arriba reports a fusion for only one of them. When you see a fusion between CIC and DUX4L..., this is actually an indication of a CIC-DUX4 fusion.

Do you see a fusion call involving any of the DUX4... genes in the main output file fusions.tsv with the homologs filter ENABLED? Or do you need to disable the filter for at least one of the fusions to be reported in the main output file?

Confidence scoring is complex: many rules determine the score, but the most important one is the number of supporting reads. My guess is that the fusion with DUX4 has fewer supporting reads than the ones involving the pseudogenes. Again, it is arbitrary where STAR puts the reads. Hence, the score does not indicate whether the fusion with a pseudogene is more reliable than the one with the real gene.
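To make the point about supporting reads concrete, here is a minimal sketch that tallies supporting reads per DUX4-family partner and pools them. It assumes the rows carry the columns `#gene1`, `gene2`, `split_reads1`, `split_reads2` and `discordant_mates` as found in Arriba's output tables; the function names and all read counts are made up for illustration. Pooling is only a rough way to see how reads were spread across the near-identical copies. It is not how Arriba computes confidence.

```python
from collections import defaultdict

# Columns assumed to hold the supporting-read counts in an Arriba output row.
READ_COLUMNS = ("split_reads1", "split_reads2", "discordant_mates")


def supporting_reads(row):
    """Sum split reads and discordant mates for one fusion call."""
    return sum(int(row[col]) for col in READ_COLUMNS)


def tally_by_partner(rows, anchor="CIC", family_prefix="DUX4"):
    """Return supporting reads per family member fused to the anchor gene."""
    totals = defaultdict(int)
    for row in rows:
        genes = (row["#gene1"], row["gene2"])
        if anchor not in genes:
            continue
        other = genes[1] if genes[0] == anchor else genes[0]
        if other.startswith(family_prefix):
            totals[other] += supporting_reads(row)
    return dict(totals)


# Hypothetical calls; the numbers are invented, not taken from any sample.
example_rows = [
    {"#gene1": "CIC", "gene2": "DUX4",
     "split_reads1": "1", "split_reads2": "1", "discordant_mates": "0"},
    {"#gene1": "CIC", "gene2": "DUX4L2",
     "split_reads1": "3", "split_reads2": "2", "discordant_mates": "1"},
    {"#gene1": "CIC", "gene2": "DUX4L4",
     "split_reads1": "2", "split_reads2": "1", "discordant_mates": "1"},
]

per_partner = tally_by_partner(example_rows)
pooled = sum(per_partner.values())
print(per_partner, pooled)
```

In this invented example the real gene receives the fewest reads even though all three calls plausibly describe the same event, which is the situation described above.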

YinYangKarly commented 1 month ago

Thank you for your quick response! I understand that CIC-DUX4L... is an indication that CIC-DUX4... is found by Arriba. However, my supervisor is interested in finding the exact fusion. First, I did an Arriba run with all filters ENABLED. CIC-DUX4... is then found in fusions.discarded.tsv, while disabling the homologs filter puts the CIC-DUX4... fusion in fusions.tsv.

I will contact the developer of STAR about the strange issue that the DUX4L... genes have more supporting reads than DUX4 itself.

suhrig commented 1 month ago

I doubt there is anything that the developer of STAR can do about this. The genes have almost identical sequence. From an alignment perspective, they cannot be disambiguated. This needs to happen at the interpretation stage, i.e., you will need to dig out the CIC-DUX4 fusion from the discarded file.
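One way to dig the call out of the discarded file is a small script like the sketch below. It assumes a tab-separated table with a header line containing `#gene1`, `gene2`, `confidence` and `filters`, as in Arriba's output files. The helper names and the sample rows (including the value in the `filters` column) are hypothetical and only there to make the example runnable.

```python
import csv
import io
import re


def gene_names(field):
    """Split a gene field into bare names.

    Assumes a side may list several genes separated by commas, each
    optionally followed by a distance in parentheses, e.g. "GENE1(1234),GENE2".
    """
    return [re.sub(r"\(\d+\)$", "", name.strip()) for name in field.split(",")]


def find_family_fusions(handle, anchor="CIC", family_prefix="DUX4"):
    """Return rows where one side is the anchor gene and the other side
    has a gene whose name starts with the family prefix."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        side1 = gene_names(row["#gene1"])
        side2 = gene_names(row["gene2"])
        for near, far in ((side1, side2), (side2, side1)):
            if anchor in near and any(g.startswith(family_prefix) for g in far):
                hits.append(row)
                break
    return hits


# Invented sample standing in for the discarded-fusions file.
SAMPLE = (
    "#gene1\tgene2\tconfidence\tfilters\n"
    "CIC\tDUX4\tlow\thomologs(3)\n"
    "CIC\tDUX4L2\tmedium\t.\n"
    "BCR\tABL1\thigh\t.\n"
)

hits = find_family_fusions(io.StringIO(SAMPLE))
for row in hits:
    print(row["#gene1"], row["gene2"], row["confidence"], row["filters"])

# With a real file (path is an assumption):
# with open("fusions.discarded.tsv", newline="") as fh:
#     hits = find_family_fusions(fh)
```

Matching on the `DUX4` prefix deliberately returns the pseudogene calls as well, since any of them may represent the same underlying CIC-DUX4 event.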