suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

intergenic breakpoints reported without distances to genes #202

Closed anoronh4 closed 8 months ago

anoronh4 commented 12 months ago

I am finding a few intergenic fusion calls that do not have distances to the nearest genes annotated. According to the documentation: "If a breakpoint is in an intergenic region, Arriba lists the closest genes upstream and downstream from the breakpoint, separated by a comma. The numbers in parentheses after the closest genes state the distance to the genes."

$ cat mysample.fusions.tsv | cut -f 1-2,7-8,15 | grep intergenic | grep -v ,
RP11-180P8.1    TANC2   exon    intergenic  high
RP1 RP11-56A10.1    CDS/splice-site intergenic  high
METTL15 RP11-22P4.1 CDS/splice-site intergenic  high
BTBD8   KIAA1107    CDS intergenic  medium
C4orf3  KLHL2P1 intron  intergenic  medium
AC104651.1  RP11-727A23.10  intergenic  intron  medium
FAM115C FAM115B CDS/splice-site intergenic  medium
LINC01138   LINC00869   intergenic  exon    medium
AKR1E2  AKR1C1  intergenic  5'UTR/splice-site   low
SULT1C2 SULT1C2P1   intergenic  exon    low
AP000783.1  GRAMD1B CDS/splice-site intergenic  low

Just wondering if I should interpret these calls differently from other fusions that involve intergenic breakpoints. I am using Arriba 2.3.0.

suhrig commented 12 months ago

I agree that this conflicts with the documentation. It happens in an ambiguous situation where a breakpoint can be considered both part of a gene and intergenic at the same time: namely, when the breakpoint coordinate lies outside a gene, but the supporting reads at the breakpoint are spliced to the body of a nearby gene. In this situation the reads clearly originate from the involved gene, but the fusion breakpoint is intergenic.

Others have complained about this confusing annotation previously. I tried to fix it, but this turned out to be very complicated and would cause conflicts in the code, because we are dealing with an inherently conflicting situation here. I should probably make a note about this in the documentation. Then again, I want to keep the docs simple and not overload them with all kinds of rare edge cases.
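For anyone who wants to flag these calls in their own output, here is a minimal Python sketch of the same check as the `cut`/`grep` pipeline above. It is not part of Arriba. The column positions are taken from the `cut -f 1-2,7-8,15` call in the original report; the function name `flag_ambiguous_intergenic` is hypothetical, and skipping lines that start with `#` assumes the header line is marked that way. Unlike the `grep -v ,` filter, it pairs each gene column with its own site column, so it reports which side of the fusion is affected.

```python
# Column positions follow the `cut -f 1-2,7-8,15` call in the report
# (1-based there, 0-based here): gene1/gene2, site1/site2, confidence.
GENE_COLS = (0, 1)
SITE_COLS = (6, 7)
CONFIDENCE_COL = 14


def flag_ambiguous_intergenic(lines):
    """Return (side, gene, confidence) for every breakpoint whose site is
    'intergenic' but whose gene field has no comma-separated pair of
    flanking genes, i.e. no distances are annotated."""
    flagged = []
    for line in lines:
        # Assumption: header/comment lines start with '#'.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines that lack the confidence column.
        if len(fields) <= CONFIDENCE_COL:
            continue
        for side in (0, 1):
            gene = fields[GENE_COLS[side]]
            site = fields[SITE_COLS[side]]
            if site == "intergenic" and "," not in gene:
                flagged.append((side + 1, gene, fields[CONFIDENCE_COL]))
    return flagged
```

To use it, pass an open file handle, for example `flag_ambiguous_intergenic(open("mysample.fusions.tsv"))`. Each returned tuple names the side (1 or 2), the gene reported for that breakpoint, and the confidence of the call.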