morinlab / GAMBLR

Set of standardized functions to operate with genomic data
MIT License

annotate_sv is reporting some SV with more than one annotation #247

Closed rdmorin closed 1 year ago

rdmorin commented 1 year ago

Currently, in some edge cases, annotate_sv reports the same SV with more than one annotation, artificially duplicating it in the output. We should modify the function so that it only ever reports one fusion per breakpoint. The known examples all involve BCL6, because this region is treated as both a partner and an oncogene. In the example below, the annotations CIITA-BCL6 (row 1) and BCL6-CIITA (row 4) refer to the same event. There also appears to be some other duplication in the output (rows 1 and 2 below are identical), which should also be resolved. Whenever BCL6 is involved, we should pick the fusion that has BCL6 as the "gene" rather than the "partner", i.e. prioritize it as a recurrent oncogene rather than a recurrent partner. When doing this, we also need to confirm that the change doesn't cause other annotations to change (e.g. the BCL6-MYC rearrangements).

   chrom1    start1      end1 chrom2   start2     end2 name score strand1 strand2 tumour_sample_id  gene partner     fusion
 1:      3 187466157 187466157     16 10969379 10969379    .   204       +       +        14-41461T  BCL6   CIITA CIITA-BCL6
 2:      3 187466157 187466157     16 10969379 10969379    .   204       +       +        14-41461T  BCL6   CIITA CIITA-BCL6
 3:      3 187276658 187276662     16 10862788 10862792    .    85       -       -        14-41461T CIITA    BCL6 BCL6-CIITA
 4:      3 187466157 187466157     16 10969379 10969379    .   204       +       +        14-41461T CIITA    BCL6 BCL6-CIITA
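
To make the duplication in the table above easy to spot, here is a minimal sketch that groups rows by every column except gene, partner and fusion and lists the breakpoints that come back more than once. It is plain Python written for illustration only, not the GAMBLR R code; `KEY_COLS`, `sv` and `multiply_annotated` are hypothetical names, and the rows are copied from the table above.

```python
from collections import defaultdict

# Columns that identify one SV call (everything except gene, partner, fusion).
KEY_COLS = ("chrom1", "start1", "end1", "chrom2", "start2", "end2",
            "strand1", "strand2", "tumour_sample_id")


def sv(start1, end1, start2, end2, strand, gene, partner):
    """Build one row of the example table (chr3 to chr16, sample 14-41461T)."""
    return {
        "chrom1": "3", "start1": start1, "end1": end1,
        "chrom2": "16", "start2": start2, "end2": end2,
        "strand1": strand, "strand2": strand,
        "tumour_sample_id": "14-41461T",
        "gene": gene, "partner": partner,
        "fusion": f"{partner}-{gene}",  # the table labels fusions as partner-gene
    }


def multiply_annotated(rows):
    """Map each SV key to its fusion labels, keeping keys reported more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in KEY_COLS)].append(row["fusion"])
    return {key: fusions for key, fusions in groups.items() if len(fusions) > 1}


rows = [
    sv(187466157, 187466157, 10969379, 10969379, "+", "BCL6", "CIITA"),   # row 1
    sv(187466157, 187466157, 10969379, 10969379, "+", "BCL6", "CIITA"),   # row 2
    sv(187276658, 187276662, 10862788, 10862792, "-", "CIITA", "BCL6"),   # row 3
    sv(187466157, 187466157, 10969379, 10969379, "+", "CIITA", "BCL6"),   # row 4
]

dupes = multiply_annotated(rows)
for key, fusions in dupes.items():
    print(key[:6], fusions)
```

Only the breakpoint shared by rows 1, 2 and 4 is flagged; row 3 has different coordinates and is a separate call.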
vladimirsouza commented 1 year ago

Let me see if I understood correctly. If there are two rows where everything is the same (not considering gene, partner and fusion columns), but in one of them gene is BCL6 and partner is XXX, and in the other row gene is XXX and partner is BCL6, we keep the one where gene is BCL6. However, if XXX is MYC (the only exception), we keep the row where partner is BCL6 (then gene is MYC). Is this correct?

vladimirsouza commented 1 year ago

That is, MYC has the highest priority to be gene, BCL6 has the second highest priority, and all others have the same low priority.
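
The priority order described above can be sketched as a small ranking table. This is a plain-Python illustration, not the R implementation in annotate_sv; `GENE_PRIORITY` and `pick_annotation` are hypothetical names. The CIITA/BCL6 pair comes from the first comment, while the MYC/BCL6 pair is made up to show the exception.

```python
# Lower rank wins the "gene" slot; genes not listed share the lowest priority.
GENE_PRIORITY = {"MYC": 0, "BCL6": 1}
DEFAULT_RANK = len(GENE_PRIORITY)


def pick_annotation(candidates):
    """Return the candidate row whose gene has the highest priority."""
    return min(candidates,
               key=lambda row: GENE_PRIORITY.get(row["gene"], DEFAULT_RANK))


# Two annotations of the same breakpoint, as in the issue description.
ciita_pair = [
    {"gene": "CIITA", "partner": "BCL6", "fusion": "BCL6-CIITA"},
    {"gene": "BCL6", "partner": "CIITA", "fusion": "CIITA-BCL6"},
]

# Illustrative rows only (not taken from real output): MYC outranks BCL6.
myc_pair = [
    {"gene": "BCL6", "partner": "MYC", "fusion": "MYC-BCL6"},
    {"gene": "MYC", "partner": "BCL6", "fusion": "BCL6-MYC"},
]

print(pick_annotation(ciita_pair)["fusion"])
print(pick_annotation(myc_pair)["fusion"])
```

With this ordering the first pair resolves to the row where BCL6 is the gene, and the second to the row where MYC is the gene, so the BCL6-MYC label is left unchanged.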

vladimirsouza commented 1 year ago

I guess that in cases like this (duplicated rows that differ only in the gene, partner and fusion columns),

   chrom1   start1     end1 chrom2   start2     end2 name score strand1 strand2 tumour_sample_id    gene partner        fusion
1:     13 91975699 91975699     16 10980639 10980639    .    73       +       +         SP124973 MIR17HG   CIITA CIITA-MIR17HG
2:     13 91975699 91975699     16 10980639 10980639    .    73       +       +         SP124973   CIITA    <NA>      NA-CIITA

we keep the line where partner is not NA (first line).

PS: In the second line, partner is NA because its region doesn't overlap with any region in GAMBLR.data::grch37_partners.

vladimirsouza commented 1 year ago

Are these rows all right? Only the gene, entrez and fusion columns differ. Is it all right for the partner to be NA?

   chrom1   start1     end1 chrom2   start2     end2 name score strand1 strand2 tumour_sample_id  gene entrez partner   fusion
1:     11 69581205 69581218     11 69782156 69782169    .    53       +       -        13-32258T CCND1    595    <NA> NA-CCND1
2:     11 69581205 69581218     11 69782156 69782169    .    53       +       -        13-32258T  FGF3   2248    <NA>  NA-FGF3

PS: CCND1 refers to the 11:69581205-69581218 region and FGF3 to 11:69782156-69782169.

rdmorin commented 1 year ago

The only case that needs to change is when partner is not NA on both lines.
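
Reading that as "collapse a group of duplicated rows only when every row in it has a partner", the rule could be sketched as follows. This is plain Python for illustration, not the code from the PR; `collapse`, `gene_rank`, `row` and the reduced `KEY_COLS` (coordinates and sample only, strands omitted for brevity) are hypothetical. The rows are taken from the three examples in this thread, with `None` standing in for NA.

```python
from collections import defaultdict

KEY_COLS = ("chrom1", "start1", "end1", "chrom2", "start2", "end2",
            "tumour_sample_id")
GENE_PRIORITY = {"MYC": 0, "BCL6": 1}  # lower rank is preferred as "gene"


def gene_rank(r):
    return GENE_PRIORITY.get(r["gene"], len(GENE_PRIORITY))


def collapse(rows):
    """Keep one row per SV, but only for groups where no partner is missing."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in KEY_COLS)].append(r)
    kept = []
    for group in groups.values():
        if len(group) > 1 and all(r["partner"] is not None for r in group):
            kept.append(min(group, key=gene_rank))  # one fusion per breakpoint
        else:
            kept.extend(group)  # groups with an NA partner are left untouched
    return kept


def row(c1, s1, e1, c2, s2, e2, sample, gene, partner):
    return dict(zip(KEY_COLS + ("gene", "partner"),
                    (c1, s1, e1, c2, s2, e2, sample, gene, partner)))


rows = [
    row("3", 187466157, 187466157, "16", 10969379, 10969379, "14-41461T", "BCL6", "CIITA"),
    row("3", 187466157, 187466157, "16", 10969379, 10969379, "14-41461T", "CIITA", "BCL6"),
    row("13", 91975699, 91975699, "16", 10980639, 10980639, "SP124973", "MIR17HG", "CIITA"),
    row("13", 91975699, 91975699, "16", 10980639, 10980639, "SP124973", "CIITA", None),
    row("11", 69581205, 69581218, "11", 69782156, 69782169, "13-32258T", "CCND1", None),
    row("11", 69581205, 69581218, "11", 69782156, 69782169, "13-32258T", "FGF3", None),
]

result = collapse(rows)
print([(r["gene"], r["partner"]) for r in result])
```

Under this reading only the BCL6/CIITA pair is reduced to a single row; the MIR17HG/CIITA and CCND1/FGF3 rows are kept as they are.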

vladimirsouza commented 1 year ago

Issue solved in this PR.