Closed rdmorin closed 1 year ago
Let me see if I understood correctly.
If two rows are identical apart from the gene, partner and fusion columns, but in one of them gene is BCL6 and partner is XXX, and in the other gene is XXX and partner is BCL6, we keep the one where gene is BCL6. However, if XXX is MYC (the only exception), we keep the row where partner is BCL6 (so gene is MYC). Is this correct?
That is, MYC has the highest priority to be gene, BCL6 has the second highest priority, and all others share the same low priority.
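The priority rule described above can be sketched roughly as follows. This is only an illustration in Python with plain dicts, not the `annotate_sv` implementation; `GENE_PRIORITY`, `gene_rank` and `pick_reciprocal` are hypothetical names, and the example rows carry only the three columns that differ.

```python
# Hypothetical priority table: a lower rank wins the "gene" slot.
# MYC first, BCL6 second, every other symbol shares the lowest priority.
GENE_PRIORITY = {"MYC": 0, "BCL6": 1}
DEFAULT_RANK = 2


def gene_rank(symbol):
    """Return the priority rank of a symbol appearing in the gene column."""
    return GENE_PRIORITY.get(symbol, DEFAULT_RANK)


def pick_reciprocal(row_a, row_b):
    """Given two annotations of the same breakpoint with gene and partner
    swapped, keep the row whose gene has the higher priority."""
    return min((row_a, row_b), key=lambda row: gene_rank(row["gene"]))


# Illustrative rows; fusion is written partner-gene, as in the thread.
bcl6_as_gene = {"gene": "BCL6", "partner": "CIITA", "fusion": "CIITA-BCL6"}
bcl6_as_partner = {"gene": "CIITA", "partner": "BCL6", "fusion": "BCL6-CIITA"}
kept = pick_reciprocal(bcl6_as_gene, bcl6_as_partner)

# The MYC exception: BCL6 stays in the partner column.
myc_as_gene = {"gene": "MYC", "partner": "BCL6", "fusion": "BCL6-MYC"}
myc_as_partner = {"gene": "BCL6", "partner": "MYC", "fusion": "MYC-BCL6"}
kept_myc = pick_reciprocal(myc_as_gene, myc_as_partner)
```

Keeping the ranking in one lookup table means a further exception would only need a new entry rather than another branch.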
I guess in cases like this (duplication; only the gene, partner and fusion columns differ),
chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2 tumour_sample_id gene partner fusion
1: 13 91975699 91975699 16 10980639 10980639 . 73 + + SP124973 MIR17HG CIITA CIITA-MIR17HG
2: 13 91975699 91975699 16 10980639 10980639 . 73 + + SP124973 CIITA <NA> NA-CIITA
we keep the line where partner is not NA (first line).
PS: In the second line, partner is NA because its region doesn't overlap with any region in GAMBLR.data::grch37_partners.
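A minimal sketch of that rule, assuming rows are plain dicts and NA is represented as `None`. The function name and `KEY_COLS` are hypothetical, and which columns define "the same SV" is an assumption here. Groups in which every partner is NA, or none is, are left untouched, since the thread treats those cases separately.

```python
from collections import defaultdict

# Assumed set of columns that identify one SV in one sample.
KEY_COLS = ("chrom1", "start1", "end1", "chrom2", "start2", "end2",
            "strand1", "strand2", "tumour_sample_id")


def drop_unpartnered_duplicates(rows):
    """Within each group of rows sharing KEY_COLS, drop the rows whose
    partner is None when at least one row in the group has a partner."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in KEY_COLS)].append(row)
    kept = []
    for group in groups.values():
        partnered = [row for row in group if row["partner"] is not None]
        kept.extend(partnered if partnered else group)
    return kept


# The two rows quoted in the thread (name and score columns omitted).
shared = dict(chrom1="13", start1=91975699, end1=91975699,
              chrom2="16", start2=10980639, end2=10980639,
              strand1="+", strand2="+", tumour_sample_id="SP124973")
rows = [
    dict(shared, gene="MIR17HG", partner="CIITA", fusion="CIITA-MIR17HG"),
    dict(shared, gene="CIITA", partner=None, fusion="NA-CIITA"),
]
result = drop_unpartnered_duplicates(rows)
```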
Are these rows all right? Only the gene, entrez and fusion columns differ. Is it all right for partner to be NA?
chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2 tumour_sample_id gene entrez partner fusion
1: 11 69581205 69581218 11 69782156 69782169 . 53 + - 13-32258T CCND1 595 <NA> NA-CCND1
2: 11 69581205 69581218 11 69782156 69782169 . 53 + - 13-32258T FGF3 2248 <NA> NA-FGF3
PS: CCND1 refers to the 11:69581205-69581218 region and FGF3 to 11:69782156-69782169.
The only case that needs to change is when partner is not NA in both rows.
Issue solved in this PR.
Currently, for some edge cases, annotate_sv will report the same SV with more than one annotation, thereby artificially duplicating it in the output. We should modify the function to only ever report one fusion per breakpoint.
The current (known) examples all involve BCL6 because this region is considered both as a partner and an oncogene. In the example below, the annotations CIITA-BCL6 (row 1) and BCL6-CIITA (row 4) refer to the same event. There also appears to be some other duplication in the output (see row 1 vs 2 in output below). This should also be resolved.
Whenever BCL6 is involved we should pick the fusion that has BCL6 as the "gene" rather than the "partner", i.e. prioritizing it as a recurrent oncogene rather than a recurrent partner. When doing this we also need to confirm that the change doesn't cause other annotations to change (e.g. the BCL6-MYC rearrangements).
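One way to check for this kind of duplication is to group the annotated rows by breakpoint and list every breakpoint that carries more than one fusion label. The sketch below is a hedged illustration in Python, not part of GAMBLR; `duplicated_breakpoints` and `KEY_COLS` are hypothetical, and the coordinates in the example are made-up placeholders rather than real calls.

```python
from collections import defaultdict

# Assumed set of columns that identify one SV in one sample.
KEY_COLS = ("chrom1", "start1", "end1", "chrom2", "start2", "end2",
            "strand1", "strand2", "tumour_sample_id")


def duplicated_breakpoints(rows):
    """Map each breakpoint key that is reported more than once
    to the list of fusion labels attached to it."""
    fusions = defaultdict(list)
    for row in rows:
        fusions[tuple(row[col] for col in KEY_COLS)].append(row["fusion"])
    return {key: labels for key, labels in fusions.items() if len(labels) > 1}


# Placeholder breakpoint (coordinates and sample ID are invented).
site = dict(chrom1="3", start1=1, end1=1, chrom2="16", start2=2, end2=2,
            strand1="+", strand2="+", tumour_sample_id="sample_A")
annotated = [
    dict(site, gene="BCL6", partner="CIITA", fusion="CIITA-BCL6"),
    dict(site, gene="CIITA", partner="BCL6", fusion="BCL6-CIITA"),
    dict(site, tumour_sample_id="sample_B",
         gene="BCL6", partner="CIITA", fusion="CIITA-BCL6"),
]
dups = duplicated_breakpoints(annotated)
```

Running the same check before and after a fix gives a quick way to confirm that each breakpoint ends up with a single fusion.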