umccr / RNAsum

Pipeline for generating RNAseq-based cancer patient reports
https://umccr.github.io/RNAsum/

Merge duplicated CGI known translocations #89

Closed pdiakumis closed 1 year ago

pdiakumis commented 1 year ago

Need to merge the duplicated MLL_* fusions by the source column, i.e. list each fusion once with a combined source such as validated;cgc (combined sources already appear throughout that file). See the file at inst/rawdata/cancer_biomarkers_database/cancer_genes_upon_trans.tsv:

# find duplicated fusions in CGI file

ref_genes.list[["cancer_biomarkers_trans"]] |>
  dplyr::group_by(translocation) |>
  dplyr::filter(dplyr::n() > 1)

# A tibble: 10 × 4
# Groups:   translocation [5]
   translocation effector_gene cancer_acronym source   
   <chr>         <chr>         <chr>          <chr>    
 1 MLL__MLLT1    MLL           ALL;AML        cgc      
 2 MLL__MLLT1    MLL           ALL;AML        validated
 3 MLL__MLLT10   MLL           ALL;AML        cgc      
 4 MLL__MLLT10   MLL           ALL;AML        validated
 5 MLL__MLLT3    MLL           ALL;AML        cgc      
 6 MLL__MLLT3    MLL           ALL;AML        validated
 7 MLL__MLLT4    MLL           ALL;AML        cgc      
 8 MLL__MLLT4    MLL           ALL;AML        validated
 9 MLL__MLLT6    MLL           ALL;AML        cgc      
10 MLL__MLLT6    MLL           ALL;AML        validated

The rest of the file is okay. We should probably also update that file when we get to that stage.

pdiakumis commented 1 year ago

kt_cgi <- ref_genes.list[["cancer_biomarkers_trans"]]
# merge duplicated fusions by source (see https://github.com/umccr/RNAsum/issues/89)
dup_id_cols <- c("translocation", "effector_gene", "cancer_acronym")
kt_cgi_dup <- kt_cgi |>
  dplyr::group_by(translocation) |>
  # keep only translocations that occur more than once
  dplyr::filter(dplyr::n() > 1) |>
  dplyr::ungroup() |>
  # spread each source value into its own column, one row per fusion
  tidyr::pivot_wider(id_cols = dplyr::all_of(dup_id_cols),
                     names_from = "source", values_from = "source") |>
  # paste the per-source columns back into a single ";"-separated column
  tidyr::unite(!dplyr::all_of(dup_id_cols), col = "source", sep = ";")
kt_cgi_dup

# A tibble: 5 × 4
  translocation effector_gene cancer_acronym source       
  <chr>         <chr>         <chr>          <chr>        
1 MLL__MLLT1    MLL           ALL;AML        cgc;validated
2 MLL__MLLT10   MLL           ALL;AML        cgc;validated
3 MLL__MLLT3    MLL           ALL;AML        cgc;validated
4 MLL__MLLT4    MLL           ALL;AML        cgc;validated
5 MLL__MLLT6    MLL           ALL;AML        cgc;validated
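
For reference, a minimal sketch of an alternative that merges the sources across the whole table in one step, so rows that were never duplicated are carried through untouched. This is not necessarily what went into the package: the toy table below stands in for ref_genes.list[["cancer_biomarkers_trans"]], the GENEA__GENEB row is made up for illustration, and kt_cgi_merged is a hypothetical name.

```r
# Toy stand-in for ref_genes.list[["cancer_biomarkers_trans"]].
# The GENEA__GENEB row is invented, only to show a non-duplicated entry.
kt_cgi <- tibble::tribble(
  ~translocation, ~effector_gene, ~cancer_acronym, ~source,
  "MLL__MLLT1",   "MLL",          "ALL;AML",       "cgc",
  "MLL__MLLT1",   "MLL",          "ALL;AML",       "validated",
  "MLL__MLLT3",   "MLL",          "ALL;AML",       "cgc",
  "MLL__MLLT3",   "MLL",          "ALL;AML",       "validated",
  "GENEA__GENEB", "GENEA",        "XYZ",           "validated;cgc"
)

# Group on the identifying columns and collapse the distinct source values
# of each group into one ";"-separated string. Groups with a single row
# keep their source string unchanged.
kt_cgi_merged <- kt_cgi |>
  dplyr::group_by(translocation, effector_gene, cancer_acronym) |>
  dplyr::summarise(
    source = paste(sort(unique(source)), collapse = ";"),
    .groups = "drop"
  )
```

Grouping on all three identifying columns means that two rows sharing a translocation but differing in effector_gene or cancer_acronym would stay separate rather than being silently merged, which makes such inconsistencies easy to spot with the duplicate check from the first comment.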

pdiakumis commented 1 year ago

Done via #88.