morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

Adding packages to and tidy DESCRIPTION file + hotfix for get manta_sv_by_sample #152

Closed mattssca closed 1 year ago

mattssca commented 1 year ago

This PR includes a very small update to the package DESCRIPTION file. I have added the following packages to be properly imported when installing GAMBLR. The packages were added with the following code:

usethis::use_package("plotly")
usethis::use_package("wordcloud")

While I was at it, I also ran usethis::use_tidy_description(), which orders and formats DESCRIPTION fields according to a fixed standard.

rdmorin commented 1 year ago

Usually when dependencies are missing it's because the function(s) that rely on them are not importing them. Do you know which are the offending functions? If the package isn't imported this may still lead to issues downstream for users who don't have them imported already.

mattssca commented 1 year ago

Indeed, I checked. plotly is correctly imported by fancy_qc_plot and wordcloud is correctly imported by prettyGeneCloud

rdmorin commented 1 year ago

If that's the case then isn't the DESCRIPTION file supposed to automatically be updated to ensure these are dependencies?

mattssca commented 1 year ago

I thought so too. But from a quick look (https://stackoverflow.com/questions/8597993/does-roxygen2-automatically-write-namespace-directives-for-imports-packages), it seems that the DESCRIPTION file is not automatically updated?

mattssca commented 1 year ago

The recently reported issues associated with get_manta_sv_by_sample have both been resolved in this PR.

  1. The first issue (chromosome X showing up as NA in the returned data frame), was easily resolved by specifying the column types in the read_tsv argument for this function.

  2. The second issue required some more digging to understand why some variants were dropped in the returned data frame. But this is what I found and what I have adjusted for. Certain variants do not have all the FORMAT fields causing vaf_tumour scores to be mistakenly set to 0 for such samples. The majority of variant calls in the available bedpe files have the following FORMAT fields; GT:PR:SR:TR:DP:VAF. However, some variants are missing the SR field, causing the subsequent fields to be shifted to the left (in the data wrangling steps of get_manta_sv_by_sample). This has been solved by applying the VCF data-wrangling steps from read_merge_manta_with_liftover (also a much cleaner solution toa achieve this!).

I have also added the filtering parameters to the internal call of get_manta_by_samples in get_manta_sv to ensure that when calling get_manta_sv with filtering parameters set to anything besides the default, these do not get hijacked by the default values in get_manta_sv_by_sample and get_manta_sv_by_samples.

Comparing the return from get_manta_sv_by_sample for 191621-PL01 with the data on disk (for a subset of variants on chr10), we now have one additional variant that this function previously did not return. The remaining missing variant gets filtered out by this function since it has a VAF_tumour value of 0.083 and a SCORE value of 38 (can be kept by lowering the filtering thresholds).

In addition, this update has been tested the following ways:

#get_manta_sv
all_sv_grch37 = get_manta_sv(projection = "grch37")
all_sv_hg38 = get_manta_sv(projection = "hg38")

#get_manta_with filtering parameters (make sure filtering parameters are working)
all_sv_grch37_score = get_manta_sv(projection = "grch37", min_vaf = 0.1, min_score = 100)
all_sv_grch37_vaf = get_manta_sv(projection = "grch37", min_vaf = 0.2, min_score = 40)

#print heads
head(all_sv_grch37_score, 5) %>% dplyr::select(1:8)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B manta_name   SCORE
1       1 208491481 208491485       2 132774528 132774532       <NA>  447.12
2      11  19976519  19976531       5  29752708  29752720       <NA>  228.09
3       3 102161188 102161190       7  55272296  55272298       <NA>  289.27
4       2  89161434  89161436       2  89597016  89597018       <NA> 1208.13
5       2  89132276  89132281       2  89512903  89512908       <NA>  330.90

head(all_sv_grch37_vaf, 5) %>% dplyr::select(1:13)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B manta_name   SCORE STRAND_A STRAND_B tumour_sample_id normal_sample_id VAF_tumour
1      11  19976519  19976531       5  29752708  29752720       <NA>  228.09        -        +          SP59272          SP59271  0.2083333
2       2  89161434  89161436       2  89597016  89597018       <NA> 1208.13        -        +          SP59328          SP59326  0.3105590
3      14 106330456 106330458      14 107083254 107083256       <NA>  410.67        +        -          SP59388          SP59386  0.2833333
4      14 106330451 106330452      14 106370356 106370357       <NA>  908.38        +        -         SP116776         SP116777  0.2528736
5      14 106370385 106370386      14 106573231 106573232       <NA>  773.54        +        -         SP116776         SP116777  0.3928571

#check if all samples are there (compared to the results saved on disk)
dplyr::filter(all_sv_grch37, tumour_sample_id == "191621-PL01", CHROM_A == "10" | CHROM_B == "10") %>% dplyr::select(1:6)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1      10  70269121  70269124      10  70327972  70327975
2      10 105959423 105959424      10 105959680 105959680
3      10 117524871 117524872      10 117524959 117524959
4      10 116456286 116456325      20  52671949  52671991
5       2  25149194  25149194      10 107935679 107935679

dplyr::filter(all_sv_hg38, tumour_sample_id == "191621-PL01", CHROM_A == "chr10" | CHROM_B == "chr10") %>% dplyr::select(1:6)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1   chr10  68509364  68509367   chr10  68568215  68568218
2   chr10 104199665 104199666   chr10 104199922 104199922
3   chr10 115765360 115765361   chr10 115765448 115765448
4   chr10 114696527 114696566   chr20  54055410  54055452
5    chr2  24926325  24926325   chr10 106175921 106175921

#check the tail of data (chrX showing up as NA)
tail(all_sv_grch37) %>% dplyr::select(1:6)
       CHROM_A   START_A     END_A CHROM_B   START_B     END_B
413875       X 155033031 155033568       X 155216481 155217043
413876       X 155033036 155033496       X 155216552 155217046
413877       X 155033055 155033432       X 155216471 155217016
413878       X 155033080 155033617       X 155216600 155217040
413879       X 155230794 155230797       X 155230863 155230866
413880       X 155251243 155251243       X 155252494 155252494

tail(all_sv_hg38) %>% dplyr::select(1:6)
       CHROM_A   START_A     END_A CHROM_B   START_B     END_B
412761    chrX 155803368 155803905    chrX 155986816 155987378
412762    chrX 155803373 155803833    chrX 155986887 155987381
412763    chrX 155803392 155803769    chrX 155986806 155987351
412764    chrX 155803417 155803954    chrX 155986935 155987375
412765    chrX 156001129 156001132    chrX 156001198 156001201
412766    chrX 156021578 156021578    chrX 156022829 156022829

#check if the coordinates are different
#check a sample with genome_build: hs37d5 (not in return from get_combined_sv)
dplyr::filter(all_sv_grch37, tumour_sample_id == "816-06-01TD") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1       1 241051430 241051430       1 241132692 241132692
2       1 241070420 241070423       9 104683919 104683922
3      14 106092345 106092346      14 106210781 106210782
4      14 106112860 106112860      14 106211679 106211679

dplyr::filter(all_sv_hg38, tumour_sample_id == "816-06-01TD") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1    chr1 240888130 240888130    chr1 240969392 240969392
2    chr1 240907120 240907123    chr9 101921637 101921640
3   chr14 105626008 105626009   chr14 105744444 105744445
4   chr14 105646523 105646523   chr14 105745342 105745342

#check a sample with genome_build: grc37 (not in return from get_combined_sv)
dplyr::filter(all_sv_grch37, tumour_sample_id == "01-16433_tumorC") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1       1   2481675   2481678       1   2763758   2763761
2      10  81788849  81789401      10  81804755  81805085
3      10 127197008 127197486      10 127200837 127201137
4      11  67483904  67484488      11  67506748  67507117

dplyr::filter(all_sv_hg38, tumour_sample_id == "01-16433_tumorC") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1    chr1   2550236   2550239    chr1   2847193   2847196
2   chr10  80029093  80029645   chr10  80044999  80045329
3   chr10 125508439 125508917   chr10 125512268 125512568
4   chr11  67716433  67717017   chr11  67739277  67739646

#check a sample with genome_build: hg38 (not in return from get_combined_sv)
dplyr::filter(all_sv_grch37, tumour_sample_id == "140127-PL02") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1       1  16415297  16415475       1  16416075  16416253
2       1 157369911 157369911       1 157370191 157370191
3       1 242143687 242143687       1 242143756 242143756
4       1 248631449 248631520       1 248743242 248743329

dplyr::filter(all_sv_hg38, tumour_sample_id == "140127-PL02") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1    chr1  16088802  16088980    chr1  16089580  16089758
2    chr1 157400121 157400121    chr1 157400401 157400401
3    chr1 241980385 241980385    chr1 241980454 241980454
4    chr1 248468148 248468219    chr1 248579941 248580028

#check a sample with genome_build: grch37 (is available in get_combined_sv)
dplyr::filter(all_sv_grch37, tumour_sample_id == "05-11725_tumorA") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A START_A   END_A CHROM_B   START_B     END_B
1       1  948309  948315       6  30661003  30661009
2       1 1308047 1308051       1  50964244  50964248
3       1 1405036 1405051       4  17614779  17614794
4       1 1479720 1479721       6 134845806 134845807

dplyr::filter(all_sv_hg38, tumour_sample_id == "05-11725_tumorA") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A START_A   END_A CHROM_B   START_B     END_B
1    chr1 1012928 1012935    chr6  30693225  30693232
2    chr1 1372666 1372671    chr1  50498571  50498576
3    chr1 1469655 1469671    chr4  17613155  17613171
4    chr1 1544339 1544341    chr6 134524667 134524669

#check a sample with genome_build: hg38 (is available in get_combined_sv)
dplyr::filter(all_sv_grch37, tumour_sample_id == "01-17074T") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1      17 12740288 12740290      17 16931849 16931851
2      19 55960487 55960489       9 98919902 98919904
3      22 29578704 29578707      22 29817673 29817676

dplyr::filter(all_sv_hg38, tumour_sample_id == "01-17074T") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1   chr17 12836972 12836973   chr17 17028536 17028537
2   chr19 55449121 55449122    chr9 96157621 96157622
3   chr22 29182717 29182719   chr22 29421685 29421687