morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

GAMBLR Improvements #148

Closed mattssca closed 1 year ago

mattssca commented 1 year ago

Updates included in this PR

Major changes:

  1. sv_to_custom_track up until now, does not work with the parameter is_annotated = FALSE. Possible due to an old auto-merge or remnants of old code. Maybe this function is not called as frequently, allowing the bug to unknowingly exist. The function was resolved and new examples were added (for both Boolean cases with the is_annotated parameter).

  2. gene_to_region and region_to_gene have been reverted to use locally stored gene coordinates and not rely on biomartR (unable to get the desired ensemble version (86) of gene annotations). Previously no hg38 gene coordinates were available in GAMBLR, restricting the just mentioned functions to only work with grch37. Gene annotations with respect to hg38 have been added to /GAMBLR/data/hg38_gene_coordiantes.Rda. Script on how these coordinates were retrieved and cleaned up has been added to /data-raw/DATASET.R for reproducibility purposes.

  3. Updates to prettyForest plot. I was unsuccessful in reproducing the error Ryan reported last week (even with his exact datasets). So not sure what is going on here. Maybe Rayan can have another go at it and see if the error persists. This function has also been updated to now work without a gene list provided. If this parameter is not used, the function will extract all mutated genes from the incoming maf and with another optional parameter, n_mutations, the user can now easily subset this gene list to any given mutation frequency, e.g set to 20 to only keep genes that are mutated in 20 (or more) samples. Allowing the user to reduce the number of test runs, giving a better chance of a significant result. Attached is the plot I generated using Ryan’s datasets, as well as the source code.

bl_meth_meta = read_tsv(file = "../../bl_meth_meta.tsv")
bl_coding = read_tsv(file = "../../bl_coding.tsv")

dlbcl_genes = lymphoma_genes %>%
  dplyr::filter(DLBCL == TRUE) %>%
  pull(Gene)

prettyForestPlot(maf = bl_coding,
                 metadata = bl_meth_meta,
                 comparison_column = "cluster",
                 comparison_values = c("cluster1", "cluster2"),
                 genes = dlbcl_genes,
                 max_q = 0.1)

animated

  1. New functions, get_manta_sv_by_sample and get_manta_sv_by_samples (wraps the first out of the two functions). For the design of these functions, I picked up Ryan's active PR (#143) and expanded upon this code. Relevant VCF information (based on returned columns for get_manta_sv) has also been added, as well as multiple checks, examples, documentation, etc. I have tested these functions with a variety of parameter combinations and the returns are satisfactory. For the reviewer's convenience, below is a sample output of calling get_manta_sv_by_sample on one sample, testing is obviously still encouraged.
> my_bedpe_df = get_manta_sv_by_sample(this_sample_id = "00-14595_tumorA")
[1] "Reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/manta_current/99-outputs/bedpe/genome--grch37/somaticSV/00-14595_tumorA--00-14595_normal--matched.somaticSV.bedpe"                    
> head(my_bedpe_df)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B NAME SOMATIC_SCORE STRAND_A STRAND_B TYPE FILTER VAF_tumour VAF_normal DP_tumour DP_normal tumour_sample_id normal_sample_id pair_status
1       1  1556541  1556547       1  1556664  1556670    .            40        -        -  BND   PASS      0.145          0        55        73  00-14595_tumorA  00-14595_normal     matched
2       1  6012725  6012732       1  6012825  6012832    .            48        +        +  BND   PASS      0.156          0        32        85  00-14595_tumorA  00-14595_normal     matched
3       1  8464072  8464090       1  8464293  8464311    .            40        +        +  BND   PASS      0.112          0       215        84  00-14595_tumorA  00-14595_normal     matched
4       1 10084099 10084251       1 10084266 10084411    .            48        -        -  BND   PASS      0.186          0        43        50  00-14595_tumorA  00-14595_normal     matched
5       1 10526162 10526172       1 10526290 10526300    .            40        -        -  BND   PASS      0.157          0       121        92  00-14595_tumorA  00-14595_normal     matched
6       1 15878430 15878436       1 15878608 15878614    .            40        +        +  BND   PASS      0.119          0       126       101  00-14595_tumorA  00-14595_normal     matched

Minor changes:

  1. fancy_v_chrcount now internally calls collate_results and not collate_qc_results.

  2. Parameter description updates for get_combined_sv to accurately represent the functionality of the projection parameter.

  3. Fixin typos and missing parameter descriptions to get_sample_cn_segments.

  4. Updating parameter descriptions for get_manta_sv.

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for all PRs

Required

This can be checked and addressed by running check_functions.pl and responding to the prompts. Test your code after you do this.

Optional but preferred with PRs

Checklist for New Functions

Required

Example:

#' Use GISTIC2.0 scores output to reproduce maftools::chromoplot with more flexibility
#'
#' @param scores output file scores.gistic from the run of GISTIC2.0
#' @param genes_to_label optional. Provide a data frame of genes to label (if mutated). The first 3 columns must contain chromosome, start, and end coordinates. Another required column must contain gene names and be named `gene`. (truncated for example)
#' @param cutoff optional. Used to determine which regions to color as aberrant. Must be float in the range [0-1]. (truncated for example)

Example:

#' @return nothing
#' @export
#' @import tidyverse ggrepel

Checklist for changes to existing code

mattssca commented 1 year ago

Review comments have been addressed and merge conflicts resolved. ready for rereview and merge if no outstanding issues remain.

rdmorin commented 1 year ago

Please set all comment threads that are resolved to "resolved" so I can focus on unresolved ones. The liftover of ICGC data is a bug that needs to be addressed, I think.

mattssca commented 1 year ago

All conversations have been resolved, except for the "liftover of ICGC data" comment.

mattssca commented 1 year ago

This function has now been updated to accurately deal with different genome builds (other than grch37 and hg38).

The function now looks for any discrepancies between genome_build and selected projection for a particular sample, if genome build contains either 37 or 19 in its string (e.g grch37, hg19, hs37d5), this is all equivalent to grch37 (projection). Meaning that if the genom_build for the selected sample is either grch37, hg19, hs37d5 and the selected projection is grch37 or hg19, no lift-over will take place. In the same example, if the selected projection is hg38 or grch38 the variant calls will be lifted to the selected projection.

Similarly, if the genome_build has 38 in its string and the selected projection is grch37 or hg19, liftover will take place and if the projection (in the same example) is hg38 or grch38 no lift-over will take place.

I have also removed the parameter with_chr_prefix and automated this step based on the selected projection.

mattssca commented 1 year ago

I've pushed updates to the data wrangling steps of get_manta_sv_by_sample, based on that get_manta_sv now calls get_combined_sv instead of get_merged_results. I will have to run a few tests tomorrow before it's ready for review. I will call on reviewers when this is done. Thanks!

mattssca commented 1 year ago

Testing is done and this PR is ready for review.

mattssca commented 1 year ago

This PR has been updated to solve the reported bug with certain samples not being subject to the liftover step, as intended. This bug was caused by the internal call of get_manta_sv_by_samples in get_manta_sv. This function was missing the projection parameter, meaning that the selected projection for all samples retrieved with this function defaulted to grch37, resulting in not lifting any samples to any other projection if specified within get_manta_sv.

After solving this another issue was discovered. The return for all lifted samples was just an empty data frame. After tweaking the filtering parameters (pass, pairing_status, min_vaf, and score) it was discovered that the original lifted data frame actually had samples in it, but that all values for VAF_tumour had been swapped to 0, meaning that no samples would show up with default value for min_vaf parameter in get_manta_sv. This bug was caught in liftover_bedpe, which automatically converts all numeric columns to integers, resulting in transforming all VAF_tumour values to zero. This code was dropped from this function after confirming with Kostia that this could be done.

get_manta_sv was then tested with a combination of parameters and careful interrogation of samples not returned with get_combined_sv (i.e samples returned with get_manta_v_by_samples). Coordinates for a set of such samples were compared to make sure that liftover had actually taken place, as well as that chr prefixes had been dealt with in the expected manner.

In addition, the with_chr_prefix parameter for get_manta_sv has now been dropped, since all the chr-prefix-handling is now based on the selected projection. If the user still wants to control if their variants are prefixed or not, I think it’s reasonable that they can do so on the return data frame themselves.

Kostia and I discussed a potential update for how get_manta_by_sample and get_manta_by_samples work. In short, we reasoned that get_manta_by_sample could be updated to not do any lifting, ever. This means this function would instead be interpreted as a helper function, called by get_manta_by_samples, where all lifting will occur. This would deprecate the force_lift parameter, potentially avoiding any confusion about how this parameter is intended to be used. It also condenses the code for get_manta_by_sample to some degree. One downside in doing this would be that a user would no longer be able to call get_manta_by_sample on the fly, for specific samples (rare cases?), instead, they would have to depend on get_manta_by_samples with the sample ID of interest presented in a metadata data frame, given to the these_samples_metadata parameter.

Here’s a reproducible example of how the updates were tested:

#run get_manta_sv
all_sv_grch37  = get_manta_sv(projection = "grch37")
all_sv_hg38  = get_manta_sv(projection = "hg38")

> #check if the coordinates are different
> #check a sample with genome_build: hs37d5 (not in return from get_combined_sv)
> dplyr::filter(all_sv_grch37, tumour_sample_id == "816-06-01TD") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1       1 241051430 241051430       1 241132692 241132692
2       1 241070420 241070423       9 104683919 104683922
3      14 106092345 106092346      14 106210781 106210782
4      14 106112860 106112860      14 106211679 106211679

> dplyr::filter(all_sv_hg38, tumour_sample_id == "816-06-01TD") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1    chr1 240888130 240888130    chr1 240969392 240969392
2    chr1 240907120 240907123    chr9 101921637 101921640
3   chr14 105626008 105626009   chr14 105744444 105744445
4   chr14 105646523 105646523   chr14 105745342 105745342

> #check a sample with genome_build: grc37 (not in return from get_combined_sv)
> dplyr::filter(all_sv_grch37, tumour_sample_id == "01-16433_tumorC") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1       2 59584524 59584527       2 63324857 63324860
2       2 63623039 63623042       2 65968099 65968102
3       2 65968014 65968016       2 65968170 65968172
4       2 77795746 77795750       2 77795801 77795801

> dplyr::filter(all_sv_hg38, tumour_sample_id == "01-16433_tumorC") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1    chr2 59357389 59357392    chr2 63097722 63097725
2    chr2 63395904 63395907    chr2 65740965 65740968
3    chr2 65740880 65740882    chr2 65741036 65741038
4    chr2 77568620 77568624    chr2 77568675 77568675

> #check a sample with genome_build: hg38 (not in return from get_combined_sv)
> dplyr::filter(all_sv_grch37, tumour_sample_id == "140127-PL02") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1       1 157369911 157369911       1 157370191 157370191
2       1 242143687 242143687       1 242143756 242143756
3       1 203222147 203222147       8 128779606 128779606
4      11 110090806 110090812      11 111312817 111312823

> dplyr::filter(all_sv_hg38, tumour_sample_id == "140127-PL02") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A   START_A     END_A CHROM_B   START_B     END_B
1    chr1 157400121 157400121    chr1 157400401 157400401
2    chr1 241980385 241980385    chr1 241980454 241980454
3    chr1 203253019 203253019    chr8 127767360 127767360
4   chr11 110220081 110220087   chr11 111442092 111442098

> #check a sample with genome_build: grch37 (is available in get_combined_sv)
> dplyr::filter(all_sv_grch37, tumour_sample_id == "05-11725_tumorA") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A START_A   END_A CHROM_B   START_B     END_B
1       1  948309  948315       6  30661003  30661009
2       1 1308047 1308051       1  50964244  50964248
3       1 1405036 1405051       4  17614779  17614794
4       1 1479720 1479721       6 134845806 134845807

> dplyr::filter(all_sv_hg38, tumour_sample_id == "05-11725_tumorA") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A START_A   END_A CHROM_B   START_B     END_B
1    chr1 1012928 1012935    chr6  30693225  30693232
2    chr1 1372666 1372671    chr1  50498571  50498576
3    chr1 1469655 1469671    chr4  17613155  17613171
4    chr1 1544339 1544341    chr6 134524667 134524669

> #check a sample with genome_build: hg38 (is available in get_combined_sv)
> dplyr::filter(all_sv_grch37, tumour_sample_id == "01-17074T") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1      17 12740288 12740290      17 16931849 16931851
2      19 55960487 55960489       9 98919902 98919904
3      22 29578704 29578707      22 29817673 29817676

> dplyr::filter(all_sv_hg38, tumour_sample_id == "01-17074T") %>% dplyr::select(1:6) %>% head(4)
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B
1   chr17 12836972 12836973   chr17 17028536 17028537
2   chr19 55449121 55449122    chr9 96157621 96157622
3   chr22 29182717 29182719   chr22 29421685 29421687