morinlab / GAMBLR

Set of standardized functions to operate with genomic data
MIT License
4 stars 2 forks source link

Update aSHM heatmap functionality #220

Closed lkhilton closed 1 year ago

lkhilton commented 1 year ago

This PR includes a new helper function to consistently handle genomic region arguments to various functions. aSHM heatmaps will no longer be generated by get_mutation_frequency_bin_matrix - instead this function now returns only a matrix of mutation frequencies per sliding window, and a new function heatmap_mutation_frequency_bin generates the heatmap. This enables users to quickly generate mutation frequency matrixes without needing to ensure metadata are available and tidy.

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for all PRs

Required

gambl_meta <- get_gambl_metadata(seq_type_filter = c("capture", "genome")) %>% filter(pathology == "DLBCL")

process_regions(only_regions = "MYC")

calc_mutation_frequency_sliding_windows( this_region = "8:128748352-128749427", these_samples_metadata = gambl_meta )

get_mutation_frequency_bin_matrix( these_samples_metadata = gambl_meta, only_regions = c("MYC", "BCL2", "BCL6") )

heatmap_mutation_frequency_bin( these_samples_metadata = gambl_meta, only_regions = c("MYC", "BCL2", "BCL6"), sortByColumns = "lymphgen" )


- [x] I ensured all dplyr functions that commonly conflict with other packages are fully qualified. 

This can be checked and addressed by running `check_functions.pl` and responding to the prompts. Test your code _after_ you do this.

NOTE FROM LAURA: I ran this script and it's cool but also clearly lots of people have not been running it because it had TONS of suggested changes unrelated to the changes I made today. 

- [x] I generated the documentation and checked for errors relating to the new function (e.g. `devtools::document()`) and added `NAMESPACE` and all other modified files in the root directory and under `man`. 

- [ ] I have rebuilt the site with `pkgdown::build_site(lazy = TRUE)` to reflect any updated package documentation.
This failed for me with the following error: 

Writing '404.html' -- Building function reference ------------------------------------------------- Error : In '_pkgdown.yml', topic must be a known topic name or alias ✖ Not 'grch37_ashm_regions' Error: callr subprocess failed: In '_pkgdown.yml', topic must be a known topic name or alias ✖ Not 'grch37_ashm_regions' Type .Last.error.trace to see where the error occurred


### Optional but preferred with PRs

- [x] I updated and/or successfully knitted a vignette that relies on the modified code (which ones?)
ssm_tutorial.Rmd has been updated to use `heatmap_mutation_frequency_bin`. 

## Checklist for New Functions

### Required

- [x] I documented my function using [ROxygen style](https://jozef.io/r102-addin-roxytags/#:~:text=Inserting%20a%20skeleton%20%2D%20Do%20this,Shift%2BAlt%2BR%20).)

- [x] Adequate function documentation (see [new-function documentation template](https://github.com/morinlab/GAMBLR#title) for more info)

- [x] My function uses a library that isn't already a dependency of GAMBLR and I made the package aware of this dependency using the function documentation `import` statement. 

Example:

' @return nothing

' @export

' @import tidyverse ggrepel



## Checklist for changes to existing code

- [x] I added/removed arguments to a function and updated documentation for all changed/new arguments

- [x] I tested the new code for compatibility with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)
This PR reflects a substantial enough change that backward compatibility is not maintained. 
Kdreval commented 1 year ago

@lkhilton Please let me know if this is ready to merge or there is something to be done to address Ryan's comments? There seems to be some merge conflicts - do you want me to fix those?

lkhilton commented 1 year ago

@rdmorin I think I've addressed all your comments and harmonized all the arguments to match the get_ssm_by_region(s) functions.

@Kdreval and I spoke about separating these out into new scripts because he is working on implementing this across the package. Kostia, would you mind taking a look at this PR as well?

lkhilton commented 1 year ago

Also if you want to do any testing, this code should all work:


gambl_meta <- get_gambl_metadata(seq_type_filter = c("capture", "genome")) %>%
    filter(pathology == "DLBCL")

regions <- GAMBLR:::process_regions(only_regions = c("MYC", "BCL2", "BCL6"))

calc_mutation_frequency_bin_region(
    region = regions$regions_list[1], 
    these_samples_metadata = gambl_meta, 
) 

calc_mutation_frequency_bin_regions(
    these_samples_metadata = gambl_meta,
    regions_list = regions$regions_list[c(1, 7, 9)]
)

heatmap_mutation_frequency_bin(
    these_samples_metadata = gambl_meta,
    regions_list = regions$regions_list[c(1, 7, 9)],
    sortByColumns = c("lymphgen", "seq_type")
)
Kdreval commented 1 year ago

Thanks so much for giving the test code! It returned some errors for me though. Here is the error message:

r$> calc_mutation_frequency_bin_regions(
        these_samples_metadata = gambl_meta,
        regions_list = regions$regions_list[c(1, 7, 9)]
    )
Warning: Multiple regions in the provided data frame have the same name. Merging these entries based on min(start) and max(end) per name value. 
processing bins of size 500 across 8106 bp region
Using GAMBLR::get_ssm_by_region...
                                                                                                                                                                                                    Joining with `by = join_by(sample_id)`
Joining with `by = join_by(window_start)`
Error in `dplyr::bind_rows()`:
! Argument 2 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In parallel::mclapply(names(regions), function(x) { :
  scheduled core 2 encountered error in user code, all values of the job will be affected

r$> heatmap_mutation_frequency_bin(
        these_samples_metadata = gambl_meta,
        regions_list = regions$regions_list[c(1, 7, 9)],
        sortByColumns = c("lymphgen", "seq_type")
    )
Warning: Multiple regions in the provided data frame have the same name. Merging these entries based on min(start) and max(end) per name value. 
processing bins of size 500 across 8106 bp region
Using GAMBLR::get_ssm_by_region...
                                                                                                                                                                                                    Joining with `by = join_by(sample_id)`
Joining with `by = join_by(window_start)`
Error in `dplyr::bind_rows()`:
! Argument 2 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In parallel::mclapply(names(regions), function(x) { :
  scheduled core 2 encountered error in user code, all values of the job will be affected

Probably related to NAs here in the regions_list?

r$> regions$regions_list[c(1, 7, 9)]
               BCL6_TSS                    <NA>                    <NA> 
"3:187458526-187464632"                      NA                      NA 

Here is the object regions:

r$> regions
$regions_list
               BCL6_TSS                 MYC_TSS           BCL2_intronic                BCL2_TSS 
"3:187458526-187464632" "8:128748352-128749427"  "18:60796984-60814103"  "18:60982728-60988342" 

$regions_bed
  chrom     start       end gene   region regulatory_comment          name
1     3 187458526 187464632 BCL6      TSS               <NA>      BCL6_TSS
2     8 128748352 128749427  MYC      TSS    active_promoter       MYC_TSS
3    18  60796984  60814103 BCL2 intronic    strong_enhancer BCL2_intronic
4    18  60982728  60988342 BCL2      TSS    active_promoter      BCL2_TSS

If i change the regions_list to have something from the bed file and no NAs like this: regions$regions_list[c(1, 2, 4)], the code works and the last command generates this plot image

So I think it is working as expected, thank you for updating the code!

The GitHub actions seem to fail on the latest commit for documentation reasons. This is the relevant part from the log, I think it just needs a documentation update and the PR will be ready to merge :rocket:

lkhilton commented 1 year ago

Thanks for testing it out, Kostia. Were you using the latest GAMBLR.data? I think my testing used the latest which seems to have a bunch more BCL6 regions than it used to, so subsetting the named vector returned by process_regions by position probably looked at empty slots if you were using older GAMBLR.data?

Kdreval commented 1 year ago

Thanks Laura, this was exactly the issue - I updated to the latest version and example code worked right away with no issues!