Closed lkhilton closed 1 year ago
@lkhilton Please let me know if this is ready to merge or there is something to be done to address Ryan's comments? There seems to be some merge conflicts - do you want me to fix those?
@rdmorin I think I've addressed all your comments and harmonized all the arguments to match the get_ssm_by_region(s)
functions.
@Kdreval and I spoke about separating these out into new scripts because he is working on implementing this across the package. Kostia, would you mind taking a look at this PR as well?
Also if you want to do any testing, this code should all work:
gambl_meta <- get_gambl_metadata(seq_type_filter = c("capture", "genome")) %>%
filter(pathology == "DLBCL")
regions <- GAMBLR:::process_regions(only_regions = c("MYC", "BCL2", "BCL6"))
calc_mutation_frequency_bin_region(
region = regions$regions_list[1],
these_samples_metadata = gambl_meta,
)
calc_mutation_frequency_bin_regions(
these_samples_metadata = gambl_meta,
regions_list = regions$regions_list[c(1, 7, 9)]
)
heatmap_mutation_frequency_bin(
these_samples_metadata = gambl_meta,
regions_list = regions$regions_list[c(1, 7, 9)],
sortByColumns = c("lymphgen", "seq_type")
)
Thanks so much for giving the test code! It returned some errors for me though. Here is the error message:
r$> calc_mutation_frequency_bin_regions(
these_samples_metadata = gambl_meta,
regions_list = regions$regions_list[c(1, 7, 9)]
)
Warning: Multiple regions in the provided data frame have the same name. Merging these entries based on min(start) and max(end) per name value.
processing bins of size 500 across 8106 bp region
Using GAMBLR::get_ssm_by_region...
Joining with `by = join_by(sample_id)`
Joining with `by = join_by(window_start)`
Error in `dplyr::bind_rows()`:
! Argument 2 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In parallel::mclapply(names(regions), function(x) { :
scheduled core 2 encountered error in user code, all values of the job will be affected
r$> heatmap_mutation_frequency_bin(
these_samples_metadata = gambl_meta,
regions_list = regions$regions_list[c(1, 7, 9)],
sortByColumns = c("lymphgen", "seq_type")
)
Warning: Multiple regions in the provided data frame have the same name. Merging these entries based on min(start) and max(end) per name value.
processing bins of size 500 across 8106 bp region
Using GAMBLR::get_ssm_by_region...
Joining with `by = join_by(sample_id)`
Joining with `by = join_by(window_start)`
Error in `dplyr::bind_rows()`:
! Argument 2 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In parallel::mclapply(names(regions), function(x) { :
scheduled core 2 encountered error in user code, all values of the job will be affected
Probably related to NAs here in the regions_list
?
r$> regions$regions_list[c(1, 7, 9)]
BCL6_TSS <NA> <NA>
"3:187458526-187464632" NA NA
Here is the object regions
:
r$> regions
$regions_list
BCL6_TSS MYC_TSS BCL2_intronic BCL2_TSS
"3:187458526-187464632" "8:128748352-128749427" "18:60796984-60814103" "18:60982728-60988342"
$regions_bed
chrom start end gene region regulatory_comment name
1 3 187458526 187464632 BCL6 TSS <NA> BCL6_TSS
2 8 128748352 128749427 MYC TSS active_promoter MYC_TSS
3 18 60796984 60814103 BCL2 intronic strong_enhancer BCL2_intronic
4 18 60982728 60988342 BCL2 TSS active_promoter BCL2_TSS
If i change the regions_list to have something from the bed file and no NAs like this: regions$regions_list[c(1, 2, 4)]
, the code works and the last command generates this plot
So I think it is working as expected, thank you for updating the code!
The GitHub actions seem to fail on the latest commit for documentation reasons. This is the relevant part from the log, I think it just needs a documentation update and the PR will be ready to merge :rocket:
Thanks for testing it out, Kostia. Were you using the latest GAMBLR.data? I think my testing used the latest which seems to have a bunch more BCL6 regions than it used to, so subsetting the named vector returned by process_regions by position probably looked at empty slots if you were using older GAMBLR.data?
Thanks Laura, this was exactly the issue - I updated to the latest version and example code worked right away with no issues!
This PR includes a new helper function to consistently handle genomic region arguments to various functions. aSHM heatmaps will no longer be generated by
get_mutation_frequency_bin_matrix
- instead this function now returns only a matrix of mutation frequencies per sliding window, and a new functionheatmap_mutation_frequency_bin
generates the heatmap. This enables users to quickly generate mutation frequency matrixes without needing to ensure metadata are available and tidy.Pull Request Checklists
Important: When opening a pull request, keep only the applicable checklist and delete all other sections.
Checklist for all PRs
Required
gambl_meta <- get_gambl_metadata(seq_type_filter = c("capture", "genome")) %>% filter(pathology == "DLBCL")
process_regions(only_regions = "MYC")
calc_mutation_frequency_sliding_windows( this_region = "8:128748352-128749427", these_samples_metadata = gambl_meta )
get_mutation_frequency_bin_matrix( these_samples_metadata = gambl_meta, only_regions = c("MYC", "BCL2", "BCL6") )
heatmap_mutation_frequency_bin( these_samples_metadata = gambl_meta, only_regions = c("MYC", "BCL2", "BCL6"), sortByColumns = "lymphgen" )
Writing '404.html' -- Building function reference ------------------------------------------------- Error : In '_pkgdown.yml', topic must be a known topic name or alias ✖ Not 'grch37_ashm_regions' Error: callr subprocess failed: In '_pkgdown.yml', topic must be a known topic name or alias ✖ Not 'grch37_ashm_regions' Type .Last.error.trace to see where the error occurred
' @return nothing
' @export
' @import tidyverse ggrepel