morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

Collate_results is broken #132

Closed rdmorin closed 1 year ago

rdmorin commented 1 year ago

I can't get collate_results to generate the cached outputs, nor can I get it to load the cached result. If I specify from_cache = T, it still loads the mutations as if it's ignoring that option completely. This really needs to be addressed.

collated_genome = collate_results(get_gambl_metadata() %>% dplyr::select(sample_id), TRUE, seq_type_filter = "genome", from_cache = F, write_to_file = T)

/projects/nhl_meta_analysis_scratch/gambl/results_local/shared/gambl_genome_results.tsv
Slow option: not using cached result. I suggest from_cache = TRUE whenever possible
Checking permissions on: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/maf/all_slms-3--grch37.maf
[1] "loading /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/maf/all_slms-3--grch37.maf"
Rows: 17949660 Columns: 116                                                                                                                       
── Column specification ───────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (53): Hugo_Symbol, Center, NCBI_Build, Chromosome, Strand, Variant_Classification, Variant_Type, Refere...
dbl (19): Entrez_Gene_Id, Start_Position, End_Position, t_depth, t_ref_count, t_alt_count, n_depth, n_ref_c...
lgl (44): dbSNP_Val_Status, Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allel...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mutations from 1566 samples
Joining, by = "sample_id"
Joining, by = "sample_id"
Joining, by = "sample_id"
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Error in `standardise_join_by()`:                                                                                                                 
! `by` must be supplied when `x` and `y` have no common variables.
ℹ use by = character()` to perform a cross-join.
Run `rlang::last_error()` to see where the error occurred.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 

rdmorin commented 1 year ago

collated_genome = collate_results(get_gambl_metadata() %>% dplyr::select(sample_id, patient_id, biopsy_id), write_to_file = FALSE, seq_type_filter = "genome", from_cache = TRUE)

/projects/nhl_meta_analysis_scratch/gambl/results_local/shared/gambl_genome_results.tsv
Error in file(con, "rb") : cannot open the connection
In addition: Warning messages:
1: In file(con, "rb") :
  'raw = FALSE' but '/projects/rmorin/projects/gambl-repos/gambl-crushton-canary' is not a regular file
2: In file(con, "rb") :

 Error in file(con, "rb") : cannot open the connection 

mattssca commented 1 year ago

from_cache = TRUE (FIXED)

The chunk of code that sets output_file should be moved inside if(from_cache). Currently, the code inside if(from_cache) points to a nonexistent file. I'm unsure how this ended up here; it may be an unfortunate result of an auto-merge. After updating these paths to point to shared/gambl_{seq_type_filter}_results.tsv, the function runs as expected with from_cache = TRUE.
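
As a rough illustration of the path fix described above, here is a minimal base R sketch. The helper names `cached_results_path` and `load_cached_results`, and the `base_dir` argument, are hypothetical and are not part of GAMBLR; only the shared/gambl_{seq_type_filter}_results.tsv layout comes from this thread.

```r
# Hypothetical helper (not a GAMBLR function): build the cached results path
# for a given seq_type. `base_dir` stands in for the project results directory.
cached_results_path <- function(base_dir, seq_type_filter) {
  file.path(base_dir, "shared", paste0("gambl_", seq_type_filter, "_results.tsv"))
}

# Hypothetical sketch of a from_cache branch: resolve the path first, then
# stop with a clear message if the cache file does not exist.
load_cached_results <- function(base_dir, seq_type_filter) {
  output_file <- cached_results_path(base_dir, seq_type_filter)
  if (!file.exists(output_file)) {
    stop("Cached results not found: ", output_file)
  }
  read.delim(output_file, stringsAsFactors = FALSE)
}

# Toy usage: write a small cache file into a temporary directory, then load it.
demo_dir <- tempfile("gambl_demo_")
dir.create(file.path(demo_dir, "shared"), recursive = TRUE)
write.table(
  data.frame(sample_id = c("s1", "s2"), n_mutations = c(10, 20)),
  cached_results_path(demo_dir, "genome"),
  sep = "\t", row.names = FALSE, quote = FALSE
)
cached <- load_cached_results(demo_dir, "genome")
```

Checking for the file up front means a wrong path fails with a readable message instead of the lower-level "cannot open the connection" error.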

from_cache = FALSE (NOT FIXED)

The issue with this parameter is that one of the collate functions, collate_curated_sv_results, tries to do a left_join for one of the files (genome_EBV_status), but no common variables are available, i.e., the sample_id column in genome_EBV_status.tsv is called sample. It is not possible to specify by = c("sample_id" = "sample"), since this causes an error for the other curated SV results files.

Looking into an alternative solution.

Error raised by dplyr::left_join (in collate_curated_sv_results):
    Error in `left_join()` at GAMBLR/R/utilities.R:1541:4:                                                             
    `by` must be supplied when `x` and `y` have no common variables.
    use by = character()` to perform a cross-join.
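
One possible way around the mismatched column name, sketched here under assumptions and not necessarily what the PR does, is to rename `sample` to `sample_id` before joining, so the same `by = "sample_id"` works for every file. The helper `normalize_sample_column` and all of the toy tables and values below are hypothetical.

```r
library(dplyr)

# Hypothetical helper (not a GAMBLR function): rename `sample` to `sample_id`
# only when a `sample` column is present. Tables that already use `sample_id`
# pass through unchanged because any_of() ignores missing columns.
normalize_sample_column <- function(df) {
  rename(df, any_of(c(sample_id = "sample")))
}

# Toy inputs standing in for the metadata and two curated result files.
sample_table <- tibble(sample_id = c("s1", "s2"))
ebv_status <- tibble(sample = c("s1", "s2"), ebv_status = c("positive", "negative"))
other_sv <- tibble(sample_id = "s1", curated_sv = "POS")

# With the key normalized, an explicit `by` is valid for both tables.
joined <- sample_table %>%
  left_join(normalize_sample_column(ebv_status), by = "sample_id") %>%
  left_join(normalize_sample_column(other_sv), by = "sample_id")
```

Giving `by` explicitly also avoids the repeated "Joining, by = ..." messages. Note that this sketch assumes no table contains both a `sample` and a `sample_id` column, since the rename would then produce a duplicate name.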

This issue is being addressed in this PR

rdmorin commented 1 year ago

Can you please add the code you're running to test each use case?