morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

Error in `get_gene_cn_and_expression` and `get_gene_expression` #244

Closed vladimirsouza closed 6 months ago

vladimirsouza commented 1 year ago

The error is in get_gene_expression. get_gene_cn_and_expression uses get_gene_expression internally.

To reproduce the error:

library(GAMBLR)

# gene_exp internally uses this non-exported helper function
check_config_value = function(config_key){
  if(is.null(config_key)){
    stop(paste0("ATTENTION! The above described key is missing from the config, make sure your config is up to date"))
  }else{
    return(config_key)
  }
}

gene_exp = get_gene_expression(hugo_symbols = "MYC", join_with = "genome")

# [1] "grep -w -F -e Hugo_Symbol -e MYC /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv"
# Error in `left_join()`:
# ! Can't join `x$sample_id` with `y$genome_sample_id` due to incompatible types.
# ℹ `x$sample_id` is a <character>.
# ℹ `y$genome_sample_id` is a <logical>.
# Run `rlang::last_trace()` to see where the error occurred.

The error happens because grep -w -F -e Hugo_Symbol -e MYC /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv returns a table wherein genome_sample_id column contains only NAs --- the file doesn't contain a genome_sample_id column. capture_sample_id column is also missing.

For more details, see the discussion in this Slack thread.

lkhilton commented 8 months ago

Per discussion in lab meeting and on Slack here, I think we should update the Snakemake-integrated script that generates the various vst-matrix-Hugo_Symbol_tidy.tsv files and exclude the genome_sample_id and capture_sample_id columns. These will continue to cause issues because there are biopsies with multiple sample IDs, so data duplication will be an issue.

To preserve backwards functionality, the join_with argument can be preserved, but its effect should be to join the metadata table provided with these_samples_metadata to the tidy expression table by biopsy_id and patient_id. If join_with is set to genome_sample_id then the function can add the genome_sample_id column based on the provided metadata.

The function should also identify when a biopsy_id is repeated for any given ensembl_gene_id column. Users will have to determine which mrna_sample_id to prioritize based on sample metadata that doesn't (and shouldn't) exist in the vst-matrix-Hugo_Symbol_tidy.tsv table. The updated function could have an argument for a user to prioritize a protocol (Ribodepletion vs PolyA) and a preservation (any of the values in the ffpe_or_frozen column), which would ensure that any RNAseq data returned matches these specifications and to pre-empt data duplication per biopsy.

vladimirsouza commented 6 months ago

This issue was fixed by the Fix get_gene_expression #33 PR.