vladimirsouza commented 1 year ago

This PR addresses the issue #207 - get_cn_states() output is misleading for missing samples.

Pull Request Checklists

Checklist for all PRs

Required

[x] I tested the new code for my use case (please provide a reproducible example of how you tested the new functionality)

Kostia's examples

my_regions <- grch37_lymphoma_genes_bed %>%
  head %>%
  mutate(region = paste0(
    chromosome_name,
    ":",
    start_position,
    "-",
    end_position
  )) %>%
  pull(region)

my_meta <- get_gambl_metadata() %>%
  filter(sample_id %in% c(
    "10-18191T", # This is chromium sample omitted from CNV workflows and missing the CNV results
    "HTMCP-01-06-00485-01A-01D", # Some random ids
    "012-19-01TD"
  )
  )

# More wild example where one of the IDs is clobbered (but keeping 012-19-01TD sample)
clobbered_meta_4_samples <- my_meta %>%
  slice( c(1, 1:3) ) %>% 
  mutate(sample_id = c("Imaginary_sample", .$sample_id[2:4]))

get_cn_states(
  regions_list = my_regions,
  these_samples_metadata = clobbered_meta_4_samples
)
#                           1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285                                                                            
# 10-18191T                                NA                NA                NA                NA                  NA                  NA
# Imaginary_sample                         NA                NA                NA                NA                  NA                  NA
# 012-19-01TD                               2                 1                 1                 1                   1                   2
# HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

get_cn_states(
  regions_list = my_regions,
  these_samples_metadata = clobbered_meta_4_samples,
  missing_data_as_diploid = TRUE
)
#                           1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285                                                                            
# 10-18191T                                 2                 2                 2                 2                   2                   2
# Imaginary_sample                          2                 2                 2                 2                   2                   2
# 012-19-01TD                               2                 1                 1                 1                   1                   2
# HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

All examples from get_cn_states documentation

#basic usage, generic lymphoma gene list
cn_matrix = get_cn_states(regions_bed=grch37_lymphoma_genes_bed)

myc_region <- gene_to_region(
 gene_symbol = "MYC",
 genome_build = "grch37",
 return_as = "region"
)

single_gene_cn <- get_cn_states(
 regions_list = myc_region,
 region_names = "MYC"
)

# For capture
single_gene_cn <- get_cn_states(
 regions_list = myc_region,
 region_names = "MYC",
 this_seq_type = "capture"
)

[x] I generated the documentation and checked for errors relating to the new function (e.g. devtools::document()) and added NAMESPACE and all other modified files in the root directory and under man.

Checklist for changes to existing code

[x] I added/removed arguments to a function and updated documentation for all changed/new arguments
[x] I tested the new code for compatibility with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)

get_gene_cn_and_expression is the only function that uses get_cn_states internally.

> MYC_cn_expression = get_gene_cn_and_expression("MYC")
# [1] "grep -w -F -e Hugo_Symbol -e MYC /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv"

> MYC_cn_expression
# # A tibble: 1,671 × 113
# compression bam_available patient_id sample_id       seq_type capture_space genome_build
# <chr>       <lgl>         <chr>      <chr>           <chr>    <chr>         <chr>       
#   1 bam         TRUE          00-14595   00-14595_tumorA genome   none          grch37      
# 2 cram        TRUE          00-14595   00-14595_tumorB genome   none          grch37      
# 3 cram        TRUE          00-14595   00-14595_tumorC genome   none          grch37      
# 4 bam         TRUE          00-14595   00-14595_tumorD genome   none          grch37      
# 5 cram        TRUE          00-15201   00-15201_tumorA genome   none          grch37      
# 6 cram        TRUE          00-15201   00-15201_tumorB genome   none          grch37      
# 7 bam         TRUE          00-16220   00-16220_tumorB genome   none          grch37      
# 8 cram        TRUE          00-20702   00-20702T       genome   none          grch37      
# 9 cram        TRUE          00-23442   00-23442_tumorA genome   none          grch37      
# 10 cram        TRUE          00-23442   00-23442_tumorB genome   none          grch37      
# # ℹ 1,661 more rows
# # ℹ 106 more variables: tissue_status <chr>, cohort <chr>, library_id <chr>, pathology <chr>,
# #   time_point <chr>, protocol <chr>, ffpe_or_frozen <chr>, read_length <dbl>,
# #   strandedness <chr>, seq_source_type <chr>, EBV_status_inf <chr>, link_name <chr>,
# #   data_path <chr>, unix_group <chr>, biopsy_id <chr>, fastq_link_name <chr>,
# #   fastq_data_path <chr>, COO_consensus <chr>, DHITsig_consensus <chr>,
# #   COO_PRPS_class <chr>, DHITsig_PRPS_class <chr>, DLBCL90_dlbcl_call <chr>, …
# # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

rdmorin commented 1 year ago

Can you clarify whether a sample without results will return CN 2 or NA? It looks like the imaginary example has 2, which implies we know it's copy number.

vladimirsouza commented 1 year ago

The function was returning 2 for samples without results. But, answering your question, it should return NA. However, I was a bit reluctant to make this change because the code was deliberately changing NA to 2 -- see line #fill in any sample/region combinations with missing data as diploid. So I thought there might be an application for this, which I can't see at the moment (perhaps to make filtering easier). Then I created this new missing_data_as_diploid argument to let the user decides if missing data return NA or 2. But it would be fine for me to make the function always return NA for this case.

rdmorin commented 1 year ago

My understanding is that the filling of NA to 2 was intended to deal with samples that had a copy number profile but for some segments the value was NA. Perhaps the best place to start is to confirm whether the NA -> 2 approach was right to begin with or if we should do away with it completely. @Kdreval may have more background on this.

Kdreval commented 1 year ago

I think what Vladimir did here is great. Initially this was implemented because we had a mismatch between what samples were completed the CNV or not, depending on the seq type and pairing status - and this was affecting how MySQL was set up and how we pulled from there, so it was easier to implement this way. Now that we have all in place, I like that Vladimir made it user-configurable so we respect the legacy behaviour but also the default behaviour is now providing the desired output. I think this PR can go ahead and be merged - I'll approve it now.

morinlab / GAMBLR

fix get_cn_states output for missing samples #210

Pull Request Checklists

Checklist for all PRs

Required

Checklist for changes to existing code