morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

fix get_cn_states output for missing samples #210

Closed vladimirsouza closed 1 year ago

vladimirsouza commented 1 year ago

This PR addresses the issue #207 - get_cn_states() output is misleading for missing samples.

Pull Request Checklists

Checklist for all PRs

Required

Kostia's examples

my_regions <- grch37_lymphoma_genes_bed %>%
  head %>%
  mutate(region = paste0(
    chromosome_name,
    ":",
    start_position,
    "-",
    end_position
  )) %>%
  pull(region)

my_meta <- get_gambl_metadata() %>%
  filter(sample_id %in% c(
    "10-18191T", # This is chromium sample omitted from CNV workflows and missing the CNV results
    "HTMCP-01-06-00485-01A-01D", # Some random ids
    "012-19-01TD"
  )
  )

# More wild example where one of the IDs is clobbered (but keeping 012-19-01TD sample)
clobbered_meta_4_samples <- my_meta %>%
  slice( c(1, 1:3) ) %>% 
  mutate(sample_id = c("Imaginary_sample", .$sample_id[2:4]))

get_cn_states(
  regions_list = my_regions,
  these_samples_metadata = clobbered_meta_4_samples
)
#                           1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285                                                                            
# 10-18191T                                NA                NA                NA                NA                  NA                  NA
# Imaginary_sample                         NA                NA                NA                NA                  NA                  NA
# 012-19-01TD                               2                 1                 1                 1                   1                   2
# HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

get_cn_states(
  regions_list = my_regions,
  these_samples_metadata = clobbered_meta_4_samples,
  missing_data_as_diploid = TRUE
)
#                           1:2487078-2496821 1:6581407-6614595 1:6650784-6674667 1:9711790-9789172 1:11166592-11322564 1:12227060-12269285                                                                            
# 10-18191T                                 2                 2                 2                 2                   2                   2
# Imaginary_sample                          2                 2                 2                 2                   2                   2
# 012-19-01TD                               2                 1                 1                 1                   1                   2
# HTMCP-01-06-00485-01A-01D                 2                 2                 2                 2                   2                   2

All examples from get_cn_states documentation

#basic usage, generic lymphoma gene list
cn_matrix = get_cn_states(regions_bed=grch37_lymphoma_genes_bed)

myc_region <- gene_to_region(
 gene_symbol = "MYC",
 genome_build = "grch37",
 return_as = "region"
)

single_gene_cn <- get_cn_states(
 regions_list = myc_region,
 region_names = "MYC"
)

# For capture
single_gene_cn <- get_cn_states(
 regions_list = myc_region,
 region_names = "MYC",
 this_seq_type = "capture"
)

Checklist for changes to existing code

get_gene_cn_and_expression is the only function that uses get_cn_states internally.

> MYC_cn_expression = get_gene_cn_and_expression("MYC")
# [1] "grep -w -F -e Hugo_Symbol -e MYC /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv"

> MYC_cn_expression
# # A tibble: 1,671 × 113
# compression bam_available patient_id sample_id       seq_type capture_space genome_build
# <chr>       <lgl>         <chr>      <chr>           <chr>    <chr>         <chr>       
#   1 bam         TRUE          00-14595   00-14595_tumorA genome   none          grch37      
# 2 cram        TRUE          00-14595   00-14595_tumorB genome   none          grch37      
# 3 cram        TRUE          00-14595   00-14595_tumorC genome   none          grch37      
# 4 bam         TRUE          00-14595   00-14595_tumorD genome   none          grch37      
# 5 cram        TRUE          00-15201   00-15201_tumorA genome   none          grch37      
# 6 cram        TRUE          00-15201   00-15201_tumorB genome   none          grch37      
# 7 bam         TRUE          00-16220   00-16220_tumorB genome   none          grch37      
# 8 cram        TRUE          00-20702   00-20702T       genome   none          grch37      
# 9 cram        TRUE          00-23442   00-23442_tumorA genome   none          grch37      
# 10 cram        TRUE          00-23442   00-23442_tumorB genome   none          grch37      
# # ℹ 1,661 more rows
# # ℹ 106 more variables: tissue_status <chr>, cohort <chr>, library_id <chr>, pathology <chr>,
# #   time_point <chr>, protocol <chr>, ffpe_or_frozen <chr>, read_length <dbl>,
# #   strandedness <chr>, seq_source_type <chr>, EBV_status_inf <chr>, link_name <chr>,
# #   data_path <chr>, unix_group <chr>, biopsy_id <chr>, fastq_link_name <chr>,
# #   fastq_data_path <chr>, COO_consensus <chr>, DHITsig_consensus <chr>,
# #   COO_PRPS_class <chr>, DHITsig_PRPS_class <chr>, DLBCL90_dlbcl_call <chr>, …
# # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
rdmorin commented 1 year ago

Can you clarify whether a sample without results will return CN 2 or NA? It looks like the imaginary example has 2, which implies we know it's copy number.

vladimirsouza commented 1 year ago

The function was returning 2 for samples without results. But, answering your question, it should return NA. However, I was a bit reluctant to make this change because the code was deliberately changing NA to 2 -- see line #fill in any sample/region combinations with missing data as diploid. So I thought there might be an application for this, which I can't see at the moment (perhaps to make filtering easier). Then I created this new missing_data_as_diploid argument to let the user decides if missing data return NA or 2. But it would be fine for me to make the function always return NA for this case.

rdmorin commented 1 year ago

My understanding is that the filling of NA to 2 was intended to deal with samples that had a copy number profile but for some segments the value was NA. Perhaps the best place to start is to confirm whether the NA -> 2 approach was right to begin with or if we should do away with it completely. @Kdreval may have more background on this.

Kdreval commented 1 year ago

I think what Vladimir did here is great. Initially this was implemented because we had a mismatch between what samples were completed the CNV or not, depending on the seq type and pairing status - and this was affecting how MySQL was set up and how we pulled from there, so it was easier to implement this way. Now that we have all in place, I like that Vladimir made it user-configurable so we respect the legacy behaviour but also the default behaviour is now providing the desired output. I think this PR can go ahead and be merged - I'll approve it now.