morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

get_coding_ssm for capture data does not return variants #87

Closed. Kdreval closed this issue 2 years ago

Kdreval commented 2 years ago

For capture data, get_coding_ssm returns a MAF containing mutations from only 1-2 samples. This is a reproducible example:

capture_meta <- get_gambl_metadata(seq_type_filter = c('capture')) %>%
  dplyr::filter(consensus_pathology =='DLBCL') %>%
  dplyr::filter(COO_consensus == 'ABC')

capture_abc_maf <- get_coding_ssm(limit_samples = capture_meta$sample_id, 
                                  basic_columns = TRUE, 
                                  exclude_cohort = c('dlbcl_chapuy'),
                                  seq_type = "capture")

This returns data for only 1 sample:

reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/capture--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
mutations from 1023 samples
after linking with metadata, we have mutations from 1 samples

I think this is because the get_gambl_metadata call that builds all_meta here always uses the default seq_type, which is genome. I think modifying it to

all_meta = get_gambl_metadata(from_flatfile = from_flatfile, seq_type = seq_type)

should resolve the issue
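A toy sketch of the suspected failure mode, assuming made-up sample IDs and a stand-in metadata table: when the MAF is linked against metadata restricted to genome samples, almost every capture sample is dropped. The data frames and the `link_with_metadata` helper are hypothetical illustrations, not GAMBLR code.

```r
library(dplyr)

# Hypothetical metadata: the ID "X1" is assumed to exist for both seq types.
toy_meta <- data.frame(
  sample_id = c("G1", "G2", "X1", "C1", "C2", "C3", "X1"),
  seq_type  = c("genome", "genome", "genome",
                "capture", "capture", "capture", "capture")
)

# Hypothetical capture MAF with the standard Tumor_Sample_Barcode column.
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("C1", "C1", "C2", "C3", "X1"),
  Hugo_Symbol          = c("MYD88", "CD79B", "MYD88", "PIM1", "MYD88")
)

# Keep only mutations whose sample appears in metadata of the given seq_type.
# .env$seq_type refers to the function argument, not the metadata column.
link_with_metadata <- function(maf, meta, seq_type) {
  meta_subset <- dplyr::filter(meta, seq_type == .env$seq_type)
  dplyr::filter(maf, Tumor_Sample_Barcode %in% meta_subset$sample_id)
}

n_genome  <- dplyr::n_distinct(
  link_with_metadata(toy_maf, toy_meta, "genome")$Tumor_Sample_Barcode)
n_capture <- dplyr::n_distinct(
  link_with_metadata(toy_maf, toy_meta, "capture")$Tumor_Sample_Barcode)

message("linked against genome metadata: ", n_genome, " samples")
message("linked against capture metadata: ", n_capture, " samples")
```

In this toy setup only the shared ID survives the genome-metadata link, which mirrors the "1 samples" message above; whether that is the actual reason a single sample survives in the real data is an assumption.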

mattssca commented 2 years ago

Can this issue be closed?

Kdreval commented 2 years ago

Thanks Adam, I'll test this today and see if this works now!

Kdreval commented 2 years ago

I don't think this is resolved yet. Does the version in the current PR have different output?

> capture_meta <- get_gambl_metadata(seq_type_filter = c('capture')) %>%
+   dplyr::filter(consensus_pathology =='DLBCL') %>%
+   dplyr::filter(COO_consensus == 'ABC')

> capture_abc_maf <- get_coding_ssm(limit_samples = capture_meta$sample_id, 
+                                   basic_columns = TRUE, 
+                                   exclude_cohort = c('dlbcl_chapuy'),
+                                   seq_type = "capture")
reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/capture--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
|--------------------------------------------------|
|==================================================|
mutations from 1023 samples
after linking with metadata, we have mutations from 1 samples

rdmorin commented 2 years ago

We have to make sure that anywhere get_coding_ssm (or any get_* function) is called, it always requires seq_type and passes it along.
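A minimal sketch of that pattern in base R, assuming two hypothetical functions (`toy_get_metadata` and `toy_get_ssm`, neither of which is part of GAMBLR): `seq_type` has no default, its absence is an error, and the outer function forwards it explicitly rather than relying on the inner function's default.

```r
# Hypothetical inner helper: seq_type has no default, so callers must supply it.
toy_get_metadata <- function(seq_type) {
  if (missing(seq_type)) {
    stop("seq_type is required, e.g. 'genome' or 'capture'")
  }
  all_meta <- data.frame(
    sample_id = c("G1", "G2", "C1"),
    seq_type  = c("genome", "genome", "capture")
  )
  # Base R subsetting: the right-hand seq_type is the function argument.
  all_meta[all_meta$seq_type == seq_type, , drop = FALSE]
}

# Hypothetical outer function: requires seq_type and passes it along by name.
toy_get_ssm <- function(seq_type) {
  if (missing(seq_type)) {
    stop("seq_type is required, e.g. 'genome' or 'capture'")
  }
  meta <- toy_get_metadata(seq_type = seq_type)
  meta$sample_id
}

capture_ids <- toy_get_ssm(seq_type = "capture")

# Calling without seq_type fails loudly instead of silently using genome.
no_seq_type <- try(toy_get_ssm(), silent = TRUE)
```

Failing early on a missing `seq_type` is a design choice for this sketch; it trades convenience for making a silent genome default impossible.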

mattssca commented 2 years ago

This is the output I am getting when running the above code on my branch (pending PR):

reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/capture--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
|--------------------------------------------------|
|==================================================|
mutations from 1023 samples
after linking with metadata, we have mutations from 242 samples

mattssca commented 2 years ago

I believe this was fixed by removing the duplicated lines related to all_meta and adding the line all_meta = dplyr::filter(all_meta, seq_type == {{ seq_type }})

Kdreval commented 2 years ago

This will fix this issue then!

rdmorin commented 2 years ago

Is 242 the right/expected number of samples for capture? Seems low.

Kdreval commented 2 years ago

This is restricted to ABC exomes only, excluding Chapuy. The subsetting is done to test how different arguments work and is arbitrary/randomly selected, and the number seems reasonable. There are a total of 244 exomes matching those criteria, I think, so mutations from 242 is close to expected. I would need to figure out which 2 are missing, though, after the current PR is on master.
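One way to track down the missing samples, sketched with made-up IDs: compare the sample IDs expected from the metadata against the distinct Tumor_Sample_Barcode values in the returned MAF. The vectors below are hypothetical placeholders for capture_meta$sample_id and capture_abc_maf.

```r
# Hypothetical stand-in for capture_meta$sample_id.
expected_ids <- c("C1", "C2", "C3", "C4")

# Hypothetical stand-in for the returned MAF (one row per mutation).
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("C1", "C1", "C3")
)

# IDs present in the metadata but absent from the MAF.
missing_ids <- setdiff(expected_ids, unique(toy_maf$Tumor_Sample_Barcode))
print(missing_ids)
```

This only identifies which IDs are absent; it does not explain why they are absent.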

rdmorin commented 2 years ago

Got it! I'll repeat something I think I have said before, to make sure we're on the same page. I would prefer our functions to start using these_samples_metadata (i.e. automatically figuring out which samples the user wants) instead of a sample_id vector. For the above example, this would look like the following. I hope our code currently works the same both ways.

capture_meta <- get_gambl_metadata(seq_type_filter = c('capture')) %>%
  dplyr::filter(consensus_pathology == 'DLBCL') %>%
  dplyr::filter(COO_consensus == 'ABC')

capture_abc_maf <- get_coding_ssm(these_samples_metadata = capture_meta,
                                  basic_columns = TRUE,
                                  exclude_cohort = c('dlbcl_chapuy'),
                                  seq_type = "capture")
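A rough sketch of how a function could accept either form and behave the same way, assuming a hypothetical `toy_get_coding_ssm` that is not the GAMBLR implementation: when a metadata table is supplied, its sample_id column determines the samples; otherwise the sample_id vector is used.

```r
library(dplyr)

# Hypothetical function: these_samples_metadata takes precedence over limit_samples.
toy_get_coding_ssm <- function(maf,
                               limit_samples = NULL,
                               these_samples_metadata = NULL) {
  if (!is.null(these_samples_metadata)) {
    keep_ids <- these_samples_metadata$sample_id
  } else if (!is.null(limit_samples)) {
    keep_ids <- limit_samples
  } else {
    return(maf)  # no restriction requested
  }
  dplyr::filter(maf, Tumor_Sample_Barcode %in% keep_ids)
}

# Made-up inputs for illustration.
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("C1", "C2", "C3"),
  Hugo_Symbol          = c("MYD88", "CD79B", "PIM1")
)
toy_meta <- data.frame(sample_id = c("C1", "C3"))

by_ids  <- toy_get_coding_ssm(toy_maf, limit_samples = toy_meta$sample_id)
by_meta <- toy_get_coding_ssm(toy_maf, these_samples_metadata = toy_meta)

# Both call styles should select the same rows.
print(identical(by_ids, by_meta))
```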