morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

`fancy_qc_plot` bugs #118

Closed lkhilton closed 1 year ago

lkhilton commented 2 years ago

When I try to run GAMBLR::fancy_qc_plot as follows:

setwd("/projects/rmorin/projects/DLBCL_Trios/DLBCL_Trios_Manuscript")

trios_md <- read_tsv("data/metadata/metadata_sequencing.tsv") %>% 
  rename(sample_id = DNAseq_sample_id)

fancy_qc_plot(
  these_samples = trios_md$sample_id, 
  plot_data = "MeanCorrectedCoverage", 
  metadata = trios_md
)

I get the following error:

Error in `h()`:                                                                                                                                                                     
! Problem with `filter()` input `..1`.
ℹ Input `..1` is `sample_id %in% dplyr::pull(sample_table, sample_id)`.
x error in evaluating the argument 'table' in selecting a method for function '%in%': no applicable method for 'pull' applied to an object of class "character"

This traces back to the internal call to GAMBLR:::collate_qc_data, but running that doesn't give me a bug.

qc_data <- GAMBLR:::collate_qc_results(trios_md) # Returns a lovely table of qc data

So something about the way fancy_qc_plot() is passing the metadata table to collate_qc_results is not working properly.

Also could you please update the documentation to show all the possible QC metric arguments that can be supplied to plot_data along with what each shows? E.g. I'm not totally sure what MeanCorrectedCoverage returns, and it was hard to figure out that using the return_plotdata argument was there just to display the variables.

Lastly I think that collate_qc_results should be a full exported function so that users can make supplemental tables for writing papers.

TY!

mattssca commented 2 years ago

Thanks for testing out the function and your feedback Laura.

So this error highlights a known annoying issue with this function. In order for it to actually plot the samples provided in the these_samples parameter, whatever is given to that parameter has to be a df with the first column named "sample_id".

I will work on an update to this so that it can also just take a list of IDs. In the example provided above, if I first subset trios_md$sample_id to a new df and rename the column to sample_id and then give that to the parameter it works.

This example also highlights another issue, the function currently expects you to specify fill_by in order to get the correct colour palette from GAMBLR. I should update this so that this parameter is optional and allows you to specify the colour palette on your own. For example, if I go fill_by = "pathology" I get the following plot returned (attached).

I can absolutely update the documentation to make it more informative as well as export collate_qc_results.

Thanks,

setwd("/projects/rmorin/projects/DLBCL_Trios/DLBCL_Trios_Manuscript")

trios_md <- read_tsv("data/metadata/metadata_sequencing.tsv") %>% 
  rename(sample_id = DNAseq_sample_id)

samples = trios_md$sample_id
samples = as.data.frame(samples)
colnames(samples)[1] = "sample_id"

fancy_qc_plot(
  these_samples = samples, 
  plot_data = "MeanCorrectedCoverage", 
  metadata = trios_md, plot_title = "Example Plot", 
  plot_subtitle = "trios_md", 
  y_axis_lab = "Mean Corrected Coverage", 
  fill_by = "pathology")

example_plot

mattssca commented 2 years ago

I would also like to point out that for the samples you're interested in, QC data is only returned for 6 (out of 215) samples. i.e we don't have QC data for the other 209 samples in the merged file that collate_qc_results reads.

("/projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/qc-1.0/99-outputs/genome.qc_metrics.tsv" and "/projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/qc-1.0/99-outputs/genome.qc_metrics.tsv")

rdmorin commented 2 years ago

Other gamblr functions use a vector of sample_id rather than a data frame. Wouldn't it be better to have these functions work the same way?

mattssca commented 2 years ago

Indeed, I am updating this function to do the same. This inconvenient feature is only applicable to fancy plotting functions that call collate_qc_results (I will update them all), all other fancy plotting functions use a vector of sample_id. Thanks,

lkhilton commented 2 years ago

I would also like to point out that for the samples you're interested in, QC data is only returned for 6 (out of 215) samples. i.e we don't have QC data for the other 209 samples in the merged file that collate_qc_results reads.

("/projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/qc-1.0/99-outputs/genome.qc_metrics.tsv" and "/projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/qc-1.0/99-outputs/genome.qc_metrics.tsv")

I suspect the QC module has only been run on fresh frozen samples so far.

mattssca commented 2 years ago

Yes, I think that is an accurate assumption.

rdmorin commented 2 years ago

I think Kostia prioritized getting samples we used in the MIRAGE review run at first

rdmorin commented 2 years ago

As an aside: I've always preferred to use "these_samples_metadata" to allow the user to provide their metadata table that has been subset to the samples they're working with to directly get the sample_id for the function. Most of my functions allow the user to provide one or the other. Which one is more useful probably depends on the user's style so it's nice to have both available whenever it's feasible.

mattssca commented 2 years ago

Agree, I think this is a nice feature. Should be an easy addition to these functions. I am on it.

mattssca commented 2 years ago

The above-discussed bugs/improvements have been addressed in this commit