Closed vladimirsouza closed 5 months ago
Should we remove the capture_sample_id
column from the output table?
Should we filter out rows with NAs in the expression
column?
We should keep the capture_sample_id
column because there is a subset of samples where both genome and capture is available, so this column will be important to match those.
The rows with NAs in the expression data should 100% stay in. Otherwise we would be silently dropping the samples and this will definitely create problems downstream. If the user wants, they can filter those out themselves.
The pivot wider fail is a known bug as well. It is used to transform the data to a format where each row is a unique sample and each column is a unique gene. The all_genes is not really intended way to retrieve expression of all genes. If all_genes is the case it would be better to directly just import the wide matrix
This Slack thread discusses what to do when join_with = "genome"
.
If the user provides join_with = "mrna"
, the sample_id
column of the output refers to mrna sample ids. If join_with = "genome"
, sample_id
column refers to genome samples. But the output does not mention this (same column name). I was thinking of making the function print a warning message like Since join_with = "mrna" was provided, the sample_id column refers to mrna sample IDs.
Is this helpful? Or is it too obvious?
Moreover, if all_genes = TRUE
, I think the sample_id
column should show mrna sample ids, right? Here it's not too obvious and should be worthy to print the warning message.
From the last commit it looks like you've resolved this. Can you confirm if this is ready for re-review @vladimirsouza ?
I think it is solved, but would like to do some checks first. Moreover, I need to add some columns that Laura suggested.
I have done some checks on the data and it seems that the problem with biopsy_ids
associated with multiple mrna_sample_id
s doesn't happen, at least at the moment. This is because, although the metadata might contain more than one mrna sample_ids
associated with a biopsy_id
, this doesn't happen in the vst-matrix-Hugo_Symbol_tidy.tsv
file --- always only one mrna_sample_id
per biopsy_id
, and the other mrna_sample_id
of that same biopsy_id
is missing in the file. This happens both in the new (vst-matrix-Hugo_Symbol_tidy_test.tsv
; waiting to see the full file) and old vst-matrix-Hugo_Symbol_tidy.tsv
file.
But I think it's still worth adding those columns that Laura suggested to deal with duplications in case these mrna sample ids are included in the vst-matrix-Hugo_Symbol_tidy.tsv
file.
I have done a lot of checks on the metadata and the gene expression file (vst-matrix-Hugo_Symbol_tidy(_test).tsv
) and these are some of the most relevant results and conclusions that I got.
Although we can't have duplications using the current data, I think it's worth having a code that can prevent them for future data.
For this PR, I'm going to submit the last commit (if everything else is fine) to address Laura's idea about filtering duplications using protocol
and preservation
columns.
get_gene_expression
is ~hard-coded to~ fetch gene expression data from /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv
. But for checking purpose, we can change it to use the test file /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv
, which has the new structure — no genome_sample_id
and capture_sample_id
columns, and contains the new columns protocol
and ffpe_or_frozen
.
In the output table, each row is a genome_sample_id
. mrna_sample_id
is NA when there is no mrna_sample_id
from the biopsy_id
also associated with the genome_sample_id
, so there is no way to associate gene expressions with this genome_sample_id
.
This example is to test the code part where duplicated expressions are filtered out based on prioritize_rows_by
.
I used the new test file vst-matrix-Hugo_Symbol_tidy_test.tsv
.
This is my prioritize_rows_by
.
prioritize_rows_by <- c(protocol = "p1", ffpe_or_frozen = "f2")
I created a hypothetical expression_wider
in which we have the following cases:
I used hypothetical values for columns protocol
and ffpe_or_frozen
.
# creating a hypothetical expression_wider
> expression_wider <- structure(list(sample_id = c("PBL_WES_000103", "PBL_WES_000103",
"PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707_vlad", "PBL_WES_000707_vlad",
"PBL_WES_001109", "PBL_WES_001109_vlad", "PBL_WES_001702"), biopsy_id = c("PBL_WES_000103",
"PBL_WES_000103", "PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707",
"PBL_WES_000707", "PBL_WES_001109", "PBL_WES_001109", "PBL_WES_001702"
), ensembl_gene_id = c("ENSG00000136997", "ENSG00000136997",
"ENSG00000136997", "ENSG00000136997", "ENSG00000136997", "ENSG00000136997",
"ENSG00000136997", "ENSG00000136997", "ENSG00000136997"), Hugo_Symbol = c("MYC",
"MYC", "MYC", "MYC", "MYC", "MYC", "MYC", "MYC", "MYC"), mrna_sample_id = c("PBL_RNA_000103",
"PBL_RNA_000103_vlad", "PBL_RNA_001540", "PBL_RNA_001540_vlad",
"PBL_RNA_001540", "PBL_RNA_001540_vlad", "PBL_RNA_001109", "PBL_RNA_001109",
"PBL_RNA_001702"), expression = c(11.2566504837931, 11.2566504837931,
9.19617748803975, 9.19617748803975, 9.19617748803975, 9.19617748803975,
12.177981406276, 12.177981406276, 10.770873354756), patient_id = c("PBL_WES_000103",
"PBL_WES_000103", "PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707",
"PBL_WES_000707", "PBL_WES_001109", "PBL_WES_001109", "PBL_WES_001702"
), protocol = c("p1", "p2", "p1", "p1", "p2", "p2", "p1", "p1",
"p1"), ffpe_or_frozen = c("f1", "f2", "f1", "f2", "f1", "f2",
"f1", "f1", "f1")), row.names = c(NA, -9L), class = "data.frame")
> expression_wider
sample_id biopsy_id ensembl_gene_id Hugo_Symbol mrna_sample_id expression patient_id protocol ffpe_or_frozen
1 PBL_WES_000103 PBL_WES_000103 ENSG00000136997 MYC PBL_RNA_000103 11.256650 PBL_WES_000103 p1 f1
2 PBL_WES_000103 PBL_WES_000103 ENSG00000136997 MYC PBL_RNA_000103_vlad 11.256650 PBL_WES_000103 p2 f2
3 PBL_WES_000707 PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540 9.196177 PBL_WES_000707 p1 f1
4 PBL_WES_000707 PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p1 f2
5 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540 9.196177 PBL_WES_000707 p2 f1
6 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p2 f2
7 PBL_WES_001109 PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1
8 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1
9 PBL_WES_001702 PBL_WES_001702 ENSG00000136997 MYC PBL_RNA_001702 10.770873 PBL_WES_001702 p1 f1
Here I start to run the code from line 199. First, the code adds the column multi_exp
.
> # add column to inform if there are more than one mrna sample id for a genome sample id
> expression_wider = split(expression_wider$mrna_sample_id, expression_wider$sample_id) %>%
+ .[lengths(.) > 1] %>%
+ names %>%
+ { mutate(expression_wider, multi_exp = ifelse(sample_id %in% ., 1, 0)) }
> expression_wider
sample_id biopsy_id ensembl_gene_id Hugo_Symbol mrna_sample_id expression patient_id protocol ffpe_or_frozen multi_exp
1 PBL_WES_000103 PBL_WES_000103 ENSG00000136997 MYC PBL_RNA_000103 11.256650 PBL_WES_000103 p1 f1 1
2 PBL_WES_000103 PBL_WES_000103 ENSG00000136997 MYC PBL_RNA_000103_vlad 11.256650 PBL_WES_000103 p2 f2 1
3 PBL_WES_000707 PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540 9.196177 PBL_WES_000707 p1 f1 1
4 PBL_WES_000707 PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p1 f2 1
5 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540 9.196177 PBL_WES_000707 p2 f1 1
6 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p2 f2 1
7 PBL_WES_001109 PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1 0
8 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1 0
9 PBL_WES_001702 PBL_WES_001702 ENSG00000136997 MYC PBL_RNA_001702 10.770873 PBL_WES_001702 p1 f1 0
And now, duplicated gene expression rows are filtered out.
### filter out duplicated expressions based on prioritize_rows_by
> if(!missing(prioritize_rows_by)){
+
+ prioritization_colNames = names(prioritize_rows_by)
+ prioritization_values = unname(prioritize_rows_by)
+
+ # take only duplicated gene expression rows and split them by biopsy_id and sample_id
+ multi_exp_only_split = dplyr::filter(expression_wider, multi_exp == 1)
+ split_by = paste(multi_exp_only_split$sample_id, multi_exp_only_split$biopsy_id)
+ multi_exp_only_split = split(multi_exp_only_split, split_by)
+
+ # use the first given column for the filtering. if any duplication remains,
+ # use the next column, and so on.
+ for(i in seq_along(prioritize_rows_by)){
+ multi_exp_only_split = lapply(multi_exp_only_split, function(x){
+ k = x[,prioritization_colNames[i]] == prioritization_values[i]
+ if( !any(k) ) k[] = TRUE # if there is no row with a high priority value,
+ # keep expressions with lower priority
+ x[k,]
+ })
+ }
+
+ # update values in multi_exp column
+ row_num = sapply(multi_exp_only_split, nrow)
+ no_dup = row_num == 1
+ no_multi_exp = mapply(function(nd, rn){
+ rep(nd, rn)
+ }, no_dup, row_num, SIMPLIFY = FALSE, USE.NAMES = FALSE) %>%
+ unlist
+ multi_exp_only = bind_rows(multi_exp_only_split) ; rm(multi_exp_only_split)
+ multi_exp_only$multi_exp[no_multi_exp] = 0
+
+ # update expression_wider. this time duplicated rows fixed (for those biopsy ids
+ # that was possible)
+ expression_wider = dplyr::filter(expression_wider, multi_exp == 0) %>%
+ bind_rows(multi_exp_only) %>%
+ arrange(Hugo_Symbol, ensembl_gene_id, biopsy_id, sample_id)
+ }
# the final output table
> expression_wider
sample_id biopsy_id ensembl_gene_id Hugo_Symbol mrna_sample_id expression patient_id protocol ffpe_or_frozen multi_exp
1 PBL_WES_000103 PBL_WES_000103 ENSG00000136997 MYC PBL_RNA_000103 11.256650 PBL_WES_000103 p1 f1 0
2 PBL_WES_000707 PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p1 f2 0
3 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997 MYC PBL_RNA_001540_vlad 9.196177 PBL_WES_000707 p2 f2 0
4 PBL_WES_001109 PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1 0
5 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997 MYC PBL_RNA_001109 12.177981 PBL_WES_001109 p1 f1 0
6 PBL_WES_001702 PBL_WES_001702 ENSG00000136997 MYC PBL_RNA_001702 10.770873 PBL_WES_001702 p1 f1 0
Then you can see that each genome_sample_id
is associated with a single mrna_sample_id
— only one expression level.
I tested the updated function by installing this branch with renv::install("morinlab/GAMBLR.results@0b8e4ceb5f4875c70e50e02e5d762156645a5414")
- taking the SHA of the most recent commit on this PR.
I changed my config.yml
file to pull in the test table, and I can confirm that this is retrieved by GAMBLR:
> config::get("results_merged")$tidy_expression_path
[1] "icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv"
I ran the following code to test the function on the data available in the above table:
mrna_rb <- get_gambl_metadata(seq_type_filter = "mrna") %>%
filter(unix_group == "icgc_dart") %>%
filter(protocol == "Ribodepletion")
mrna_test <- get_gambl_metadata() %>%
filter(patient_id %in% mrna_rb$patient_id)
get_gene_expression(
these_samples_metadata = mrna_test,
join_with = "genome",
hugo_symbols = c("MYC", "BCL2")
) %>%
filter(!is.na(MYC))
# Returns an empty tibble
get_gene_expression(
these_samples_metadata = mrna_rb,
join_with = "mrna", # This can be any value, it doesn't change the output
hugo_symbols = c("MYC", "BCL2")
) %>%
filter(!is.na(MYC))
# Returns:
# A tibble: 16 × 6
# sample_id patient_id protocol ffpe_or_frozen BCL2 MYC
# <chr> <chr> <chr> <chr> <dbl> <dbl>
I think the expected behaviour here is that join_with = genome
should return a tibble of gene expression values for any RNAseq sample that matches the provided genome metadata based on patient_id and biopsy_id, and the sample_id
column in the output should be derived from the seq_type specified by join_with
. Instead it seems that the function requires metadata containing rows with seq_type == mrna
to work.
Have I done something wrong in testing?
I changed get_gene_expression
to use both patient_id
s and biopsy_id
s to match mrna sample ids with genome/capture sample ids (needed when join_with = "genome"
or "capture"
)
To test the new get_gene_expression
version, I was using the test file icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv
as my tidy_expression_path
. However, due to some incompatibilities between this file and the metadata, same patient_ids from both tables frequently showed different biopsy_ids. Hence, the expression of many sample ids were NAs, because genome/capture samples in the metadata couldn't be matched to mrna samples in the tidy_expression_path file.
To be possible to continue testing get_gene_expression
, Laura created the new tidy_expression_path test file gambl/DESeq2-0.0_salmon-1.0/mrna--PolyA/vst-matrix-Hugo_Symbol_tidy_test.tsv
, which is more compatible with the metadata.
Then, to test the function again using Laura's tidy_expression_path file, I subset my these_metadata_samples
to contain only patient_ids
s that are also contained in the expression table.
# get metadata with only those patient ids that are contained in the expression data
> exp_table <- readr::read_tsv("/projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/DESeq2-0.0_salmon-1.0/mrna--PolyA/vst-matrix-Hugo_Symbol_tidy_test.tsv")
these_samples_metadata <- get_gambl_metadata(seq_type_filter = "genome") %>%
filter(patient_id %in% exp_table$patient_id)
> dim(these_samples_metadata)
[1] 333 86
# get gene expression
> exp_genome <- get_gene_expression(
these_samples_metadata = these_samples_metadata,
join_with = "genome",
hugo_symbols = c("MYC", "BCL2")
)
# count NAs
> table(is.na(exp_genome$MYC))
FALSE TRUE
302 31
Important: The expression data contains only one sample id (or biopsy id) for each patient id.
Out of these 31 samples that got NA expressions, 27 are from patients that the tidy_expression_path file shows expression for another sample (biopsy) id. And the other 4 of these samples are from 2 patient ids that are not in the tidy_expression_path file.
@lkhilton @Kdreval @rdmorin From my last post above, we see that the function code is working. Have any of you done any other tests? Can this PR be approved? Is there any specific test that you would like me to show here or on Slack?
Thanks for your feedback, @lkhilton . It's fixed now.
This pull request is to fix
get_gene_expression
. The error was reported in this issue.Here are some examples running
get_gene_expression
.This call of
get_gene_expression
, usingjoin_with = "mrna"
, was working before the changes in this PR and it's still working in the same way.When using
join_with = "genome"
, the function didn't used to work but it's working now.When using
all_genes = TRUE
,get_gene_expression
wasn't working before this PR and it's still not working (no changes related to theall_genes = TRUE
scenario were applied).3.1. This error happens in the calling of
pivot_wider
in this part3.2. I don't understand why
pivot_wider
is called there. If we subset the data to a fewer genes,pivot_wider
finishes without error.3.3. And this is the final output if I continue running
get_gene_expression
code after the subset inwide_expression_data
showed above, which looks different from the other outputs when using different arguments — maybe it's not the desired result.