morinlab / GAMBLR.results

A collection of functions to access results of the Genomic Analysis of Mature B-cell Lymphomas
MIT License
0 stars 0 forks source link

Fix get_gene_expression #33

Closed vladimirsouza closed 5 months ago

vladimirsouza commented 8 months ago

This pull request is to fix get_gene_expression. The error was reported in this issue.

Here are some examples running get_gene_expression.

  1. This call of get_gene_expression, using join_with = "mrna", was working before the changes in this PR and it's still working in the same way.

  2. When using join_with = "genome", the function didn't used to work but it's working now.

  3. When using all_genes = TRUE, get_gene_expression wasn't working before this PR and it's still not working (no changes related to the all_genes = TRUE scenario were applied).

gene_exp_genome_all <- get_gene_expression(all_genes = TRUE, join_with = "genome")
# Error in `vec_init()`:                                                                                             
# ! `n` must be a single number, not an integer `NA`.
# Run `rlang::last_trace()` to see where the error occurred.
# Warning message:
# In nrow * ncol : NAs produced by integer overflow

rlang::last_trace(drop = FALSE)
# <error/rlang_error>
# Error in `vec_init()`:
# ! `n` must be a single number, not an integer `NA`.
# ---
# Backtrace:
#     ▆
#  1. ├─wide_expression_data %>% ...
#  2. ├─tidyr::pivot_wider(., names_from = ensembl_gene_id, values_from = expression)
#  3. ├─tidyr:::pivot_wider.data.frame(...)
#  4. │ └─tidyr::pivot_wider_spec(...)
#  5. │   └─vctrs::vec_init(value, nrow * ncol)
#  6. └─rlang::abort(message = message, call = call)

3.1. This error happens in the calling of pivot_wider in this part

wide_expression_data = suppressMessages(read_tsv(tidy_expression_file)) %>%
        as.data.frame() %>%
        pivot_wider(names_from = ensembl_gene_id, values_from = expression)

3.2. I don't understand why pivot_wider is called there. If we subset the data to a fewer genes, pivot_wider finishes without error.

3.3. And this is the final output if I continue running get_gene_expression code after the subset in wide_expression_data showed above, which looks different from the other outputs when using different arguments — maybe it's not the desired result.

vladimirsouza commented 8 months ago

Should we remove the capture_sample_id column from the output table?

vladimirsouza commented 8 months ago

Should we filter out rows with NAs in the expression column?

Kdreval commented 8 months ago

We should keep the capture_sample_id column because there is a subset of samples where both genome and capture is available, so this column will be important to match those. The rows with NAs in the expression data should 100% stay in. Otherwise we would be silently dropping the samples and this will definitely create problems downstream. If the user wants, they can filter those out themselves. The pivot wider fail is a known bug as well. It is used to transform the data to a format where each row is a unique sample and each column is a unique gene. The all_genes is not really intended way to retrieve expression of all genes. If all_genes is the case it would be better to directly just import the wide matrix

vladimirsouza commented 8 months ago

This Slack thread discusses what to do when join_with = "genome".

vladimirsouza commented 7 months ago

If the user provides join_with = "mrna", the sample_id column of the output refers to mrna sample ids. If join_with = "genome", sample_id column refers to genome samples. But the output does not mention this (same column name). I was thinking of making the function print a warning message like Since join_with = "mrna" was provided, the sample_id column refers to mrna sample IDs. Is this helpful? Or is it too obvious?

Moreover, if all_genes = TRUE, I think the sample_id column should show mrna sample ids, right? Here it's not too obvious and should be worthy to print the warning message.

rdmorin commented 7 months ago

From the last commit it looks like you've resolved this. Can you confirm if this is ready for re-review @vladimirsouza ?

vladimirsouza commented 7 months ago

I think it is solved, but would like to do some checks first. Moreover, I need to add some columns that Laura suggested.

vladimirsouza commented 7 months ago

I have done some checks on the data and it seems that the problem with biopsy_ids associated with multiple mrna_sample_ids doesn't happen, at least at the moment. This is because, although the metadata might contain more than one mrna sample_ids associated with a biopsy_id, this doesn't happen in the vst-matrix-Hugo_Symbol_tidy.tsv file --- always only one mrna_sample_id per biopsy_id, and the other mrna_sample_id of that same biopsy_id is missing in the file. This happens both in the new (vst-matrix-Hugo_Symbol_tidy_test.tsv; waiting to see the full file) and old vst-matrix-Hugo_Symbol_tidy.tsv file. But I think it's still worth adding those columns that Laura suggested to deal with duplications in case these mrna sample ids are included in the vst-matrix-Hugo_Symbol_tidy.tsv file.

vladimirsouza commented 7 months ago

I have done a lot of checks on the metadata and the gene expression file (vst-matrix-Hugo_Symbol_tidy(_test).tsv) and these are some of the most relevant results and conclusions that I got.

Although we can't have duplications using the current data, I think it's worth having a code that can prevent them for future data.

vladimirsouza commented 7 months ago

For this PR, I'm going to submit the last commit (if everything else is fine) to address Laura's idea about filtering duplications using protocol and preservation columns.

vladimirsouza commented 7 months ago

get_gene_expression is ~hard-coded to~ fetch gene expression data from /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv. But for checking purpose, we can change it to use the test file /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv, which has the new structure — no genome_sample_id and capture_sample_id columns, and contains the new columns protocol and ffpe_or_frozen.

vladimirsouza commented 7 months ago

In the output table, each row is a genome_sample_id. mrna_sample_id is NA when there is no mrna_sample_id from the biopsy_id also associated with the genome_sample_id, so there is no way to associate gene expressions with this genome_sample_id.

vladimirsouza commented 7 months ago

This example is to test the code part where duplicated expressions are filtered out based on prioritize_rows_by.

I used the new test file vst-matrix-Hugo_Symbol_tidy_test.tsv.

This is my prioritize_rows_by.

prioritize_rows_by <- c(protocol = "p1", ffpe_or_frozen = "f2")

I created a hypothetical expression_wider in which we have the following cases:

I used hypothetical values for columns protocol and ffpe_or_frozen.

# creating a hypothetical expression_wider
> expression_wider <- structure(list(sample_id = c("PBL_WES_000103", "PBL_WES_000103", 
"PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707_vlad", "PBL_WES_000707_vlad", 
"PBL_WES_001109", "PBL_WES_001109_vlad", "PBL_WES_001702"), biopsy_id = c("PBL_WES_000103", 
"PBL_WES_000103", "PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707", 
"PBL_WES_000707", "PBL_WES_001109", "PBL_WES_001109", "PBL_WES_001702"
), ensembl_gene_id = c("ENSG00000136997", "ENSG00000136997", 
"ENSG00000136997", "ENSG00000136997", "ENSG00000136997", "ENSG00000136997", 
"ENSG00000136997", "ENSG00000136997", "ENSG00000136997"), Hugo_Symbol = c("MYC", 
"MYC", "MYC", "MYC", "MYC", "MYC", "MYC", "MYC", "MYC"), mrna_sample_id = c("PBL_RNA_000103", 
"PBL_RNA_000103_vlad", "PBL_RNA_001540", "PBL_RNA_001540_vlad", 
"PBL_RNA_001540", "PBL_RNA_001540_vlad", "PBL_RNA_001109", "PBL_RNA_001109", 
"PBL_RNA_001702"), expression = c(11.2566504837931, 11.2566504837931, 
9.19617748803975, 9.19617748803975, 9.19617748803975, 9.19617748803975, 
12.177981406276, 12.177981406276, 10.770873354756), patient_id = c("PBL_WES_000103", 
"PBL_WES_000103", "PBL_WES_000707", "PBL_WES_000707", "PBL_WES_000707", 
"PBL_WES_000707", "PBL_WES_001109", "PBL_WES_001109", "PBL_WES_001702"
), protocol = c("p1", "p2", "p1", "p1", "p2", "p2", "p1", "p1", 
"p1"), ffpe_or_frozen = c("f1", "f2", "f1", "f2", "f1", "f2", 
"f1", "f1", "f1")), row.names = c(NA, -9L), class = "data.frame")

> expression_wider
            sample_id      biopsy_id ensembl_gene_id Hugo_Symbol      mrna_sample_id expression     patient_id protocol ffpe_or_frozen
1      PBL_WES_000103 PBL_WES_000103 ENSG00000136997         MYC      PBL_RNA_000103  11.256650 PBL_WES_000103       p1             f1
2      PBL_WES_000103 PBL_WES_000103 ENSG00000136997         MYC PBL_RNA_000103_vlad  11.256650 PBL_WES_000103       p2             f2
3      PBL_WES_000707 PBL_WES_000707 ENSG00000136997         MYC      PBL_RNA_001540   9.196177 PBL_WES_000707       p1             f1
4      PBL_WES_000707 PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p1             f2
5 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997         MYC      PBL_RNA_001540   9.196177 PBL_WES_000707       p2             f1
6 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p2             f2
7      PBL_WES_001109 PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1
8 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1
9      PBL_WES_001702 PBL_WES_001702 ENSG00000136997         MYC      PBL_RNA_001702  10.770873 PBL_WES_001702       p1             f1

Here I start to run the code from line 199. First, the code adds the column multi_exp.

> # add column to inform if there are more than one mrna sample id for a genome sample id
>     expression_wider = split(expression_wider$mrna_sample_id, expression_wider$sample_id) %>% 
+       .[lengths(.) > 1] %>% 
+       names %>% 
+       { mutate(expression_wider, multi_exp = ifelse(sample_id %in% ., 1, 0)) }

> expression_wider
            sample_id      biopsy_id ensembl_gene_id Hugo_Symbol      mrna_sample_id expression     patient_id protocol ffpe_or_frozen multi_exp
1      PBL_WES_000103 PBL_WES_000103 ENSG00000136997         MYC      PBL_RNA_000103  11.256650 PBL_WES_000103       p1             f1         1
2      PBL_WES_000103 PBL_WES_000103 ENSG00000136997         MYC PBL_RNA_000103_vlad  11.256650 PBL_WES_000103       p2             f2         1
3      PBL_WES_000707 PBL_WES_000707 ENSG00000136997         MYC      PBL_RNA_001540   9.196177 PBL_WES_000707       p1             f1         1
4      PBL_WES_000707 PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p1             f2         1
5 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997         MYC      PBL_RNA_001540   9.196177 PBL_WES_000707       p2             f1         1
6 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p2             f2         1
7      PBL_WES_001109 PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1         0
8 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1         0
9      PBL_WES_001702 PBL_WES_001702 ENSG00000136997         MYC      PBL_RNA_001702  10.770873 PBL_WES_001702       p1             f1         0

And now, duplicated gene expression rows are filtered out.

### filter out duplicated expressions based on prioritize_rows_by
> if(!missing(prioritize_rows_by)){
+       
+       prioritization_colNames = names(prioritize_rows_by)
+       prioritization_values = unname(prioritize_rows_by)
+       
+       # take only duplicated gene expression rows and split them by biopsy_id and sample_id
+       multi_exp_only_split = dplyr::filter(expression_wider, multi_exp == 1)
+       split_by = paste(multi_exp_only_split$sample_id, multi_exp_only_split$biopsy_id)
+       multi_exp_only_split = split(multi_exp_only_split, split_by)
+       
+       # use the first given column for the filtering. if any duplication remains, 
+       # use the next column, and so on.
+       for(i in seq_along(prioritize_rows_by)){
+         multi_exp_only_split = lapply(multi_exp_only_split, function(x){
+           k = x[,prioritization_colNames[i]] == prioritization_values[i]
+           if( !any(k) ) k[] = TRUE  # if there is no row with a high priority value, 
+                                      # keep expressions with lower priority
+           x[k,]
+         })
+       }
+       
+       # update values in multi_exp column
+       row_num = sapply(multi_exp_only_split, nrow)
+       no_dup = row_num == 1
+       no_multi_exp = mapply(function(nd, rn){
+         rep(nd, rn)
+       }, no_dup, row_num, SIMPLIFY = FALSE, USE.NAMES = FALSE) %>% 
+         unlist
+       multi_exp_only = bind_rows(multi_exp_only_split) ; rm(multi_exp_only_split)
+       multi_exp_only$multi_exp[no_multi_exp] = 0
+       
+       # update expression_wider. this time duplicated rows fixed (for those biopsy ids 
+       # that was possible)
+       expression_wider = dplyr::filter(expression_wider, multi_exp == 0) %>% 
+         bind_rows(multi_exp_only) %>% 
+         arrange(Hugo_Symbol, ensembl_gene_id, biopsy_id, sample_id)
+     }

# the final output table
> expression_wider
            sample_id      biopsy_id ensembl_gene_id Hugo_Symbol      mrna_sample_id expression     patient_id protocol ffpe_or_frozen multi_exp
1      PBL_WES_000103 PBL_WES_000103 ENSG00000136997         MYC      PBL_RNA_000103  11.256650 PBL_WES_000103       p1             f1         0
2      PBL_WES_000707 PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p1             f2         0
3 PBL_WES_000707_vlad PBL_WES_000707 ENSG00000136997         MYC PBL_RNA_001540_vlad   9.196177 PBL_WES_000707       p2             f2         0
4      PBL_WES_001109 PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1         0
5 PBL_WES_001109_vlad PBL_WES_001109 ENSG00000136997         MYC      PBL_RNA_001109  12.177981 PBL_WES_001109       p1             f1         0
6      PBL_WES_001702 PBL_WES_001702 ENSG00000136997         MYC      PBL_RNA_001702  10.770873 PBL_WES_001702       p1             f1         0

Then you can see that each genome_sample_id is associated with a single mrna_sample_id — only one expression level.

lkhilton commented 6 months ago

I tested the updated function by installing this branch with renv::install("morinlab/GAMBLR.results@0b8e4ceb5f4875c70e50e02e5d762156645a5414") - taking the SHA of the most recent commit on this PR.

I changed my config.yml file to pull in the test table, and I can confirm that this is retrieved by GAMBLR:

> config::get("results_merged")$tidy_expression_path
[1] "icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv"

I ran the following code to test the function on the data available in the above table:


mrna_rb <- get_gambl_metadata(seq_type_filter = "mrna") %>% 
filter(unix_group == "icgc_dart") %>% 
filter(protocol == "Ribodepletion")

mrna_test <- get_gambl_metadata() %>%
    filter(patient_id %in% mrna_rb$patient_id) 

get_gene_expression(
    these_samples_metadata = mrna_test, 
    join_with = "genome", 
    hugo_symbols = c("MYC", "BCL2")
) %>% 
filter(!is.na(MYC)) 
# Returns an empty tibble

get_gene_expression(
    these_samples_metadata = mrna_rb, 
    join_with = "mrna", # This can be any value, it doesn't change the output
    hugo_symbols = c("MYC", "BCL2")
) %>% 
filter(!is.na(MYC))
# Returns: 
# A tibble: 16 × 6
#   sample_id       patient_id protocol      ffpe_or_frozen  BCL2   MYC
#   <chr>           <chr>      <chr>         <chr>          <dbl> <dbl>

I think the expected behaviour here is that join_with = genome should return a tibble of gene expression values for any RNAseq sample that matches the provided genome metadata based on patient_id and biopsy_id, and the sample_id column in the output should be derived from the seq_type specified by join_with. Instead it seems that the function requires metadata containing rows with seq_type == mrna to work.

Have I done something wrong in testing?

vladimirsouza commented 6 months ago

I changed get_gene_expression to use both patient_ids and biopsy_ids to match mrna sample ids with genome/capture sample ids (needed when join_with = "genome" or "capture")

To test the new get_gene_expression version, I was using the test file icgc_dart/DESeq2-0.0_salmon-1.0/mrna--Ribodepletion/vst-matrix-Hugo_Symbol_tidy_test.tsv as my tidy_expression_path. However, due to some incompatibilities between this file and the metadata, same patient_ids from both tables frequently showed different biopsy_ids. Hence, the expression of many sample ids were NAs, because genome/capture samples in the metadata couldn't be matched to mrna samples in the tidy_expression_path file.

To be possible to continue testing get_gene_expression, Laura created the new tidy_expression_path test file gambl/DESeq2-0.0_salmon-1.0/mrna--PolyA/vst-matrix-Hugo_Symbol_tidy_test.tsv, which is more compatible with the metadata.

Then, to test the function again using Laura's tidy_expression_path file, I subset my these_metadata_samples to contain only patient_idss that are also contained in the expression table.

# get metadata with only those patient ids that are contained in the expression data
> exp_table <- readr::read_tsv("/projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/DESeq2-0.0_salmon-1.0/mrna--PolyA/vst-matrix-Hugo_Symbol_tidy_test.tsv")
these_samples_metadata <- get_gambl_metadata(seq_type_filter = "genome") %>% 
  filter(patient_id %in% exp_table$patient_id)

> dim(these_samples_metadata)
[1] 333  86

# get gene expression
> exp_genome <- get_gene_expression(
  these_samples_metadata = these_samples_metadata, 
  join_with = "genome", 
  hugo_symbols = c("MYC", "BCL2")
)

# count NAs
> table(is.na(exp_genome$MYC))

FALSE  TRUE 
  302    31 

Important: The expression data contains only one sample id (or biopsy id) for each patient id.

Out of these 31 samples that got NA expressions, 27 are from patients that the tidy_expression_path file shows expression for another sample (biopsy) id. And the other 4 of these samples are from 2 patient ids that are not in the tidy_expression_path file.

vladimirsouza commented 5 months ago

@lkhilton @Kdreval @rdmorin From my last post above, we see that the function code is working. Have any of you done any other tests? Can this PR be approved? Is there any specific test that you would like me to show here or on Slack?

vladimirsouza commented 5 months ago

Thanks for your feedback, @lkhilton . It's fixed now.