Fix get_gene_expression

vladimirsouza commented 8 months ago

This pull request fixes get_gene_expression. The error was reported in this issue.

Here are some examples running get_gene_expression.

1. This call of get_gene_expression, using join_with = "mrna", worked before the changes in this PR and still works in the same way.
2. With join_with = "genome", the function did not work before this PR, but it works now.
3. With all_genes = TRUE, get_gene_expression did not work before this PR and still does not work (this PR makes no changes to the all_genes = TRUE scenario).

gene_exp_genome_all <- get_gene_expression(all_genes = TRUE, join_with = "genome")
# Error in `vec_init()`:                                                                                             
# ! `n` must be a single number, not an integer `NA`.
# Run `rlang::last_trace()` to see where the error occurred.
# Warning message:
# In nrow * ncol : NAs produced by integer overflow

rlang::last_trace(drop = FALSE)
# <error/rlang_error>
# Error in `vec_init()`:
# ! `n` must be a single number, not an integer `NA`.
# ---
# Backtrace:
#     ▆
#  1. ├─wide_expression_data %>% ...
#  2. ├─tidyr::pivot_wider(., names_from = ensembl_gene_id, values_from = expression)
#  3. ├─tidyr:::pivot_wider.data.frame(...)
#  4. │ └─tidyr::pivot_wider_spec(...)
#  5. │   └─vctrs::vec_init(value, nrow * ncol)
#  6. └─rlang::abort(message = message, call = call)
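
The warning `NAs produced by integer overflow`, together with the `vec_init(value, nrow * ncol)` frame in the backtrace, suggests that the product of the two dimensions no longer fits in R's 32-bit integer type. A minimal sketch of that behaviour in base R follows; the dimensions are hypothetical and chosen only so that they exceed the limit, they are not the real dimensions of the expression data:

```r
# Hypothetical dimensions, not taken from the actual expression file.
n_row <- 50000L
n_col <- 60000L

# Largest value an R integer can hold.
.Machine$integer.max                            # 2147483647

# Integer multiplication past that limit yields NA (with a warning).
overflowed <- suppressWarnings(n_row * n_col)
is.na(overflowed)                               # TRUE

# The same product in double arithmetic does not overflow.
as_double <- as.numeric(n_row) * n_col
as_double                                       # 3e+09
```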

3.1. This error occurs in the call to pivot_wider in this part of the code:

wide_expression_data = suppressMessages(read_tsv(tidy_expression_file)) %>%
        as.data.frame() %>%
        pivot_wider(names_from = ensembl_gene_id, values_from = expression)

3.2. I don't understand why pivot_wider is called there. If we subset the data to fewer genes, pivot_wider finishes without error.
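
As a rough illustration of the subsetting mentioned in 3.2, the sketch below filters a toy long-format table to a few genes before pivoting. The column names ensembl_gene_id and expression follow the snippet above; sample_id, the gene IDs, and the values are made up for the example and are not the real data:

```r
library(dplyr)
library(tidyr)

# Toy long-format table: one row per sample and gene (hypothetical values).
tidy_expression <- tibble(
  sample_id = rep(c("S1", "S2", "S3"), each = 3),
  ensembl_gene_id = rep(c("ENSG_A", "ENSG_B", "ENSG_C"), times = 3),
  expression = c(1.2, 3.4, 5.6, 2.1, NA, 0.7, 4.4, 2.2, 1.1)
)

# Hypothetical subset of genes to keep.
genes_of_interest <- c("ENSG_A", "ENSG_C")

# Filter first, then pivot: one row per sample, one column per kept gene.
wide_subset <- tidy_expression %>%
  filter(ensembl_gene_id %in% genes_of_interest) %>%
  pivot_wider(names_from = ensembl_gene_id, values_from = expression)

dim(wide_subset)  # 3 rows (samples), 3 columns (sample_id + 2 genes)
```

Filtering before the pivot keeps the number of output cells small, which is presumably why the call completes on a subset.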

3.3. This is the final output if I continue running the get_gene_expression code after subsetting wide_expression_data as described above. It looks different from the outputs obtained with other arguments, so it may not be the desired result.

vladimirsouza commented 8 months ago

Should we remove the capture_sample_id column from the output table?

vladimirsouza commented 8 months ago

Should we filter out rows with NAs in the expression column?

Kdreval commented 8 months ago

We should keep the capture_sample_id column because there is a subset of samples for which both genome and capture data are available, so this column will be important for matching those. The rows with NAs in the expression column should 100% stay in. Otherwise we would be silently dropping samples, and this would definitely create problems downstream. If users want, they can filter those rows out themselves. The pivot_wider failure is a known bug as well. pivot_wider is used to transform the data into a format where each row is a unique sample and each column is a unique gene. The all_genes option is not really the intended way to retrieve the expression of all genes. If all genes are needed, it would be better to import the wide matrix directly.
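
A small sketch of the NA policy described above, using a toy table. Only the capture_sample_id and expression column names come from this thread; sample_id and all values are assumptions for illustration and do not reflect the real output of get_gene_expression:

```r
library(dplyr)

# Toy output table (hypothetical IDs and values).
gene_exp <- tibble(
  sample_id = c("S1", "S2", "S3"),
  capture_sample_id = c("S1_cap", NA, "S3_cap"),
  expression = c(7.3, NA, 5.9)
)

# Every sample is returned, including the one with no expression value.
nrow(gene_exp)  # 3

# A user who only wants samples with expression values filters explicitly.
gene_exp_complete <- gene_exp %>%
  filter(!is.na(expression))
nrow(gene_exp_complete)  # 2
```

Leaving the filter to the caller makes the dropped samples visible in their own code instead of disappearing inside the function.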

vladimirsouza commented 8 months ago

This Slack thread discusses what to do when join_with = "genome".