stemangiola / tidybulk

Brings bulk and pseudobulk transcriptomics to the tidyverse
https://stemangiola.github.io/tidybulk/
164 stars 25 forks source link

Issue with DESeq2 #215

Closed shaman-narayanasamy closed 2 years ago

shaman-narayanasamy commented 2 years ago

Dear authors/developers,

I am attempting to perform differential expression analysis with multiple methods. All of the methods work without any issue, except for DESeq2 which produces the following error: Error in DESeq2::DESeqDataSet(., design = .formula) : NA values are not allowed in the count matrix.

I checked the values of my table and found that both the scaled and raw values do not contain any NAs.

I apologise in advance if this issue was addressed previously. I was unable to locate other existing issues (open/close) related to my error. I look forward to your response.

Best regards, Shaman

stemangiola commented 2 years ago

Hello, few questions

Do all samples have all genes? Can you please thoroughly check? Are you using summarized experiment or tibble as data container? It can be that edger tollarates nas while deseq2 does not.

shaman-narayanasamy commented 2 years ago

Hello,

Yes, all the samples have all genes. In fact the initial input table was a gene (rows) x sample (columns) which was transformed using the gather function to prepare it for tidybulk analysis.

My input is a tibble as the data container.

I am not sure if it is useful information, but I am following the material from the ISMB/ECCB workshop: https://tidytranscriptomics-workshops.github.io/ismb2021_tidytranscriptomics/articles/tidytranscriptomics.html

stemangiola commented 2 years ago

Hello @shaman-narayanasamy ,

does the error appear executing the workshop as well with the provided objects?

Could you please send me a minimal reproducible example that triggers the error? You can anonymise the dataset if needed. You can send me the dataset through a google drive link.

stemangiola commented 2 years ago

Hello @shaman-narayanasamy, any news?

shaman-narayanasamy commented 2 years ago

@stemangiola , sorry. I've set this aside for a few days. I will get back to you in a few days.

-Shaman-

shaman-narayanasamy commented 2 years ago

@stemangiola ,

Once again, apologies for the late response.

Before we move further, I actually have some explaining to do about the data: it is actually mass-spectrometry-derived proteomics data, instead of transcriptomics. I actually found a relatively new R Bioconductor package that was designed to perform proteomics differential abundance/enrichment analysis, which worked quite well. This is also the reason that I forgot about this issue. However, I am happy to help you debug this issue, if you want to continue :)

Please find below the answers to your questions in your previous post:

does the error appear executing the workshop as well with the provided objects? No, the error does not appear with the workshop example. I am able to knit the markdown with the tidybulk analysis part of the tutorial with no issues. I had to make some small modifications, though. Please refer to the Appendix for more details.

Could you please send me a minimal reproducible example that triggers the error? I tried to generate a minimal working example with only two conditions for comparison (I originally have four). However, when I do this, it fails in the keep_abundant() step with the following error:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': tidybulk says: you don't have any transcript that is in all samples. Please consider using impute_missing_abundance.

Just to remind you, the full dataset fails only at the DESeq2 part (see Appendix).

I tried also tried imputing (i.e. via the impute_missing_abundance() command), which works. But, when I try to keep_abundant() again it fails. Usually, proteomics data contain a lot of missing values (ether due to randomness or non-random systematic biases). Therefore, I was not surprised that the minimum working example was failing.

Considering all this, I will have to send you the full data set in order to reproduce the issue. However, the data is confidential and I will not be able to share a public link over here. Do you mind if I emailed a download link to you and give you exclusive permission to download it.

Please let me know if we could proceed this way (and your email address).

Cheers, Shaman

Appendix

Reproducing the example from the workshop

Created a reduced version of the tidytranscriptomics.Rmd file from the ISMB 2021 workshop

I only retained the parts for bulk transcriptomics analysis (removed single cell related code).

Had to install packages and make slight adjustments to the code.

Download and install:

When installing the above packages (especially those bioconductor ones), I am usually asked to install a package called “gert", which requires compilation from the source. It keeps failing in compilation. Therefore, I just skipped it, given that it was non-essential.

Replace line 769 to load the “BRCA” data set from local system

Failure point of full data set:

 test_differential_abundance(
    ~ strain_treatment,
    method = "deseq2",
    prefix = "deseq2_"
  )
Error in DESeq2::DESeqDataSet(., design = .formula) : NA values are not allowed in the count matrix.
stemangiola commented 2 years ago

I see. Sure send me an email at mangiola.swehi.edu.au with an "at" after the "s", if you know what I mean.

shaman-narayanasamy commented 2 years ago

@stemangiola file sent.

After downloading, you can run the following:

 data <- read_tsv("<file I sent you>") %>%  
  tidybulk(.transcript = Unique.Uniprot, .sample = Sample, .abundance = LFQ_intensity)

As you can see, the LFQ_intensity column is the abundance column, while Unique.Uniprot are used as gene IDs. Please let me know if you can download the file and if you are able to run the workflow. Good luck!

stemangiola commented 2 years ago

Hello @shaman-narayanasamy ,

I found the error which is not with tidybulk but with DeSEQ2. Try

read_table2("~/data_for_stmangiola.tsv")  %>% 
as_SummarizedExperiment(Sample, Unique.Uniprot, LFQ_intensity) %>%  
DESeq2::DESeqDataSet()

I think you have extraordinarily high intensities and might overflow or something. Basically try to convert them to integer yourself, and check that all are not NA.