statOmics / satuRn

satuRn is a highly performant and scalable method for performing differential transcript usage analyses.
https://statomics.github.io/satuRn/

[BUG] Error in locfdr, ML estimation fails #32

Closed by berilerdogdu 9 months ago

berilerdogdu commented 9 months ago

Hello Jeroen,

Thank you for developing this great tool!

I am trying to run satuRn on a large dataset (n = 472 vs. 337), and I have run into an issue with the testDTU function. The error message I received is copied below:

Error in locfdr(zz = zvalues_mid, main = paste0("diagplot 1: ", main)) :
  ML estimation failed. Rerun with nulltype=2
5. stop("ML estimation failed. Rerun with nulltype=2")
4. locfdr(zz = zvalues_mid, main = paste0("diagplot 1: ", main))
3. satuRn:::p.adjust_empirical(pval, t, main = colnames(contrasts)[i],
     diagplot1 = TRUE, diagplot2 = TRUE)
2. satuRn::testDTU(object = sumExp, contrasts = L, diagplot1 = FALSE,
     diagplot2 = FALSE, sort = FALSE)
1. system.time({
     sumExp <- satuRn::testDTU(object = sumExp, contrasts = L,
       diagplot1 = FALSE, diagplot2 = FALSE, sort = FALSE)
   })

My code is also copied below. The input transcript counts are already prefiltered. I should also note that the same code works for other datasets; the error only occurs with this particular dataset.

library(SummarizedExperiment) # needed for metadata() and colData() below

set.seed(1)
exp_counts = data.frame(read.delim("tx_counts.txt", sep = ','))
rownames(exp_counts) <- exp_counts$tx_id
tx2gene <- exp_counts[,c("tx_id", "gene_id")]
colnames(tx2gene) <- c("isoform_id", "gene_id")

exp_counts = as.matrix(exp_counts[,!(names(exp_counts) %in% c('tx_id', 'gene_id'))])

tx2gene <- tx2gene[tx2gene$isoform_id %in% rownames(exp_counts), ]
tx2gene <- subset(tx2gene, 
                 duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE))

exp_counts <- exp_counts[which(
  rownames(exp_counts) %in% tx2gene$isoform_id), ]

pheno = read.delim("pheno.txt", sep = '\t')
pheno$condition[pheno$condition == 1] <- 'control'
pheno$condition[pheno$condition == 0] <- 'case'

sumExp <- SummarizedExperiment::SummarizedExperiment(
assays = list(counts = exp_counts),
colData = pheno,
rowData = tx2gene
)

metadata(sumExp)$formula <- ~ 0 + as.factor(colData(sumExp)$condition) + as.factor(colData(sumExp)$sex)

t <- system.time({
sumExp <- satuRn::fitDTU(
  object = sumExp,
  formula = ~ 0 + condition + sex,
  parallel = FALSE,
  BPPARAM = BiocParallel::bpparam(),
  verbose = TRUE
)
})
group <- as.factor(pheno$condition)
sex <- as.factor(pheno$sex)
design <- model.matrix(~ 0 + group + sex) # construct design matrix
colnames(design) <- c(levels(group), colnames(design)[3])
L <- matrix(0, ncol = 1, nrow = ncol(design)) # initialize contrast matrix
rownames(L) <- colnames(design)
colnames(L) <- c("Contrast")
L[c("case","control"),1] <-c(1,-1)

t <- system.time({
  sumExp <- satuRn::testDTU(
  object = sumExp,
  contrasts = L,
  diagplot1 = FALSE,
  diagplot2 = FALSE,
  sort = FALSE
  )
})

I am attaching a link to count and phenotype files hoping you might be able to reproduce the error.

Thank you for your help!

jgilis commented 9 months ago

Hi @berilerdogdu,

Thank you for trying satuRn! I requested access to the link that you provided.

Best, Jeroen

jgilis commented 9 months ago

Hi @berilerdogdu,

In summary, the satuRn test fails on your data because it deems nearly all the transcripts in your data as significantly differentially used, thus breaking one of satuRn's core assumptions.

In satuRn's tests, we have two sequential steps. First, we compute p-values using a regular Wald test based on the quasibinomial generalized linear models generated by satuRn::fitDTU. Next, we perform an empirical correction on those p-values.

The first step works without issues; you can run it separately by adding the following code to your current script (this is just testDTU's internal code):

models <- rowData(sumExp)[["fitDTUModels"]]
i <- 1 # only 1 contrast of interest, removing the internal for-loop for clarity

# contrast estimates and their standard errors
estimates <- vapply(X = models, FUN = function(x) tryCatch(satuRn:::getEstimates(x, 
    L[, i]), error = function(err) NA), FUN.VALUE = numeric(1))
se <- sqrt(vapply(X = models, FUN = satuRn:::varContrast, contrast = L[, 
    i], FUN.VALUE = numeric(1)))

# posterior degrees of freedom for the Wald t-test
df <- vapply(X = models, FUN = satuRn:::getDfPosterior, FUN.VALUE = numeric(1))
df[!(vapply(df, length, numeric(1)))] <- NA
df <- unlist(df)

# Wald t-statistics, two-sided p-values and regular BH-adjusted FDR
t <- estimates/se
pval <- pt(-abs(t), df) * 2
regular_FDR <- p.adjust(pval, method = "BH")

par(mfrow=c(1,2))
hist(t, breaks=100, main = "Histogram of t-statistics")
hist(pval, breaks=100, main = "Histogram of p-values")
table(regular_FDR < 0.05)

However, inspecting the obtained t-test statistics and p-values, we see that nearly all of the transcripts have very low p-values, and nearly all FDR-adjusted p-values are below 0.05 as well:

[Screenshot: histograms of the t-statistics and p-values, together with the table of regular_FDR < 0.05 counts]

The next step, which performs the empirical correction of the p-values, involves fitting a Gaussian distribution to the mid-50% of the test statistics. We internally use the locfdr package for this with default settings. The error you get tells us that this fit fails and suggests rerunning it with non-default settings. We can do this manually with this code:

library(locfdr)
# convert the two-sided p-values back to signed z-scores
zvalues <- qnorm(pval/2) * sign(t)
zvalues_mid <- zvalues[abs(zvalues) < 10] # drop extreme z-scores, as in satuRn
zvalues_mid <- zvalues_mid[!is.na(zvalues_mid)]
# nulltype = 2 is the non-default setting suggested by the error message
plot <- locfdr(zz = zvalues_mid, nulltype = 2, main = paste0("diagplot 1"))

Resulting in this plot:

[Screenshot: locfdr diagnostic plot (diagplot 1) showing the fitted empirical null]

The blue dashed curve is a Gaussian distribution fitted to the middle 50% of the z-scores, which are assumed to originate from null transcripts; it represents the estimated empirical null component density. However, for your data, the underlying assumption that at least 50% of the transcripts are not differential between groups does not hold; nearly all transcripts are differential. As such, estimating the null component density fails with default settings.
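You can verify how strongly this assumption is violated with a quick check. The sketch below is not locfdr's procedure; it is a simple Storey-type estimate of the null fraction, reusing the pval vector computed earlier in this message:

# Rough check of the "at least ~50% null transcripts" assumption.
# Simple Storey-type estimate with lambda = 0.5, not locfdr's procedure;
# pval comes from the Wald-test snippet earlier in this thread.
pi0_hat <- min(1, mean(pval > 0.5, na.rm = TRUE) / 0.5)
pi0_hat # far below 0.5 here, so too few null-like transcripts remain to anchor the empirical null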


My proposed solution for you: since nearly all genes in your data look differential, the assumptions behind the empirical correction do not hold. I would therefore advise not using the empirical correction and working with the "raw" p-values instead; those can be obtained with the code above in this message. Unless you have a biological explanation for why nearly all genes would show differential transcript usage, I would not trust the magnitude of the p-values. The ranking of the p-values, however, may still be valuable.
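For completeness, here is a minimal sketch of how you could collect and rank those raw p-values, reusing the objects (t, pval, regular_FDR) from the snippet above; the data frame layout is only illustrative, not satuRn's usual output format:

# Minimal sketch: rank transcripts by their raw Wald-test p-values.
# t, pval and regular_FDR come from the snippet earlier in this message;
# rownames(sumExp) hold the transcript identifiers.
raw_results <- data.frame(
  isoform_id = rownames(sumExp),
  t = t,
  pval = pval,
  regular_FDR = regular_FDR
)
raw_results <- raw_results[order(raw_results$pval), ] # strongest candidates first
head(raw_results)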

Ideally, I should update satuRn so that, when the empirical correction fails, it still returns the "raw" p-values. Would this be an urgently requested feature for you, or are you okay with working with the code provided above for now?
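In the meantime, a possible stop-gap (purely a sketch, not part of satuRn's current API) is to wrap the call in tryCatch and fall back to the raw p-values whenever the empirical correction errors out:

# Hypothetical stop-gap; assumes t, pval and regular_FDR were already computed
# with the snippet earlier in this message.
result <- tryCatch(
  satuRn::testDTU(object = sumExp, contrasts = L,
                  diagplot1 = FALSE, diagplot2 = FALSE, sort = FALSE),
  error = function(err) {
    message("Empirical correction failed, returning raw Wald p-values: ",
            conditionMessage(err))
    data.frame(t = t, pval = pval, regular_FDR = regular_FDR,
               row.names = rownames(sumExp))
  }
)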

Best regards,

Jeroen

berilerdogdu commented 9 months ago

Hi Jeroen,

Thank you for the thorough explanation and the recommendation. The script you provided worked for my analysis, so there is no need for an urgent feature update.

Thank you for your help and time.

Best, Beril

lkupcsik commented 2 months ago

Hi Jeroen, I see that this issue was closed by the OP, but I believe the problem still persists. Are you still planning to work on this in the near future?

jgilis commented 2 months ago

Hi @lkupcsik,

I am currently out of office for a while. Did you try the manually implemented code that I provided for the OP? Did that work? I could still implement it more properly in satuRn if that is requested, but I am afraid the earliest I could do so is the first half of September.

Best,

Jeroen

lkupcsik commented 1 month ago

Thanks for the quick response. We haven't tried the posted code snippet yet, but we'll go ahead and try it manually.