holab-hku / DCATS

MIT License
9 stars 0 forks source link

Getting NaN value for mu during multinom_EM #13

Open parkjooyoung99 opened 1 year ago

parkjooyoung99 commented 1 year ago

Hello,

I'm experiencing an error while using DCATS for differential abundance testing with my single-cell data. Specifically, the error below occurs during the dcats_GLM step Error in if (it > min_iter && logLik_new - logLik_old < logLik_threshold) { : missing value where TRUE/FALSE needed

Upon further investigation, I found that the problem originates in the multinom_EM step E, where I encounter a situation where simMM contains a zero value which ultimately generate NaN value for mu. ## E step: expectation of count from each component for (i in seq(K)) { for (j in seq(K)){ Z[i, j] = simMM[i, j] * mu[i] / sum(mu * simMM[, j]) } }

This zero value in simMM corresponds to a pair of distinct cell types that are visibly separated on a UMAP plot (indicated by the black arrow between clusters 0 - C1QC macro and 4 - Langerhans cell ).

image

My question is, how can this issue be explained? In my understanding, having a clear distinction between cell types might not necessarily affect the differential abundance of that particular cluster.

Is my understanding correct?

Thank you!

linxy29 commented 1 year ago

Hi @parkjooyoung99

Having 0 in simMM itself might not be the direct cause of the error you're encountering. Here is a simple example where the first line of simMM is ([1, 0, 0]), and it runs without triggering the error you mentioned.

library(DCATS)
simil_mat = matrix(c(1, 0, 0, 0, 0.7, 0.3, 0, 0.3, 0.7), ncol = 3)
totals1 = c(1300, 1500)
totals2 = c(1300, 1500, 1700)
diri_s1 = c(1/3, 1/3, 1/3)
diri_s2 = c(1/3, 1/2, 1/6)
sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat)
numb_cond1 = as.matrix(sim_dat$numb_cond1)
numb_cond2 = as.matrix(sim_dat$numb_cond2)
count_mat = rbind(numb_cond1, numb_cond2)
design_df = data.frame(condition = c(rep("c1", 2), rep("c2", 3)))
dcats_GLM(count_mat, design_df, simil_mat)

The error might be caused by NA values in the count matrix. Could you provide more details to assist further?

parkjooyoung99 commented 1 year ago

@linxy29 Sorry for the late response!

I have checked if I have the NA value in my count matrix, but none were the case. I am attaching my google drive link where you can access my dataset (https://drive.google.com/file/d/1-1lLotw4ryPwg2-WdymEmUBgAdu2vmiQ/view?usp=sharing ).

And below is my code

macmo = readRDS(the data) knn_mat <- knn_simMat(macmo@graphs$RNA_snn, macmo$seurat_clusters) count_mat <- table(macmo$patient, macmo$seurat_clusters) condition_patient = table(macmo$patient, macmo$condition)

df = as.data.frame(matrix(1, ncol=8)); colnames(df) = c('ceoffs','coeffs_err','LR_vals','LRT_pvals','fdr','first','second','cluster') for (i in 1:dim(combination)[2]){ first = combination[1,i] %>% str_remove('condition2') second = combination[2,i]%>% str_remove('condition2')

print(i)

idx_first = rownames(condition_patient)[condition_patient[,first] != 0] idx_second = rownames(condition_patient)[condition_patient[,second] != 0]

design_mat <- data.frame(condition = c(rep(first, length(idx_first)), rep(second, length(idx_second)))) res <- dcats_GLM(count_mat[c(idx_first,idx_second),], design_mat, similarity_mat = knn_mat) test<-cbind(res$ceoffs,res$coeffs_err,res$LR_vals,res$LRT_pvals,res$fdr ) %>% as.data.frame() names(test) <- c('ceoffs','coeffs_err','LR_vals','LRT_pvals','fdr') test$first = first test$second = second test$cluster = rownames(test) df = rbind(df, test) } df %>% View() df = df[-1,] write.csv(df,'./macmo/DCAT_res.csv')

I hope to hear from you soon! Thank you :)