giannimonaco / ABIS

57 stars 16 forks source link

Unable to reproduce your paper results #16

Closed josemss closed 3 years ago

josemss commented 3 years ago

Hi. We are trying to reproduce your deconvolution method for RNA-seq using the matrix of mixes you provide on your github (TPMPBMC.txt). Your method returns a matrix of 'beta' coefficients that, theoretically, corresponds to the estimated proportions of each cell type in each sample. In the supplementary to your paper (https://pubmed.ncbi.nlm.nih.gov/30726743/) you provide in a file the true matrix of proportions of each cell type in each sample. Our problem is that, when comparing both matrices, we obtain neither the correlations nor the concordance correlation coefficients that you show in the paper. The correlations and concordances are very far from what you present in the paper. Also, we do not understand (we do not know how you do) the normalization 'with tilde' TPM_RLM with which it seems that the proportions should look more like the real ones. Could you clarify all these doubts? It would also help us if you provide us with the R code for the TPM_RLM normalization. Thank you.

giannimonaco commented 3 years ago

Hello, I am sorry to hear that you have problems reproducing the results. This is weird. Can you share what you got? The tilde means that the normalization is done using a factor that is unique for each cell type (and not a factor unique for each sample). The factors used are the ones reported below. They are used to normalize the signature matrix before deconvolution. Something like: SignatureMatrixNormalized<- sweep(SignatureMatrix, MARGIN=2, AbundanceNormalizationFactors, *)

"Monocytes C" 1.60188683963836 "NK" 0.749021326878403 "T CD8 Memory" 0.680038733676581 "T CD4 Naive" 0.803789736569463 "T CD8 Naive" 0.554283071383344 "B Naive" 0.830474573526442 "T CD4 Memory" 0.915518848522562 "MAIT" 0.586886579159884 "T gd Vd2" 0.978694612616693 "Neutrophils LD" 0.591354344063847 "T gd non-Vd2" 0.794170174589765 "Basophils LD" 0.468297600598404 "Monocytes NC+I" 1.69412826660189 "B Memory" 0.876070158504131 "mDCs" 4.98460363063607 "pDCs" 1.40380514821998 "Plasmablasts" 9.120624594303 "Progenitors" 5.04575898035311

On Apr 14, 2021, at 2:35 PM, josemss @.***> wrote:

Hi. We are trying to reproduce your deconvolution method for RNA-seq using the matrix of mixes you provide on your github (TPMPBMC.txt). Your method returns a matrix of 'beta' coefficients that, theoretically, corresponds to the estimated proportions of each cell type in each sample. In the supplementary to your paper (https://pubmed.ncbi.nlm.nih.gov/30726743/ https://pubmed.ncbi.nlm.nih.gov/30726743/) you provide in a file the true matrix of proportions of each cell type in each sample. Our problem is that, when comparing both matrices, we obtain neither the correlations nor the concordance correlation coefficients that you show in the paper. The correlations and concordances are very far from what you present in the paper. Also, we do not understand (we do not know how you do) the normalization 'with tilde' TPM_RLM with which it seems that the proportions should look more like the real ones. Could you clarify all these doubts? It would also help us if you provide us with the R code for the TPM_RLM normalization. Thank you.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/giannimonaco/ABIS/issues/16, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC2UTEA4OVR6ZOUDH4LA6YDTIWDXPANCNFSM425JWNZQ.

josemss commented 3 years ago

Hi. Thanks for your quick reply. Precisely our problem is the calculation of these normalization factors, we do not know how to obtain them. According to your paper, they are defined as the abundance of each cell type and verify: beta(i) = p.est(i) * alpha(i) or p.est(i) = beta(i) / alpha(i). To obtain them, you propose to minimize the RMSE with respect to alpha(i), but to perform this optimization it is necessary to know either p.est(i) or beta(i). The code to find the alpha values that minimize RMSE would be something like: alpha <- NULL for (i in 1:CellTypeNumber){ f <- function(alpha) {sqrt(sum(((beta/alpha) - p.real)^2))} a <- optimize(f, lower = 0, upper = 100, tol = 10^-8) alpha <- c(alpha, a$minimum) } But if the deconvolution has not been done yet, how do we know beta(i) or p.est(i)? Could you provide us with the code with which you have obtained those scale factors? Thanks a lot.

giannimonaco commented 3 years ago

Hi, We used the flow cytometry proportions and deconvolution to find the scaling factor for mRNA abundance. The code used is this:

rmse <- function(v1, v2) sqrt((mean((v1 - v2)^2))) fpp <- function(x){ tempCTS <- t(t(SignatureMatrix1[, CellType, drop=F]) x) SignatureMatrix2 <- SignatureMatrix1 SignatureMatrix2[,CellType] <- tempCTS Deconvolution <- t((apply(PBMCdata, 2, function(x) coef(rlm( as.matrix(SignatureMatrix2), x )))) 100) errRMSE<- rmse(Deconvolution[,CellType], FlowCytometry[,CellType]) errRMSE }

optRes <- c() for(CellType in SelectedCellTypes){ res <- optimize( fpp , c(0.1, 30)) optRes[CellType] <- res$minimum }

Unfortunately we did not have another datasets to find such factors and we used the same data used to generate figure 6. Ideally we should have had training and testing datasets. However, you can find an independent validation in the supplementary data.

josemss commented 3 years ago

Hi. We calculate the intersection of genes between the PBMC samples data and the signature matrix before use the code that you provide us: genes <- intersect(rownames(PBMCdata), rownames(SignatureMatrix1))

Your code: rmse <- function(v1, v2) sqrt((mean((v1 - v2)^2))) fpp <- function(x){ tempCTS <- t(t(SignatureMatrix1[, CellType, drop=F]) x) SignatureMatrix2 <- SignatureMatrix1 SignatureMatrix2[,CellType] <- tempCTS Deconvolution <- t((apply(PBMCdata, 2, function(x) coef(rlm( as.matrix(SignatureMatrix2), x )))) 100) errRMSE<- rmse(Deconvolution[,CellType], FlowCytometry[,CellType]) errRMSE } optRes <- c() for(CellType in SelectedCellTypes){ res <- optimize( fpp , c(0.1, 30)) optRes[CellType] <- res$minimum }

where SignatureMatrix1 (1296 genes x 17 samples) and PBMCdata (17487 genes x 17 cell types) are equal to your matrices called 'sigmatrixRNAseq.txt' and 'TPMPBMC.txt' respectively, both of them from https://github.com/giannimonaco/ABIS/tree/master/data. On the other hand, the vector CellTypes are defined as:

CellType [1] "B.Naive" "B.Memory" "Plasmablasts" "T.CD4.Naive" "T.CD8.Naive" [6] "T.CD4.Memory" "T.CD8.Memory" "T.gd.Vd2" "T.gd.non-Vd2" "MAIT" [11] "NK" "pDCs" "mDCs" "Monocytes.C" "Monocytes.NC+I" [16] "Neutrophils.LD" "Basophils.LD"

Finally, FlowCytometry is a matrix that you provide in the ‘Table S6' in your paper, but we only keep the columns whose names are "B.Naive", "B.Memory", "Plasmablasts", "T.CD4.Naive", T.CD8.Naive", "T.CD4.Memory", "T.CD8.Memory", "T.gd.Vd2", "T.gd.non-Vd2", "MAIT", "NK", "pDCs", "mDCs", "Monocytes.C", "Monocytes.NC+I", "Neutrophils.LD" and "Basophils.LD".

After apply this code we obtain the factors: B.Naive 0.9698418 B.Memory 0.9735265 Plasmablasts 0.9735688 T.CD4.Naive 0.9733962 T.CD8.Naive 0.9719577 T.CD4.Memory 0.9709556
T.CD8.Memory 0.9711050 T.gd.Vd2 0.9745512 T.gd.non-Vd2 0.9736880 MAIT 0.9706591 NK 0.9710176 pDCs 0.9714754 mDCs 0.9727940 Monocytes.C 0.9709770 Monocytes.NC+I 0.9733992 Neutrophils.LD 0.9722730 Basophils.LD 0.9725441
We don't know where the problem is so that our results are so different from yours.

We have also used your factors to normalize the signature matrix and do the deconvolution with CIBERSORT, but we did not obtain the correlations and concordances of figure 7 of your paper, they did not even come out significant.

giannimonaco commented 3 years ago

Hi, the 'sigmatrixRNAseq.txt' is already normalized for mRNA abundance. Hence you should not use that matrix. You should use something like the mean or median TPM values for each cell type (best if it weighted for the proportion of the composing cell types).

The rest seems correct.

josemss commented 3 years ago

Hi. Finally we have eliminated from the matrix 'sigmatrixRNAseq.txt' the effect of the abundance factors and we have been able to replicate your results. Thank you very much for your interest.

giannimonaco commented 3 years ago

Great to hear!

On May 12, 2021, at 10:13 AM, josemss @.***> wrote:

Hi. Finally we have eliminated from the matrix 'sigmatrixRNAseq.txt' the effect of the abundance factors and we have been able to replicate your results. Thank you very much for your interest.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/giannimonaco/ABIS/issues/16#issuecomment-839563654, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC2UTEBLTD7K5N3QNH7YWOTTNI2DVANCNFSM425JWNZQ .