tengfei-emory / scBatch

Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment
GNU General Public License v3.0

How to obtain a corrected count/expression matrix? #4

Open wiceshine opened 4 years ago

wiceshine commented 4 years ago

Dear developer,

Thank you for providing this wonderful tool. I found scBatch really helpful for correcting batch effects. However, when I tried to proceed with DE analysis (e.g. with edgeR), I found that the matrix returned by scBatchCpp is not a count or expression matrix. It contains many negative values (around half of the values are negative). This is true even for the vignette example output scbatchmod in (https://github.com/tengfei-emory/scBatch-paper-scripts/blob/master/Fig3_ENCODE_script.r). As a result, the downstream DE analysis cannot be conducted. How can I obtain a corrected count/expression matrix from scBatch? Thank you very much for your help!

tengfei-emory commented 4 years ago

Thank you for your inquiry. Indeed, negative values in the output matrix are a common issue for batch effect correction tools. To be frank, when we tested our method and other methods in simulations and real data applications, we naively used matrix - min(matrix) to make all entries non-negative. Although this may introduce further artifacts, the clustering and DE analysis results did not appear to be heavily affected.
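
For readers following along, the shift described above can be sketched in a few lines. This is only an illustration of the matrix - min(matrix) idea on a made-up toy matrix, written in plain Python rather than R. The function name `shift_to_non_negative` and all values are hypothetical and are not part of scBatch. The sketch also shows one way a global shift can act as an artifact: differences between entries are preserved, but ratios are not.

```python
# Toy "corrected" matrix (rows = genes, columns = samples).
# All values are made up for illustration; they are not scBatch output.
corrected = [
    [-1.5, 0.5, 2.0],
    [0.0, -0.5, 1.0],
]


def shift_to_non_negative(matrix):
    """Subtract the global minimum so that the smallest entry becomes 0."""
    global_min = min(min(row) for row in matrix)
    return [[value - global_min for value in row] for row in matrix]


shifted = shift_to_non_negative(corrected)

# Differences between two samples of the same gene are unchanged by the shift.
diff_before = corrected[0][2] - corrected[0][1]
diff_after = shifted[0][2] - shifted[0][1]

# Ratios between the same two entries are changed by the shift.
ratio_before = corrected[0][2] / corrected[0][1]
ratio_after = shifted[0][2] / shifted[0][1]

print(shifted)
print(diff_before, diff_after)
print(ratio_before, ratio_after)
```

Because the same constant is added to every entry, any quantity that depends on ratios (for example a fold change) is altered, while differences between entries are not.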

wiceshine commented 4 years ago

Thank you for your reply! I have tried using (matrix - min(matrix)) as the count_matrix. However, since all the values are quite small, DE analysis (e.g. edgeR) cannot identify any differentially expressed genes, as all P-values are quite large (> 0.05). Therefore, I don't think matrix - min(matrix) is an appropriate way to get a count_matrix. ComBat has a solution for obtaining the count_matrix, namely ComBat_seq. I think scBatch could apply a similar approach to output the count_matrix. It would be very helpful for extending the utility of scBatch. Otherwise, the usage of scBatch, this wonderful batch effect correction tool, would be quite limited.
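
One possible reading of the "values are quite small" problem, offered as an assumption rather than a diagnosis of this dataset: tools such as edgeR are designed for read counts, and if the shifted values span only a few units, converting them to integers leaves very few distinct values per gene. The toy numbers below are made up, and the rounding step is hypothetical; the thread does not say how the shifted matrix was passed to edgeR.

```python
# Toy shifted matrix (rows = genes, columns = samples) with a narrow range.
# Values are invented for illustration only.
shifted = [
    [0.0, 0.4, 0.9, 1.3],
    [0.2, 0.3, 1.1, 1.4],
]

# Hypothetical step: round each entry to the nearest integer "count".
rounded = [[round(value) for value in row] for row in shifted]

# Count how many distinct values each gene has left after rounding.
distinct_per_gene = [len(set(row)) for row in rounded]

print(rounded)
print(distinct_per_gene)
```

In this toy case each gene is reduced to two distinct values, so most of the variation between samples is lost before any test is run.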