Closed zji90 closed 6 years ago
Hi Zhicheng,
Thanks for using SCnorm!
The input should be un-normalized gene expression obtained from those methods. They do not necessarily need to be exact counts since measures like RSEM give non-integer expression in the form of Expected Counts. TPM/FPKM/RPKM force the sequencing depth to be exactly one million and so SCnorm would no longer be able to estimate the relationship of each gene's expression versus the sequencing depth. I would use whatever expression measure existed prior to converting to TPM/FPKM/RPKM.
I hope that helps and please don't hesitate to contact me if you have any further questions.
Best, Rhonda
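To illustrate the point about TPM removing the depth signal, here is a minimal sketch in plain Python. The `to_tpm` helper, the toy counts, and the gene lengths are all made up for illustration and are not part of SCnorm. It shows two cells that differ tenfold in sequencing depth becoming indistinguishable once both are rescaled to sum to one million.

```python
def to_tpm(counts, lengths_kb):
    """Hypothetical helper: convert raw counts to TPM for one cell."""
    # Length-correct each gene, then rescale so the cell sums to one million.
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Toy data (assumed values): three genes, two cells with a 10x depth difference.
lengths_kb = [1.0, 2.0, 4.0]
shallow_cell = [10, 40, 20]
deep_cell = [100, 400, 200]

# Before conversion the depth difference is visible in the column totals.
depth_before = (sum(shallow_cell), sum(deep_cell))

# After conversion both cells have the same total and the same values,
# so there is no count-versus-depth relationship left to estimate.
tpm_shallow = to_tpm(shallow_cell, lengths_kb)
tpm_deep = to_tpm(deep_cell, lengths_kb)
depth_after = (sum(tpm_shallow), sum(tpm_deep))

print(depth_before)
print(depth_after)
```

This is why expected counts (for example from RSEM) are suitable input while TPM is not: only the former still carries the per-cell depth information.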
Thanks for the reply! I am wondering whether the following procedure is good practice for downstream analysis, particularly dimension reduction: get normalized counts from SCnorm, log2-transform them, divide each gene's log-normalized counts by its gene length, and then do PCA, etc.
Yes, but you might consider dividing by the length before applying the log. For example, if gene X has twice as many counts as gene Y but gene X is also twice as long, then I would want their value going into the PCA to be equal, which means you'd want to do the length correction prior to the log. Otherwise, that seems fine to me.
-Rhonda
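The gene X / gene Y example above can be checked numerically. This is only a sketch: the function names, the toy counts and lengths, and the pseudocount of 1 (added to avoid taking the log of zero) are assumptions for illustration, not something prescribed in this thread or by SCnorm.

```python
import math

def length_then_log(count, length_kb, pseudocount=1.0):
    """Divide by gene length first, then take log2 (the suggested order)."""
    return math.log2(count / length_kb + pseudocount)

def log_then_length(count, length_kb, pseudocount=1.0):
    """Take log2 first, then divide by gene length (the originally proposed order)."""
    return math.log2(count + pseudocount) / length_kb

# Toy values (assumed): gene X has twice the counts of gene Y and is twice as long.
gene_x = (200.0, 2.0)  # (normalized count, length in kb)
gene_y = (100.0, 1.0)

# Length correction before the log: the two genes get the same value.
x_first = length_then_log(*gene_x)
y_first = length_then_log(*gene_y)

# Length correction after the log: the values no longer match.
x_after = log_then_length(*gene_x)
y_after = log_then_length(*gene_y)

print(x_first, y_first)
print(x_after, y_after)
```

With the length correction applied first, both genes enter the PCA with an identical value; with the log applied first, the longer gene is penalized more heavily than its extra length warrants.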
Thanks!
Just wondering whether different kinds of input measures will affect the results? It is stated in the manual that "Estimates of gene expression are typically obtained using RSEM, HTSeq, Cufflinks, Salmon or similar approaches." It seems that these tools generate different gene expression measures (counts and FPKM/TPM). Is count data the recommended data type to feed in? Also, for the normalized gene expression counts, are there any specific steps recommended before doing downstream analysis such as PCA or differential analysis?