Data format for bulk RNA-seq with 10X UMI data as reference: counts, TPM or FPKM?

poseidonchan / TAPE

Deep learning-based tissue compositions and cell-type-specific gene expression analysis with tissue-adaptive autoencoder (TAPE)

https://sctape.readthedocs.io/

GNU General Public License v3.0


Data format for bulk RNA-seq with 10X UMI data as reference: counts, TPM or FPKM? #13

Open Junjie-Hu opened 1 year ago

Junjie-Hu commented 1 year ago

Hi, after reading the tutorials carefully, I am still confused about how to prepare the input data. In most cases, users want to estimate cell-type fractions from tumor bulk RNA-seq data using 10X data as the reference. On the website, the author describes setting datatype='counts', so is sc_ref the UMI matrix of the 10X data? For bulkdata, should we use counts, TPM, or FPKM data? Could you please add an example to the usage page? For instance, deconvolution of a bulk PBMC dataset with 10X single-cell PBMC data.

Junjie-Hu commented 1 year ago

Should bulk TPM or FPKM data be log2 transformed?

poseidonchan commented 1 year ago

Hi Junjie,

Sorry for the late reply. The sc_ref argument takes the single-cell reference data, whether it comes from 10X or another sequencing technology. For the datatype argument, I suggest using the default 'counts' setting regardless of the bulk data type. Any further questions are welcome!

Regards, Yanshuo

poseidonchan commented 1 year ago

"Should bulk TPM or FPKM data be log2 transformed?"

Raw TPM or FPKM data is better than log-transformed data.
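If a bulk matrix has already been log-transformed, the advice above implies converting it back to the raw scale before deconvolution. The following is a minimal sketch of that step, not part of TAPE itself. It assumes the values were stored as log2(x + 1), which is a common convention but not confirmed in this thread. The helper names `undo_log2` and `looks_log_transformed` are hypothetical, and the threshold of 50 is only a rough heuristic.

```python
import math


def undo_log2(values, pseudocount=1.0):
    """Invert y = log2(x + pseudocount) for a list of expression values.

    Assumption: the data were transformed with this exact pseudocount.
    Small negative results caused by rounding are clamped to zero.
    """
    return [max(2.0 ** y - pseudocount, 0.0) for y in values]


def looks_log_transformed(values, threshold=50.0):
    """Rough heuristic: a maximum below the threshold suggests log-scale data.

    Raw TPM/FPKM matrices usually contain much larger values, but this is
    only a hint and should be checked against how the data were produced.
    """
    return max(values) < threshold


# Toy example: three raw expression values, log-transformed and restored.
raw = [0.0, 3.0, 1023.0]
logged = [math.log2(x + 1.0) for x in raw]
restored = undo_log2(logged) if looks_log_transformed(logged) else logged
```

In practice the same inversion would be applied to every entry of the bulk expression table before it is passed as bulkdata.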