theislab / scib

Benchmarking analysis of data integration tools
MIT License

Question on publicly available reprocessed datasets #378

Closed kostaslazaros closed 1 year ago

kostaslazaros commented 1 year ago

Hello there,

I'm interested in using the reprocessed benchmark datasets that were used in the scIB paper. I found the .h5ad files at the following link, which was provided in the paper:

I read the pancreas .h5ad file into an AnnData object using scanpy. I noticed that the matrix in anndata.X differs from the counts matrix stored in anndata.layers['counts'].

What is the difference between the two? Is one the raw count matrix and the other a processed version of it (scaled and log-transformed)?

Thanks in advance!!

LuckyMD commented 1 year ago

Copied from the e-mail:

Hi @kostaslazaros,

You can check the reproducibility notebooks for each dataset to see what was done to generate the figshare data. These are at [link]. It should also be explained in the methods section of the paper. In short, the .layers['counts'] data is count data, or as close to count data as we can generate from full-length protocols (e.g., CEL-seq). For full-length data, TPMs are usually stored there. In adata.X you will find the log-normalized data. Normalization is done via scran, and then scanpy's log1p function is applied.
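
To make the relationship between the two matrices concrete, here is a minimal sketch on a toy matrix, using only NumPy. It is not the pipeline from the reproducibility notebooks: the toy `counts` array, the helper names `looks_like_counts` and `log_normalize`, and the simple library-size factors are all assumptions for illustration (the actual data were normalized with scran size factors).

```python
import numpy as np

# Toy count matrix (cells x genes); stands in for adata.layers['counts'].
counts = np.array([
    [0, 3, 10, 1],
    [2, 0, 4, 0],
    [5, 5, 20, 2],
], dtype=float)


def looks_like_counts(matrix):
    """Heuristic: non-negative, integer-valued entries suggest raw counts."""
    matrix = np.asarray(matrix)
    return bool(np.all(matrix >= 0) and np.allclose(matrix, np.round(matrix)))


def log_normalize(counts):
    """Divide each cell by a size factor, then apply log1p.

    ASSUMPTION: library-size factors (cell total / mean cell total) are a
    simple stand-in here; the datasets themselves used scran size factors.
    """
    totals = counts.sum(axis=1)
    size_factors = totals / totals.mean()
    return np.log1p(counts / size_factors[:, None])


x = log_normalize(counts)  # stands in for adata.X

print(looks_like_counts(counts))  # True
print(looks_like_counts(x))       # False
```

On the real file, the same check could be applied to `adata.layers['counts']` and `adata.X` after loading with `scanpy.read_h5ad`; if the matrices are stored as sparse, they would likely need converting to dense (or checking via their `.data` attribute) first. Note that for the full-length batches where TPMs are stored in the counts layer, the integer check is expected to fail even though that layer is the un-logged one.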