theislab / scib

Benchmarking analysis of data integration tools
MIT License

Question on publicly available reprocessed datasets #378

Closed kostaslazaros closed 1 year ago

kostaslazaros commented 1 year ago

Hello there,

I'm interested in using the reprocessed benchmark datasets from the scIB paper. I found the .h5ad files at the following link, which was provided in the paper: https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cellgenomics-_integration_task_datasets_Immune_andpancreas/12420968

I read the pancreas .h5ad into an AnnData object using scanpy and noticed that the count matrix (anndata.X) is different from the counts matrix stored in the layers['counts'] slot (i.e., anndata.layers['counts']).
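For reference, this is roughly how I'm loading and comparing the two matrices (a minimal sketch; the filename is just what I called the downloaded pancreas .h5ad locally):

```python
import numpy as np
import scipy.sparse as sp
import scanpy as sc

# Assumed local filename for the pancreas dataset downloaded from figshare
adata = sc.read_h5ad("pancreas.h5ad")

def dense_block(m):
    """Return a small dense block regardless of sparse/dense storage."""
    return m.toarray() if sp.issparse(m) else np.asarray(m)

print(dense_block(adata.X[:3, :3]))                 # non-integer values
print(dense_block(adata.layers["counts"][:3, :3]))  # looks like counts
```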

What is the difference between the two? Is one the raw counts matrix and the other a processed version of it (e.g., scaled and logarithmized)?

Thanks in advance!!

LuckyMD commented 1 year ago

Copied from the e-mail:

Hi @kostaslazaros,

You can check the reproducibility notebooks for each dataset to see what was done to generate the figshare data. These are at github.com/theislab/scib-reproducibility. It should also be explained in the methods section of the paper. In short, .layers['counts'] holds count data, or as close to count data as we can generate from full-length protocols (e.g., CEL-seq); for full-length data, TPMs are usually stored there. In adata.X you will find the log-normalized data: normalization is done via scran, and then scanpy's log1p function is applied.
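For intuition, here is a minimal sketch of those two steps on the scanpy side, assuming scran-derived size factors (computed in R) are already stored in adata.obs["size_factors"]; the exact code used for each dataset is in the reproducibility notebooks, not this snippet.

```python
import numpy as np
import scipy.sparse as sp
import scanpy as sc

def scran_like_normalize(adata, size_factor_key="size_factors"):
    """Divide counts by per-cell size factors, then log1p.

    Assumes scran size factors were computed beforehand (e.g., via R/rpy2)
    and stored in adata.obs[size_factor_key]. This is only a sketch of the
    steps described above.
    """
    adata.layers["counts"] = adata.X.copy()        # preserve the count data
    sf = np.asarray(adata.obs[size_factor_key], dtype=float)
    if sp.issparse(adata.X):
        adata.X = sp.diags(1.0 / sf) @ adata.X     # per-cell scaling
    else:
        adata.X = adata.X / sf[:, None]
    sc.pp.log1p(adata)                             # log-normalized data ends up in adata.X
```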