theislab / scib-reproducibility

Additional code and analysis from the single-cell integration benchmarking project
https://theislab.github.io/scib-reproducibility/
MIT License

Number of genes in human immune cells dataset #24

Open Yanay1 opened 1 year ago

Yanay1 commented 1 year ago

Hello,

In this file, https://github.com/theislab/scib-reproducibility/blob/main/notebooks/data_preprocessing/immune_cells/merging/Merging_all_human.ipynb

after merging the human-only datasets together, the number of genes is only 12,003. This seems low to me, since each dataset has over 20,000 genes by itself. Do you know why this might be the case?

LuckyMD commented 8 months ago

Hi @Yanay1,

This is actually an expected number. An unfiltered gene expression matrix can have anywhere between 25k and 55k features, depending on which version of the genome or transcriptome the reads are aligned to. For the number of datasets being merged in the immune task, it's quite common that only around 10-15k genes are present in all of them. The other genes are probably not shared by all datasets. Note that when we merge we do an inner join, so that we don't have to assign 0 counts to genes when we don't actually know whether that is true.
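
To illustrate why an inner join shrinks the gene count, here is a minimal sketch using plain Python sets. The dataset labels and the `GENE_*_ONLY` names are made up for illustration; this is not the code from the merging notebook, only the same idea on toy data.

```python
# Toy gene lists per dataset. Labels and the GENE_*_ONLY names are
# hypothetical; real datasets would have tens of thousands of genes each.
datasets = {
    "dataset_a": {"CD3E", "CD4", "CD8A", "MS4A1", "GENE_A_ONLY"},
    "dataset_b": {"CD3E", "CD4", "CD8A", "MS4A1", "GENE_B_ONLY"},
    "dataset_c": {"CD3E", "CD4", "MS4A1", "GENE_C_ONLY"},
}


def inner_join_genes(gene_sets):
    """Return the genes present in every dataset (set intersection)."""
    return set.intersection(*gene_sets.values())


def outer_join_genes(gene_sets):
    """Return the genes present in at least one dataset (set union)."""
    return set.union(*gene_sets.values())


shared = inner_join_genes(datasets)
all_genes = outer_join_genes(datasets)

for name, genes in datasets.items():
    print(f"{name}: {len(genes)} genes")
print(f"inner join: {len(shared)} genes -> {sorted(shared)}")
print(f"outer join: {len(all_genes)} genes")

# With an outer join, a gene missing from a dataset would have to be
# filled in (typically with 0), even though it was never measured there.
missing_in_c = all_genes - datasets["dataset_c"]
print(f"genes that would be zero-filled in dataset_c: {sorted(missing_in_c)}")
```

Every additional dataset can only keep or reduce the intersection, so the shared gene set gets smaller as more datasets are merged. In AnnData, the same choice is exposed as `join="inner"` versus `join="outer"` in `anndata.concat`; whether the notebook uses that exact call is an assumption here.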