theislab / sfaira

data and model repository for single-cell data
https://sfaira.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Number of datasets significantly changed #735

Closed yiqisu closed 8 months ago

yiqisu commented 1 year ago

Hi,

I tried to count the total number of available datasets with the following code:

import sfaira

# datadir, metadir and cachedir are user-defined local directory paths
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)
len(ds.datasets.items())

I got 994 on March 30, 2023. However, when I reran the code on April 4, 2023, the total had decreased to 307. Could you please explain this significant change? Thanks a lot!

Best, Yiqi

davidsebfischer commented 1 year ago

Hi Yiqi, were you on an up-to-date version of the dev branch on both days? If the run on March 30, 2023 used old code, my guess would be this: throughout last year we restructured some data loaders, so a few may now appear as fewer datasets (note: number of datasets > number of studies). In any case, we only add cells to the database unless we have good reason to remove something, so you should not see a reduction when looking at the number of cells. Our restructuring of old data loaders is done on the dev branch, so changes like that are not likely to happen in the near future.
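To make a comparison like this reproducible, it can help to store what each run saw together with the date and the installed package version. The following is a minimal sketch, not part of sfaira: `snapshot_universe` is a hypothetical helper, and it assumes only that the dataset IDs can be passed in as an iterable of strings.

```python
import json
from datetime import date
from importlib import metadata


def snapshot_universe(dataset_ids, path, package="sfaira"):
    """Save the dataset IDs seen in one run, with date and package version, as JSON."""
    ids = sorted(dataset_ids)
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment
    record = {
        "date": date.today().isoformat(),
        "version": version,
        "n_datasets": len(ids),
        "dataset_ids": ids,
    }
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2)
    return record
```

Assuming `ds.datasets` is dict-like and keyed by dataset ID, a call such as `snapshot_universe(ds.datasets.keys(), "universe_snapshot.json")` would write one record per run. Note that the version string from the package metadata may not distinguish two commits of a cloned branch, so the git commit would have to be noted separately in that case.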

yiqisu commented 1 year ago

Hi David,

I appreciate your explanation and prompt response! I actually cloned the main branch since I do not use the store-cart for now. Can I assume that the data loaders in the main branch were also restructured, or should I switch to the dev branch?

Besides, when I compared the key-value pairs from the two days, I also found some discrepancies. For example, on March 30 there were 12 organisms, such as ambystoma mexicanum, anolis carolinensis, canis lupus familiaris, homo sapiens, and mus musculus, while on April 4 there were only homo sapiens and mus musculus. Could you provide the allowed values for the key features, including organism, organ, and assay_sc? Thanks!

Best, Yiqi

davidsebfischer commented 1 year ago

Thanks for checking, Yiqi. This is probably related to internet access: we did not change the data loaders in that time window, but that difference would map well to the inclusion of the cellxgene library in the data universe, which depends on your internet access. You can subset to sfaira data loaders only (or cellxgene only) to control for this effect!
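One way to check this explanation is to diff the two runs and see whether the missing datasets cluster in particular groups. The sketch below uses plain Python only. `diff_snapshots` is a hypothetical helper, and the IDs and organism labels are made up for illustration; in practice the mappings would be built from whatever metadata the two runs exposed.

```python
from collections import Counter


def diff_snapshots(before, after):
    """Compare two {dataset_id: organism} mappings taken on different days."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    # Count the missing datasets per organism to see whether whole groups vanished.
    missing_by_organism = dict(Counter(before[i] for i in missing))
    return missing, added, missing_by_organism


# Hypothetical IDs and annotations, for illustration only.
march_30 = {
    "id_a": "Homo sapiens",
    "id_b": "Mus musculus",
    "id_c": "Canis lupus familiaris",
}
april_4 = {"id_a": "Homo sapiens", "id_b": "Mus musculus"}

missing, added, by_organism = diff_snapshots(march_30, april_4)
print(missing)      # ['id_c']
print(added)        # []
print(by_organism)  # {'Canis lupus familiaris': 1}
```

If all missing IDs turn out to belong to one source, that would be consistent with a source being unavailable in the second run rather than with data loaders having been removed.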

yiqisu commented 1 year ago

I appreciate your explanation, David! I'll keep tracking.