theislab / sfaira

data and model repository for single-cell data
https://sfaira.readthedocs.io
BSD 3-Clause "New" or "Revised" License

fix single cell portal download links and paths #704

Closed le-ander closed 1 year ago

le-ander commented 1 year ago

@felix0097 I stumbled across some data loaders in sfaira which rely on data from the Broad Single Cell Portal. Given that downloads from there cannot be done by sfaira automatically (the portal requires a manual login), this needs to be stated in the download URL. Otherwise sfaira will crash when it's asked to download data for these loaders.

I have also adapted the loader paths, as it seems these relied on manually created subdirectories. I guess we want all files to sit in the top-level data directory.

Do these changes look good to you, or did I miss something here?
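The failure mode described above (sfaira crashing when it tries to auto-download a login-gated file) could be guarded against by checking for a private marker in the URL before attempting the download. The sketch below is hypothetical and does not use sfaira's actual API; the `private,` prefix convention, function names, and error message are all assumptions for illustration.

```python
# Hypothetical sketch (not sfaira's real downloader): URLs prefixed with
# "private," require a manual login, so we fail early with an actionable
# message instead of crashing mid-download.

PRIVATE_PREFIX = "private,"


def is_private_url(url: str) -> bool:
    """Return True if the URL is marked as requiring manual download."""
    return url.startswith(PRIVATE_PREFIX)


def download(url: str, target_dir: str) -> str:
    """Download `url` into `target_dir`, refusing private URLs up front."""
    if is_private_url(url):
        real_url = url[len(PRIVATE_PREFIX):]
        raise RuntimeError(
            f"{real_url} requires a manual login (e.g. the Broad Single "
            f"Cell Portal). Please download the file yourself and place "
            f"it in {target_dir}."
        )
    # ... automatic download (requests/wget) would happen here ...
    return target_dir
```

With this guard in place, a loader whose YAML carries a `private,`-prefixed URL raises immediately with instructions, rather than failing opaquely inside the download machinery.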

felix0097 commented 1 year ago

Looks good to me, but I didn't test the data loaders myself.

davidsebfischer commented 1 year ago

With these, I tried to put the bulk download link for the collection on SCP in as the download link, to make manual download easier (there is a button near the top of that page where you can download all data, but I couldn't link to it directly), and then point to the decompressed files in the loader. I think this makes sense in this case, as we won't get around private downloads for this portal, i.e. you only have to visit the site of each collection once and press one button. So I would keep it as is, but prefix `private` to the existing URL, and maybe add this to the data usage documentation on Read the Docs?

le-ander commented 1 year ago

Ah I see! I hadn't noticed the bulk download button before. I'd be a bit reluctant to introduce an entirely new download workflow for the SCP, as it's not a major source of datasets for us. I feel like we're buying potential ease of use with quite a lot of potential confusion if the download procedure is no longer self-explanatory but instead has special cases that are documented separately. Additionally, we usually don't require all the files from the SCP, and running wget on, say, up to 6 download links seems feasible, as these datasets are a bit of an edge case anyway. So I'm not sure introducing this extra complexity is worth it, but I'm of course happy to be convinced otherwise.
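The per-file fallback described above can be sketched as generating a handful of `wget` commands, one per loader URL. This is purely illustrative: the URLs below are placeholders, not real SCP links, and the data directory path is an assumption.

```python
# Hypothetical sketch of the manual fallback: build the wget commands a
# user would run to fetch the few files a single SCP loader needs.
# The URLs and data directory are placeholders.

urls = [
    "https://example.org/scp/matrix.mtx.gz",
    "https://example.org/scp/barcodes.tsv.gz",
    "https://example.org/scp/genes.tsv.gz",
]
data_dir = "/path/to/datadir"

# wget -P drops each file directly into the top-level data directory,
# matching the flattened loader paths from this PR.
commands = [f"wget -P {data_dir} {url}" for url in urls]
print("\n".join(commands))
```

For up to half a dozen files this stays a copy-paste-sized job, which is the point being made: the edge case is small enough that no dedicated SCP workflow seems needed.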