theislab / sfaira

data and model repository for single-cell data
https://sfaira.readthedocs.io
BSD 3-Clause "New" or "Revised" License
Download link problems: homosapiens_None_2021_None_renxianwen_001_10.1016/j.cell.2021.01.053 #430

Open le-ander opened 2 years ago

le-ander commented 2 years ago

homosapiens_None_2021_None_renxianwen_001_10.1016/j.cell.2021.01.053 cannot be downloaded with sfaira because the download link is a Google Drive link, which does not support programmatic download.

If this data was shared with the authors to be added to sfaira and made publicly available, we should probably ask them to move it somewhere else (e.g. figshare) from where it can be downloaded programmatically. If the data is not meant to be shared with the public, we should drop the download link from the dataloader.

@lauradmartens or @davidsebfischer could you add some insight here? :)

lauradmartens commented 2 years ago

It's public data, but they use Google Drive to store it. If I remember correctly, I used gdown (https://github.com/wkentaro/gdown) to download it programmatically.
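For reference, a minimal sketch of how a downloader could route Google Drive links through gdown and fall back to a plain HTTP fetch otherwise. This is not sfaira's actual download code: the helper names `drive_file_id` and `download` are hypothetical, and the two URL shapes handled here are an assumption about how the share link looks.

```python
import re
from typing import Optional

# Two common shapes of Google Drive share links (assumed; others exist).
_DRIVE_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)


def drive_file_id(url: str) -> Optional[str]:
    """Return the file ID embedded in a Google Drive link, or None."""
    if "drive.google.com" not in url:
        return None
    for pattern in _DRIVE_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


def download(url: str, output: str) -> str:
    """Hypothetical helper: fetch `url` into the local path `output`."""
    file_id = drive_file_id(url)
    if file_id is not None:
        # Third-party dependency, imported lazily: pip install gdown
        import gdown

        gdown.download(
            url=f"https://drive.google.com/uc?id={file_id}",
            output=output,
            quiet=False,
        )
    else:
        # Plain links need nothing beyond the standard library.
        import urllib.request

        urllib.request.urlretrieve(url, output)
    return output
```

Importing gdown lazily keeps it an optional dependency that is only needed for dataloaders that actually point at Google Drive.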

le-ander commented 2 years ago

Thanks for the info, Laura! :) Technically, we could expand the download capabilities of sfaira to handle gdrive links, though I'm not sure how common this is.

I just saw that the data in the h5ad they provide in the gdrive is not actually raw counts but log-normalised. It looks like they provide the raw counts here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158055. Do you think you could update the dataloader to use the raw counts? In that case the automatic download would be solved as well. Thanks a lot in advance!

lauradmartens commented 2 years ago

Good point! I just checked again and we do have raw counts in adata.raw from the gdrive file, but I can change it to the GEO files if that is more convenient :) However, we would then lose the default embedding etc.
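A quick way to sanity-check which slot holds raw counts is to test whether the stored values are non-negative integers. The sketch below is only a heuristic, and `looks_like_raw_counts` is a hypothetical helper, not part of sfaira or anndata.

```python
from typing import Iterable


def looks_like_raw_counts(values: Iterable[float], tol: float = 1e-6) -> bool:
    """Heuristic: raw counts are non-negative and integer-valued.

    Returns False for an empty input, since nothing can be concluded.
    """
    seen_any = False
    for value in values:
        seen_any = True
        if value < 0 or abs(value - round(value)) > tol:
            return False
    return seen_any


# Possible usage, assuming `adata` is an AnnData object whose matrices are
# scipy sparse (so `.data` holds the non-zero entries):
#   looks_like_raw_counts(adata.X.data)      # expected False if log-normalised
#   looks_like_raw_counts(adata.raw.X.data)  # expected True if raw counts
```

If `adata.raw` does hold the counts, `adata.raw.to_adata()` returns them as a standalone AnnData object.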

le-ander commented 2 years ago

Oooh, I did not check for adata.raw #retro @davidsebfischer what's your take? Switch to GEO and lose some metadata, or add gdrive download support?

davidsebfischer commented 2 years ago

I'd go for GEO in this case (as we have cell annotation in both); it's the more permanent store and we can live without the embedding. But maybe leave the gdrive link in a comment in the accompanying text file.

le-ander commented 2 years ago

Alright, could you do that, Laura? :)

lauradmartens commented 2 years ago

Yes!