saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[BUG] when one 'sra_url' is missing #175

Closed z5ouyang closed 1 year ago

z5ouyang commented 1 year ago

Describe the bug
Using "from pysradb.sraweb import SRAweb" in Python (the command line gives an error as well): when one sample's sra_url is missing, an uninterpretable error is produced:

File ".../python3.8/site-packages/pysradb/download.py", line 67, in get_file_size
  if url.startswith("ftp."):
AttributeError: 'NAType' object has no attribute 'startswith'

I finally identified the problem: one entry (SRR5617660) in the metadata table downloaded through db.sra_metadata("SRP093683",detailed=True) contains 'NA' in the 'sra_url' column.

To Reproduce
Steps to reproduce the behavior:

import os
from pysradb.sraweb import SRAweb
db = SRAweb()
df = db.sra_metadata('SRP093683',detailed=True)
db.download(df=df,skip_confirmation=True,out_dir=os.getcwd())

Desktop (please complete the following information):

Additional context
Maybe check the 'sra_url' column, produce a warning message if there is any NA, and continue with the others?
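
The check suggested above could look roughly like the sketch below. It is not pysradb's actual fix: `split_by_url` is a hypothetical helper, the accessions and URLs are made up, and it works on plain (run, URL) pairs rather than on the DataFrame directly. It treats anything that is not a non-empty string (including `None`, NaN and pandas' NA) as a missing URL, warns once, and keeps the rest.

```python
import warnings


def split_by_url(records):
    """Separate (run_accession, url) pairs into downloadable and skipped runs.

    Hypothetical helper, not part of pysradb. A URL counts as usable only if
    it is a non-empty string other than "NA"; None, float("nan") and
    pandas.NA are rejected because none of them is a str.
    """
    usable, skipped = [], []
    for run, url in records:
        if isinstance(url, str) and url.strip() and url.strip() != "NA":
            usable.append((run, url))
        else:
            skipped.append(run)
    if skipped:
        # Warn instead of raising, so the remaining runs can still be processed.
        warnings.warn(
            "No download URL for: " + ", ".join(skipped) + "; skipping these runs."
        )
    return usable, skipped


# Made-up rows; the second run mimics the entry with a missing 'sra_url'.
records = [
    ("SRR0000001", "https://example.org/SRR0000001"),
    ("SRR0000002", None),
]
usable, skipped = split_by_url(records)
```

With the metadata DataFrame, the pairs could presumably be built with `zip(df["run_accession"], df["sra_url"])`, assuming those are the column names in the returned table.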

saketkc commented 1 year ago

Thanks. I know this is a bit annoying, but for now I would recommend saving the metadata to a file and using a tool like wget or curl to download the files:

$ pysradb metadata --detailed SRP093683 --saveto SRP093683.tsv
$ cat SRP093683.tsv | cut -f 27

trinidadmartin commented 1 year ago

Hello! I get a different error for the same commands:

To reproduce:

from pysradb.sraweb import SRAweb
db = SRAweb()
df = db.sra_metadata("SRP093683", detailed=True)
dataset=db.download(df=df, skip_confirmation=True)

Error

No URL column is found. You may wish to re-run your query with either pysradb metadata --detailed or pysradb search -v 3
Generating default download URL for each run accession...

The supplied url column "sra_url" cannot be found.

Using recommended_url instead.

Checking download URLs
Key error for: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR561/SRR5617638/SRR5617638.sra
Key error for: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR561/SRR5617639/SRR5617639.sra
Key error for: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR561/SRR5617640/SRR5617640.sra
.....

None of them are valid.... How can I use the public_url provided instead?

OS: macOS Big Sur Version 11.1
python: Python 3.9.1

Thanks a lot and have a lovely day! Trini

saketkc commented 1 year ago

@TrinidadMartin For now I would recommend extracting the links from public_url and using wget to download, something like this:

pysradb metadata --detailed SRP093683 --saveto SRP093683.tsv
cut -f 25 SRP093683.tsv | sed '1d' > urls.txt
wget -c -i urls.txt
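
The column position differs between the two snippets in this thread (27 and 25), so selecting the column by its header name may be less fragile than a fixed index. A minimal sketch using only the standard library follows. It assumes the saved TSV has a header row containing `public_url` (the column name mentioned above); `extract_urls` and `write_url_list` are hypothetical helpers, not part of pysradb.

```python
import csv


def extract_urls(tsv_path, column="public_url"):
    """Return the non-empty, non-"NA" values of `column` from a TSV file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {tsv_path}")
        urls = []
        for row in reader:
            # Short rows yield None for missing fields, hence the `or ""`.
            value = (row[column] or "").strip()
            if value and value != "NA":
                urls.append(value)
        return urls


def write_url_list(tsv_path, out_path="urls.txt", column="public_url"):
    """Write one URL per line to `out_path` and return the list of URLs."""
    urls = extract_urls(tsv_path, column)
    with open(out_path, "w") as out:
        out.write("\n".join(urls) + ("\n" if urls else ""))
    return urls
```

Calling `write_url_list("SRP093683.tsv")` would then produce a `urls.txt` that can be passed to `wget -c -i urls.txt` as above.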

trinidadmartin commented 1 year ago

Thanks for the super fast answer!!

saketkc commented 1 year ago

This is now fixed in the develop branch.