saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[BUG] aspera #204

Open NomiCentarix opened 8 months ago

NomiCentarix commented 8 months ago

Following up on my previous issue: I still don't get the FASTQ files with Aspera, only empty folders, when running the following code:

from pysradb.sraweb import SRAweb
SRA_OUR_DIR = "/data/NCBI_data/"
db = SRAweb()
gse_to_srp = db.gse_to_srp("GSE226189")
print("gse_to_srp shape:", gse_to_srp.shape)
display(gse_to_srp.head(2))

metadata = db.sra_metadata(gse_to_srp["study_accession"].to_list(), detailed=True)
print(metadata.shape)
display(metadata.head(2))

db.download(df=metadata.head(1),
            url_col="ena_fastq_http_1",
            use_ascp=True,
            #threads=8,
            skip_confirmation=True,  # don't ask for permission to download
            out_dir=SRA_OUR_DIR)

OS: AWS EC2, Ubuntu 22.04.2 LTS; Anaconda3, Python 3.11.5

When url_col is left at its default, I do get the .sra files. The link in the "ena_fastq_http_1" column looks fine (http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR236/077/SRR23630177/SRR23630177_1.fastq.gz).

saketkc commented 8 months ago

Thanks for catching this. I think there is a bug in the download module. For now, I would recommend saving the metadata to a TSV file with pysradb metadata --detailed <SRP> --saveto x.tsv and then using a tool like curl or wget to download the files listed in the *_url column.
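For anyone who prefers to stay in Python rather than shell out to curl/wget, here is a minimal sketch of that workaround using only the standard library. It assumes the saved file is tab-separated and that the URL columns share the prefix "ena_fastq_http" (as in the column quoted above). The helper names collect_urls and download_all are hypothetical and not part of pysradb. Downloads go over plain HTTP, without Aspera.

```python
import csv
import io
import os
import urllib.request

# Strings that commonly stand in for a missing value in exported tables (assumption).
MISSING = {"", "na", "nan", "<na>", "none"}


def collect_urls(tsv_text, column_prefix="ena_fastq_http"):
    """Return non-missing URLs from every column whose name starts with column_prefix."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    urls = []
    for row in reader:
        for column, value in row.items():
            if column is None or not column.startswith(column_prefix):
                continue
            value = (value or "").strip()
            if value.lower() not in MISSING:
                urls.append(value)
    return urls


def download_all(urls, out_dir):
    """Download each URL into out_dir, keeping the remote file name."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(out_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, target)
        print("saved", target)


# Example usage (performs network access, so it is left commented out):
# with open("x.tsv") as handle:
#     download_all(collect_urls(handle.read()), "/data/NCBI_data/")
```

If the URL columns come back empty, collect_urls simply returns an empty list, so nothing is downloaded.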

NomiCentarix commented 8 months ago

Thanks for the answer. So I should use curl/wget without Aspera, right?

NomiCentarix commented 8 months ago

OK, now I have a strange problem: I no longer get the FASTQ URLs! The columns "ena_fastq_http", "ena_fastq_http" and "ena_fastq_http" are all NA. I tested the code in several environments, with no change. (The data still exists at the same path as before: http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR236/077/SRR23630177/SRR23630177_1.fastq.gz)

Do you have any idea what happened?
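While the metadata columns are NA, one possible stopgap is to rebuild the URL from the run accession. The sketch below assumes the ENA directory layout visible in the link above: the first six characters of the accession, then a zero-padded subdirectory made from any characters beyond the ninth, then the accession itself. The function name ena_fastq_url is hypothetical. The sketch does not check that the file actually exists on the server, and it does not explain why the columns came back empty.

```python
def ena_fastq_url(run_accession, mate=None,
                  base="http://ftp.sra.ebi.ac.uk/vol1/fastq"):
    """Build a candidate ENA FASTQ URL for a run accession (layout is an assumption).

    mate is 1 or 2 for paired-end files, or None for a file without a mate suffix.
    """
    parts = [base, run_accession[:6]]
    extra = run_accession[9:]  # characters beyond the first nine, e.g. "77"
    if extra:
        parts.append(extra.zfill(3))  # "77" -> "077"
    parts.append(run_accession)
    suffix = "" if mate is None else f"_{mate}"
    parts.append(f"{run_accession}{suffix}.fastq.gz")
    return "/".join(parts)


print(ena_fastq_url("SRR23630177", mate=1))
```

For SRR23630177 with mate=1 this reproduces the link quoted above.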