saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[ENH] Support GEO downloads #120

saketkc closed this issue 3 years ago

saketkc commented 3 years ago

Is your feature request related to a problem? Please describe.

pysradb currently only allows downloading raw (or mapped) sequence data (.sra/.fastq/.bam). It should also support downloading datasets from GEO.

Describe the solution you'd like

pysradb download -p GSE111108

should download all processed files associated with GSE111108; in this case, that would be GSE111108_RAW.tar.

saketkc commented 3 years ago

Example R code I wrote that needs to be ported to Python:

library(httr)
library(XML)
library(stringr)

FetchGEOFiles <- function(gse, download.dir=getwd(), download.files=FALSE, ... ){
    url.prefix <- "https://ftp.ncbi.nlm.nih.gov/geo/series/"

    gse_prefix <- paste0(substr(gse,1,nchar(gse)-3), "nnn")

    url <- paste0(url.prefix, gse_prefix, "/", gse, "/", "suppl", "/")
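    response <- GET(url)
    stop_for_status(response)  # fail early if the directory listing is unavailable
    # Parse the directory listing from the response body text
    html_parsed <- htmlParse(content(response, as = "text", encoding = "UTF-8"), asText = TRUE)
    links <- xpathSApply(html_parsed, "//a/@href")
    # Supplementary files start with "G" (e.g. GSE..._RAW.tar); this drops navigation links
    suppl_files <- as.character(grep("^G", links, value = TRUE))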

    file.url <- paste0(url, suppl_files)
    file_list <- data.frame(filename=suppl_files, url=file.url)

    if (download.files){
        names(file.url) <- suppl_files
        download_file <- function(url, filename, ...){
            message(paste0("Downloading ", filename, " to ", download.dir))
            download.file(url = url, destfile = file.path(download.dir, filename), mode="wb", ...)
            message("Done!")
        }
        lapply(seq_along(file.url), function(y, n, i) { download_file(y[[i]], n[[i]], ...) }, y=file.url, n=names(file.url))
    }

    return (file_list)
}

FetchGEOFiles("GSE132044", download.files = TRUE)
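
A rough sketch of what the Python port could look like, using only the standard library. The function names (geo_suppl_url, parse_suppl_files, fetch_geo_files) are placeholders I made up for illustration, not existing pysradb API, and the handling of short accessions (fewer than four digits mapping to "GSEnnn") is an assumption about the FTP layout rather than something the R code covers:

```python
import re
import urllib.request
from html.parser import HTMLParser
from pathlib import Path

GEO_SERIES_URL = "https://ftp.ncbi.nlm.nih.gov/geo/series/"


def geo_suppl_url(gse):
    """Build the supplementary-files directory URL for a GSE accession."""
    match = re.fullmatch(r"GSE(\d+)", gse)
    if match is None:
        raise ValueError(f"not a GSE accession: {gse!r}")
    digits = match.group(1)
    # Same idea as the R code: replace the last three digits with "nnn".
    # Assumption: accessions with three or fewer digits live under "GSEnnn".
    prefix = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_URL}{prefix}/{gse}/suppl/"


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def parse_suppl_files(html):
    """Return links that start with 'G', mirroring grep('^G', links) in R."""
    collector = _HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if href.startswith("G")]


def fetch_geo_files(gse, download_dir=".", download=False):
    """Return {filename: url} for a series, optionally downloading each file."""
    url = geo_suppl_url(gse)
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    files = {name: url + name for name in parse_suppl_files(html)}
    if download:
        target = Path(download_dir)
        target.mkdir(parents=True, exist_ok=True)
        for name, file_url in files.items():
            print(f"Downloading {name} to {target}")
            urllib.request.urlretrieve(file_url, str(target / name))
    return files
```

Calling fetch_geo_files("GSE132044", download=True) would then be the equivalent of the R call above. Keeping URL construction and link parsing in separate functions means both can be tested without network access.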

saketkc commented 3 years ago

Additionally, it should be able to get the contents of the ".tar" files that are sometimes uploaded. For example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161707
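
For an archive that has already been downloaded, listing the contents could be as simple as the sketch below. It uses the standard-library tarfile module; list_tar_members is a hypothetical helper name, not something pysradb currently provides:

```python
import tarfile


def list_tar_members(tar_path):
    """Return (name, size in bytes) for each regular file in a tar archive."""
    # Mode "r" (the default) detects gzip/bz2/xz compression transparently.
    with tarfile.open(tar_path) as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

This only reads the archive index and does not extract anything, so it should be cheap even for large *_RAW.tar files.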

DevangThakkar commented 3 years ago

Hey Saket, I wanted to confirm that your use of the argument -p was intentional, since that is already used to download projects by SRP ID. In theory that kind of makes sense, but I wanted to check because it would entail some restrictions on other arguments such as -x/--srx.

saketkc commented 3 years ago

That's a good point @DevangThakkar! It should instead be slightly more reflective of GEO (though it is possible to easily figure out if it is indeed a GEO id internally). But to keep it consistent with the current CLI (which requires -p) it is probably best to have:

pysradb download -g GSE111108
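
For reference, figuring out internally whether an ID is a GEO series could look roughly like this. The accession_kind helper and its return labels are made up for illustration; the patterns only cover GSE series and SRP/ERP/DRP project accessions:

```python
import re

# Assumed accession shapes: GEO series (GSE...) and SRA/ENA/DDBJ projects.
_PATTERNS = {
    "geo": re.compile(r"GSE\d+"),
    "sra": re.compile(r"[SED]RP\d+"),
}


def accession_kind(accession):
    """Classify an accession as 'geo', 'sra', or 'unknown'."""
    normalized = accession.strip().upper()
    for kind, pattern in _PATTERNS.items():
        if pattern.fullmatch(normalized):
            return kind
    return "unknown"
```

A check like this could also be used to reject a GSE ID passed to -p (or an SRP ID passed to -g) with a clear error message.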

saketkc commented 3 years ago

Closed via #129