Closed saketkc closed 3 years ago
Example R code I wrote that needs to be ported to Python:
library(httr)
library(XML)

# List (and optionally download) the supplementary files of a GEO series.
FetchGEOFiles <- function(gse, download.dir = getwd(), download.files = FALSE, ...) {
  url.prefix <- "https://ftp.ncbi.nlm.nih.gov/geo/series/"
  # GEO groups series into directories such as GSE132nnn
  gse_prefix <- paste0(substr(gse, 1, nchar(gse) - 3), "nnn")
  url <- paste0(url.prefix, gse_prefix, "/", gse, "/", "suppl", "/")
  response <- GET(url)
  # Parse the directory listing from the response body and collect all link targets
  html_parsed <- htmlParse(content(response, as = "text", encoding = "UTF-8"), asText = TRUE)
  links <- xpathSApply(html_parsed, "//a/@href")
  # Supplementary files start with "G" (e.g. GSE132044_...)
  suppl_files <- as.character(grep("^G", links, value = TRUE))
  file.url <- paste0(url, suppl_files)
  file_list <- data.frame(filename = suppl_files, url = file.url)
  if (download.files) {
    names(file.url) <- suppl_files
    # Download a single file into download.dir
    download_file <- function(url, filename, ...) {
      message(paste0("Downloading ", filename, " to ", download.dir))
      download.file(url = url, destfile = file.path(download.dir, filename), mode = "wb", ...)
      message("Done!")
    }
    lapply(seq_along(file.url), function(y, n, i) { download_file(y[[i]], n[[i]], ...) }, y = file.url, n = names(file.url))
  }
  return(file_list)
}

FetchGEOFiles("GSE132044", download.files = TRUE)
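For reference, a rough Python sketch of the listing step in the R code above, using only the standard library. The names geo_suppl_url, parse_suppl_links and fetch_geo_file_list are made up for illustration and are not part of pysradb; this is only meant to show the shape of a port, not a final implementation.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

GEO_FTP_PREFIX = "https://ftp.ncbi.nlm.nih.gov/geo/series/"


def geo_suppl_url(gse):
    """Build the supplementary-files directory URL for a GSE accession."""
    # Mirror the R code: replace the last three characters with "nnn".
    stub = gse[:-3] + "nnn"
    return f"{GEO_FTP_PREFIX}{stub}/{gse}/suppl/"


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_suppl_links(html, base_url):
    """Return (filename, url) pairs for links whose target starts with 'G'."""
    collector = _HrefCollector()
    collector.feed(html)
    return [(h, base_url + h) for h in collector.hrefs if h.startswith("G")]


def fetch_geo_file_list(gse):
    """Fetch the directory listing over HTTP and parse it (needs network)."""
    url = geo_suppl_url(gse)
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_suppl_links(html, url)
```

Calling fetch_geo_file_list("GSE132044") would correspond to FetchGEOFiles("GSE132044") with download.files = FALSE. The download step is left out here; urllib.request.urlretrieve on each returned URL would be one way to add it.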
Additionally, it should be able to get the contents of the ".tar" files that are uploaded sometimes. For example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161707
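Looking inside a downloaded ".tar" could be done with Python's built-in tarfile module. A minimal sketch, assuming the archive has already been saved locally; list_tar_members is a hypothetical helper name, not an existing pysradb function.

```python
import tarfile


def list_tar_members(tar_path):
    """Return the names of the regular files stored in a tar archive."""
    with tarfile.open(tar_path) as archive:
        # Skip directories and other non-file entries.
        return [member.name for member in archive.getmembers() if member.isfile()]
```

This only reads the archive index, so nothing is extracted to disk.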
Hey Saket, I wanted to confirm that your use of the argument -p was intentional, since that is already being used to download projects based on SRP ID. Theoretically, that kind of makes sense, but I just wanted to confirm, since that would entail some restrictions on other arguments such as -x\--srx.
That's a good point @DevangThakkar! It should instead be slightly more reflective of GEO (though it is possible to easily figure out internally whether it is indeed a GEO ID). But to keep it consistent with the current CLI (which requires -p), it is probably best to have:
pysradb download -g GSE111108
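On the point that a GEO ID can be recognized internally: one possible check is a simple pattern match on the accession. This is only a sketch; is_geo_series_id is a hypothetical name, and it assumes GEO series accessions are always "GSE" followed by digits.

```python
import re

# Assumption: a GEO series accession is "GSE" followed by one or more digits.
_GSE_PATTERN = re.compile(r"GSE\d+")


def is_geo_series_id(accession):
    """Return True if the string looks like a GEO series (GSE) accession."""
    return _GSE_PATTERN.fullmatch(accession.strip().upper()) is not None
```

With something like this, is_geo_series_id("GSE111108") is True and is_geo_series_id("SRP123456") is False, so a separate flag would mostly be a matter of CLI clarity.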
Closed via #129
Is your feature request related to a problem? Please describe.
pysradb currently only allows downloading raw (or mapped) sequence data (.sra/.fastq/.bam). It should also support downloading datasets from GEO.
Describe the solution you'd like
should download all processed files associated with GSE111108; in this case, that would be
GSE111108_RAW.tar