Subsampling search results

Hi rentrez developers,

I have a situation where I want to search for a taxon and a specific gene, and then download only a random subsample of the search results.

search_results <- entrez_search(db = "nuccore", term = "Drosophila[ORGN] AND COI", use_history = TRUE)
length(search_results$ids)
[1] 8832

subsample <- 1000
ids_subsample <- sample(search_results$ids, subsample)

As a subsample of 1000 ids is too large to feed directly into entrez_fetch, the only way I can see to handle this is to use entrez_post to upload the subsampled ids in chunks, and then entrez_fetch to download each chunk.

ids <- sample(search_results$ids, subsample )
chunk_size <- 100
ids_chunked <- split(ids, ceiling(seq_along(ids)/chunk_size))

for (l in seq_along(ids_chunked)) {
  upload <- entrez_post(db="nuccore", id=ids_chunked[[l]])
  dl <- entrez_fetch(db = "nuccore", web_history = upload, rettype = "fasta", retmax = 10000)
  cat(dl, file="out.fa", append=TRUE)
}
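
To make the chunking step concrete, here is a minimal offline sketch of the same split() idiom. The toy_ids values are made up for illustration (they are not real NCBI ids), and nothing here touches the network:

```r
# Toy stand-ins for the sampled ids (hypothetical values, not real NCBI ids)
toy_ids <- as.character(seq_len(250))
chunk_size <- 100

# ceiling(index / chunk_size) labels each id with the number of the chunk it belongs to
toy_chunks <- split(toy_ids, ceiling(seq_along(toy_ids) / chunk_size))

# Number of ids in each chunk; the last chunk holds whatever is left over
chunk_lengths <- lengths(toy_chunks)
print(chunk_lengths)
```

With 1000 ids and a chunk size of 100, the same idiom gives 10 chunks, so the loop above makes 10 entrez_post calls and 10 entrez_fetch calls.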

However, this is quite slow. The documentation for entrez_post seems to suggest that I should be able to append the ids to an existing web_history object, so that the entire web_history object could then be downloaded in a single entrez_fetch call. I tried this with the code below, but in this case entrez_fetch only downloads the last chunk of 100 ids I uploaded:

# Create a new web_history object from the first chunk
upload <- entrez_post(db="nuccore", id=ids_chunked[[1]])
# Add the remaining chunks to the web_history object
for (l in 2:length(ids_chunked)) {
  upload <- entrez_post(db="nuccore", id=ids_chunked[[l]], web_history=upload)
}

dl <- entrez_fetch(db = "nuccore", web_history = upload, rettype = "fasta", retmax = 10000)
cat(dl, file="out.fa", append=FALSE)

Do you have any input on what I am doing wrong here, or suggestions for better ways to do this? For example, can I somehow subset the web_history object directly on the NCBI server without having to post the ids again?

Cheers, Alex

ropensci / rentrez

Subsampling search results #144