ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Searching pubmed by secondary id [SI] #102

Closed: chackoge closed this issue 7 years ago

chackoge commented 7 years ago

I'd like to use the method of Huser et al. (http://dx.doi.org/10.1371/journal.pone.0068409) with rentrez to submit a list of uids (secondary IDs, [SI]) from the National Clinical Trials database and get back the linked pmids, e.g.,

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=NCT00000419[si]

entrez_search works fine for a single term

entrez_search(db="pubmed",term="NCT00137423[SI]")

but if I submit a vector of search terms, e.g.,

nct_list <- entrez_search(db="pubmed",term=d1$V1)

I get the following error

Error in vapply(elements, encode, character(1)) : values must be length 1, but FUN(X[[2]]) result is length 44

presumably because there are multiple pmids for each nct_id? Is there a way to get around this issue without a for loop? Thanks

dwinter commented 7 years ago

Hi @chackoge ,

The term has to be a length-one character vector. If it made sense, you could paste the IDs into a single term (something like)

SIs <- c("NCT00137423", "NCT00137424", "NCT00137425")
sterm  <- paste(paste0(SIs, "[SI]"), collapse=" OR ")
sterm
[1] "NCT00137423[SI] OR NCT00137424[SI] OR NCT00137425[SI]"

But that would lose the mapping from search term to the pubmed IDs.

So it sounds like you may need to use a loop or an apply-family function across your SIs. (If you have a large list of IDs that may take a long time, as the NCBI limits requests to 3 per second, and rentrez enforces this.)
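A minimal sketch of that loop/apply approach, keeping the mapping from each secondary ID to its PubMed IDs. The helper name `search_by_si` and its `search_fn` argument are hypothetical and not part of rentrez; `search_fn` defaults to `rentrez::entrez_search` and exists only so the lookup can be swapped out.

```r
# Hypothetical helper (not part of rentrez): one search per secondary ID,
# returning a list of PMID vectors named by the ID that produced them.
search_by_si <- function(si_ids, search_fn = rentrez::entrez_search) {
  pmids <- lapply(si_ids, function(id) {
    # Append the [SI] field tag so PubMed treats the ID as a secondary ID
    search_fn(db = "pubmed", term = paste0(id, "[SI]"))$ids
  })
  setNames(pmids, si_ids)
}

# Usage (makes one request to NCBI per ID):
# SIs <- c("NCT00137423", "NCT00137424", "NCT00137425")
# pmids_by_si <- search_by_si(SIs)
# pmids_by_si[["NCT00137423"]]
```

Because the result is named by the input IDs, an ID with no linked articles simply maps to an empty element rather than disappearing.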

chackoge commented 7 years ago

I typically have 20-50 SIs, of which ~50% have between 1 and 5 associated pmids, so I should be OK if I'm careful. I use rentrez a lot, so thanks for developing it.

dwinter commented 7 years ago

Thanks @chackoge , always happy to hear people get something from rentrez :smile:

Closing the issue now, but feel free to add to the thread or open a new one if something comes up.

chackoge commented 7 years ago

@dwinter Perhaps you could consider a feature request for rentrez? That is, the ability to search pubmed by secondary ID using a vector of secondary ids as opposed to a one-length character. Example...

nct_list <- entrez_search(db="pubmed",si_id=vec)

where vec is a vector of ids exemplified by "NCT00137423[SI]" Thanks

dwinter commented 7 years ago

Hi @chackoge ,

I am a little bit wary of having functions that return either a list or a single record depending on the input (for a long time entrez_summary had unusual behaviour/bugs because of this).

I think in this case you can get the same effect without too much extra work? Something like this, where sterms is a character vector with one search term per ID (e.g. paste0(SIs, "[SI]")):

nct_list <- lapply(sterms, entrez_search, db="pubmed")
#just the IDs
pmids_list <- lapply(sterms, function(s) entrez_search( db="pubmed", term=s)$ids)

chackoge commented 7 years ago

OK- I thought I'd ask but I see why you're wary. I came up with a rather clunky solution for my use case using Base R. I have no doubt that it could be made more elegant but it works for me, which is what is important in the moment.

nct_pmid_cleanupF <- function(x) {

  # takes an input vector of NCT IDs, e.g. x or df$x,
  # interrogates PubMed using each NCT ID as a secondary ID [SI];
  # each search returns a list, which is clunkily processed as below
  # into a data frame of unique (nct_id, pmid1) pairs.

  library(rentrez)
  nct <- list()
  # seq_along() avoids the 1:0 trap when x is empty
  for (i in seq_along(x)) {
    dat <- entrez_search(db="pubmed", term=paste(x[i], "[SI]", sep=""))
    dat$i <- i
    nct[[i]] <- dat
  }

  # simplify to elements of interest: one row per (query, pmid) pair
  nct2 <- list()
  for (i in seq_along(nct)) {
    nct2[[i]] <- data.frame(cbind(rep(nct[[i]][["QueryTranslation"]], length(nct[[i]][["ids"]])), nct[[i]][["ids"]]), stringsAsFactors=FALSE)
  }

  # clean up: stack the rows, strip the [SI] tag, drop duplicates
  nct3 <- do.call(rbind, nct2)
  nct3$X1 <- gsub("\\[SI\\]$", "", nct3$X1)
  colnames(nct3) <- c("nct_id", "pmid1")
  nct3 <- unique(nct3)

  return(nct3)
}
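
For comparison, a more compact variant might look like the sketch below. It is not the function above verbatim: it labels rows with the input ID rather than the returned QueryTranslation, and the name `si_to_pmid_table` and the `search_fn` argument are hypothetical, added only so the lookup can be replaced (by default it calls `rentrez::entrez_search`).

```r
# Hypothetical compact variant: build one small data frame per NCT ID,
# then stack them and drop duplicate (nct_id, pmid1) pairs.
si_to_pmid_table <- function(nct_ids, search_fn = rentrez::entrez_search) {
  rows <- lapply(nct_ids, function(id) {
    pmids <- search_fn(db = "pubmed", term = paste0(id, "[SI]"))$ids
    # IDs with no hits contribute a zero-row data frame
    data.frame(nct_id = rep(id, length(pmids)),
               pmid1 = as.character(pmids),
               stringsAsFactors = FALSE)
  })
  out <- unique(do.call(rbind, rows))
  rownames(out) <- NULL
  out
}

# Usage (one request to NCBI per ID):
# si_to_pmid_table(c("NCT00137423", "NCT00137424"))
```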

lellean commented 5 years ago

Thanks for this thread! I used

nct_list <- lapply(sterms, entrez_search, db="pubmed")
#just the IDs
pmids_list <- lapply(sterms, function(s) entrez_search( db="pubmed", term=s)$ids)

and it worked well when sterms is a relatively short list (fewer than 150 terms).

When attempting to return a list of pmids for a list of 247 NCT IDs, I get an error: the request is too large. For large requests, try using web history as described in the rentrez tutorial

When I tried to add a date to the sterms, either to each NCT ID or as an add-on, the error persists.

Any advice on how to filter for most recent years? I also tried the use.history= TRUE option.
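
One possible way around this, assuming the error comes from packing many IDs into a single OR-joined term (the thread does not confirm the cause), is to split the IDs into smaller batches and add a publication-date range with PubMed's [PDAT] field tag. The helper name `build_si_terms`, the batch size of 50, and the example years are all assumptions for illustration. Note also that the `entrez_search` argument is spelled `use_history`, with an underscore.

```r
# Hypothetical helper: split IDs into batches and build one OR-joined term per
# batch, optionally restricted to a publication-year range via [PDAT].
build_si_terms <- function(si_ids, chunk_size = 50, years = NULL) {
  # Group positions 1..n into consecutive batches of at most chunk_size
  batches <- split(si_ids, ceiling(seq_along(si_ids) / chunk_size))
  terms <- vapply(batches, function(ids) {
    term <- paste(paste0(ids, "[SI]"), collapse = " OR ")
    if (!is.null(years)) {
      term <- sprintf("(%s) AND %d:%d[PDAT]", term,
                      as.integer(years[1]), as.integer(years[2]))
    }
    term
  }, character(1))
  unname(terms)
}

# Usage (one request per batch; retmax raised because esearch returns
# only 20 IDs by default):
# terms <- build_si_terms(nct_ids, chunk_size = 50, years = c(2015, 2020))
# results <- lapply(terms, rentrez::entrez_search, db = "pubmed", retmax = 200)
```

As noted earlier in the thread, OR-joined batches lose the mapping from each NCT ID to its pmids; searching one ID per request keeps that mapping and keeps every request short.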