ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Random sample of large set of search results #194

Open hlappen opened 3 months ago

hlappen commented 3 months ago

I'm working on a project where I want to extract a few pieces of information from the XML of PubMed records. I've managed to do this on a smaller scale (~1k records).

The problem is that I need to do this on much larger sets of records. My searches return about 600k results, and I'd like to extract a random sample of those (at least 15%), since the full set might be overkill. The trouble is getting the full list of results from which to draw the sample. I can get a list of ids up to the retmax limit of 10k, but I can't figure out how to get the rest of them.

I know I can use fetch on the web history and iterate through the set in batches, but that would mean downloading the whole XML record of all 600k articles just to get the PMIDs, at which point I might as well extract the info I want from all of them. I've also considered breaking the searches up into smaller sets of results, but the best I can do is to limit by year, and those sets are still about 50k-80k each.
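The idea of breaking the search into smaller pieces could be pushed from years down to months. The following is only a minimal sketch in base R, assuming that PubMed accepts a publication-date range written as `YYYY/MM/DD:YYYY/MM/DD[dp]`; the year 2020 and the variable names are placeholders for illustration, not part of the original search.

```r
# Sketch: split one large PubMed query into month-sized sub-queries.
# Assumption: "YYYY/MM/DD:YYYY/MM/DD[dp]" restricts results by publication date.
base_query <- "clowns AND hospital AND randomized"

# First day of each month in a hypothetical period of interest
month_starts <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")

# Last day of each month: the day before the first day of the following month
month_ends <- seq(as.Date("2020-02-01"), as.Date("2021-01-01"), by = "month") - 1

# One search term per month, with the base query wrapped in parentheses
monthly_terms <- sprintf(
  "(%s) AND %s:%s[dp]",
  base_query,
  format(month_starts, "%Y/%m/%d"),
  format(month_ends, "%Y/%m/%d")
)

monthly_terms[2]
```

Each element of `monthly_terms` could then be passed as `term` to `entrez_search()` and the returned `ids` collected. Whether every monthly chunk stays under the 10k limit, and how records with incomplete publication dates are assigned to months, are open questions here, so it would be worth checking that the per-month counts add up to the count of the full search.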

As a simple (and small) example,

search_query <- "clowns AND hospital AND randomized"
clowns <- entrez_search(db = "pubmed", term = search_query, use_history = TRUE)

The search matches 37 records, all of which are stored in the web history, but only 20 ids come back because of the retmax limit. I know that I can raise that limit in this example to get all 37, but I can't for the larger searches I'm doing. So, for the sake of the example, let's assume that 20 is the limit.

I can get the 20 ids using clowns$ids, but is there a way to efficiently get a list of all 37 ids from the web history, so I could do something like sample_ids <- sample(clowns$ids, 30)?

Probably not important, but I'm grabbing the PMID, indexing method, last modification date, and MeSH terms from each record and putting them into a data.frame.
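For the sampling step itself, here is a small sketch, assuming the ids have somehow been gathered in several pieces (for example, one `ids` vector per smaller search). The id vectors below are made-up placeholders, not real PMIDs, and `ids_by_chunk` and `sample_fraction` are hypothetical names.

```r
set.seed(1)  # only so the sketch is reproducible

# Placeholder id vectors standing in for the ids of each smaller search
ids_by_chunk <- list(
  as.character(1:20),
  as.character(15:37)  # overlaps on purpose to show de-duplication
)

# Pool the pieces and drop any id that appears more than once
all_ids <- unique(unlist(ids_by_chunk))

# Draw a 15% sample without replacement, rounding the size up
sample_fraction <- 0.15
sample_size <- ceiling(sample_fraction * length(all_ids))
sample_ids <- sample(all_ids, sample_size)
```

Keeping the ids as character vectors, as in `clowns$ids`, also avoids the base R quirk where `sample()` treats a single number `n` as `1:n`.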

allenbaron commented 2 months ago

rentrez is a wrapper for the Entrez Utilities published by NCBI and covers most of the functionality of those tools. It would probably be best to ask NCBI directly. If you do, please consider posting their response here. Questions like this are fairly common in rentrez's issues section, and the answer would be helpful for others.