muschellij2 / rscopus

Scopus Database API Interface to R

Downloading more than the Scopus quota #12

Closed: Gerwi closed this issue 5 years ago

Gerwi commented 5 years ago

Thanks for developing this package. It has been functioning perfectly so far. However, I have the following issue. My current search via the Scopus Search API indicates that there are about 80k hits. Since the quota for this API is 20,000 publications per week, I can't download them all at once. I was wondering if there is a way to continue the download next week (when Elsevier resets the quota) from publication 20,001 to 40,000, and, after waiting another week, to download 40,001 to 60,000.
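A minimal sketch of that weekly-resume idea, assuming scopus_search()'s zero-based start offset and a result ordering that stays stable between weeks; the query here is only a placeholder:

library(rscopus)
query = "AFFILCOUNTRY(netherlands)"  # placeholder query

# week 1: records 1-20000
week1 = scopus_search(query = query, start = 0, max_count = 20000)

# week 2, after the quota resets: records 20001-40000
week2 = scopus_search(query = query, start = 20000, max_count = 20000)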

muschellij2 commented 5 years ago

Thanks - I can't really work around that quota for now. Also, make sure your API key is registered for research if that's the purpose, as I believe Scopus mandates this.

Can you show me an example of the query or what you want to download? I can't debug or help without specifics.

Gerwi commented 5 years ago

OK, I hope they will provide this in the future, but I am not that optimistic.

In my case, I was trying to retrieve all publications of a small country via the scopus_search API.

For now I have changed my strategy and am planning to use PubMed IDs to circumvent this issue.

muschellij2 commented 5 years ago

Here's some basic code that I think gets you most of the way:

library(rscopus)
au_ids = c(23480260200, 8708052900, 54896131300, 55570070100,
    55479219200, 7409391345, 55500593700, 39362440900)
# get all the data for the authors (including all co-authors)
res = lapply(au_ids, author_data)
names(res) = au_ids

# get co-authors
all_authors = lapply(res, function(x) {
    x$full_data$author
})

# get unique IDs for those authors
unique_authors = lapply(all_authors, function(x) {
    unique(x$authid)
})

# collapse all authors together
combined_authors = unlist(unique_authors)
combined_authors = unique(combined_authors)
# don't need the original authors in there
combined_authors = setdiff(combined_authors, au_ids)

# just doing first 5 due to API limits (but you can run these in chunks)
run_authors = combined_authors[1:5]
all_author_res = lapply(
    run_authors, 
    author_data,
    count = 200, view = "STANDARD")
names(all_author_res) = run_authors
all_author_res[[1]]$df
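One rough way to continue with the remaining authors in those chunks (a sketch, not from the original code):

# hypothetical continuation: pre-split the remaining authors into chunks of 5
author_chunks = split(combined_authors, ceiling(seq_along(combined_authors) / 5))
# in a later session, once the quota resets, run the next chunk:
# lapply(author_chunks[[2]], author_data, count = 200, view = "STANDARD")
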
Gerwi commented 5 years ago

Thanks a lot for this suggestion.

For now I am trying the following approach, starting with a CSV file containing PubMed IDs:

# Cut pubmed_list into small parts: large search requests are not handled
# by the API, and smaller ones also help maximize the weekly quota.
# For each part, a search string is created in the format
# "PMID(123456 OR 123457)", which scopus_search can handle.
# The output objects are then stored in a list.

chunk <- 5
n <- nrow(pubmed_list)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(pubmed_list, r)

res_list <- list()
# note: range(1:4) returns only c(1, 4) in R, so loop over 1:4 instead
for (number in 1:4) {
    # first column of the chunk holds the PubMed IDs
    pmids <- d[[number]][[1]]
    # collapse the IDs into a single "PMID(123456 OR 123457)" query
    string <- paste0("PMID(", paste(pmids, collapse = " OR "), ")")
    # max_count should cover the whole chunk, not just one record
    res <- scopus_search(query = string, view = "COMPLETE", max_count = chunk)
    res_list[[number]] <- res
}
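For a chunk of two IDs, the constructed query looks like this (using the placeholder IDs from the comment above):

string
#> [1] "PMID(123456 OR 123457)"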

OT: not related to this package, but nevertheless worth mentioning: some articles are included twice in Elsevier. For example: PMID(30428293)

muschellij2 commented 5 years ago

Please follow up with the duplicates with Elsevier/Scopus.

You seem to have changed your goal. I gave the solution I feel you requested and have provided the tools, but I don't have any other information on these things. You can open another issue for the quota limits, but otherwise this is a scripting question, not a development question, so I am closing.

muschellij2 commented 5 years ago

OK - where are you seeing the 80k limit?

muschellij2 commented 5 years ago

I think PubMed IDs may cause some problems: I've seen them not return results, depending on the permissions of my API key ("API key in this example was setup with authorized CORS domains."), when trying the interactive APIs: https://dev.elsevier.com/interactive.html

muschellij2 commented 5 years ago

For example, PMID 30391859 exists in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/30391859), but searching PMID(30391859) in Scopus returns nothing.
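A minimal way to reproduce that, assuming the list returned by scopus_search() includes total_results:

library(rscopus)
res = scopus_search(query = "PMID(30391859)", max_count = 1)
res$total_results  # 0 here, even though the PubMed record exists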

Gerwi commented 5 years ago

Just as clarification, since I am not encountering issues anymore (so this can remain closed): my current strategy is to search for a particular disease, for example "Heart Defects, Congenital"[Mesh], download the list of PubMed IDs, save them in a CSV, and cut them into chunks. A for loop then transforms the IDs into a search string, and the output data frames are stored in lists, which are rbind-ed into data frames. The for loop breaks automatically when the quota is reached.

pubmed_list <- read.csv("pubmed_list_diabetes.csv")

chunk <- 1
n <- nrow(pubmed_list)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(pubmed_list, r)

publications_list = list()
affiliations_list = list()
authors_list = list()
remaining = 10
for (number in 1:20) {
    if (remaining < chunk) {
        break
    }
    # first column of the chunk holds the PubMed IDs
    pmids = d[[number]][[1]]
    string = paste0("PMID(", paste(pmids, collapse = " OR "), ")")
    res = scopus_search(query = string, view = "COMPLETE", max_count = 1)
    entries = gen_entries_to_df(res$entries)
    entries$df$entry_number2 = paste0(number, ".", entries$df$entry_number)
    publications_list[[number]] = entries$df
    entries$affiliation$entry_number2 = paste0(number, ".", entries$affiliation$entry_number)
    affiliations_list[[number]] = entries$affiliation
    entries$author$entry_number2 = paste0(number, ".", entries$author$entry_number)
    authors_list[[number]] = entries$author
    # the rate-limit header is a string, so convert before comparing
    remaining = as.numeric(res$get_statements$headers$`x-ratelimit-remaining`)
}
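The rbind step mentioned above could then look roughly like this; a sketch, and if columns differ across chunks, dplyr::bind_rows() is more forgiving than rbind():

# combine the per-chunk results into single data frames
publications = do.call(rbind, publications_list)
affiliations = do.call(rbind, affiliations_list)
authors = do.call(rbind, authors_list)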

> For example, PMID 30391859 exists in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/30391859), but searching PMID(30391859) in Scopus returns nothing.

The searches returning no articles can be due to two reasons: either the article is not indexed in Scopus at all, or it is indexed but its PubMed ID is not linked to the Scopus record.

The first reason is not solvable, but the second one can be quite easily corrected by downloading from PubMed a CSV linking titles to PubMed IDs, which can then be used to search by title for the articles that return no result when searched by PubMed ID.
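A minimal sketch of that title fallback, assuming a hypothetical data frame pmid_titles (columns pmid and title) from the PubMed export; TITLE() is the Scopus field for title searches:

library(rscopus)
# look up one failing PMID by its title instead of its ID
title_query = paste0('TITLE("', pmid_titles$title[1], '")')
res = scopus_search(query = title_query, view = "COMPLETE", max_count = 1)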