muschellij2 / rscopus

Scopus Database API Interface to R

Downloading more than the Scopus quota #12

Closed: Gerwi closed this issue 5 years ago

Gerwi commented 5 years ago

Thanks for developing this package. It has been functioning perfectly so far. However, I have the following issue. My current search via the Scopus Search API indicates that there are about 80k hits. Since the quota for this API is 20,000 publications per week, I can't download them all at once. I was wondering if there is a way to continue the download next week (when Elsevier resets the quota) from publication 20,001 to 40,000, and, after waiting another week, to download 40,001 to 60,000.
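A minimal sketch of that weekly-resume idea, assuming scopus_search()'s zero-based start offset and a result ordering that stays stable between weeks; the query here is only a placeholder:

library(rscopus)
query = "AFFILCOUNTRY(netherlands)"  # placeholder query

# week 1: records 1-20000
week1 = scopus_search(query = query, start = 0, max_count = 20000)

# week 2, after the quota resets: records 20001-40000
week2 = scopus_search(query = query, start = 20000, max_count = 20000)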

muschellij2 commented 5 years ago

Thanks - I can't really work around that quota for now. Also, make sure your API key is registered for research if that's the purpose, as I believe Scopus mandates this.

Can you show me an example of the query or what you want to download? I can't debug or help without specifics.

Gerwi commented 5 years ago

OK, I hope they will provide this in the future, but I am not that optimistic.

In my case, I was trying to retrieve all publications of a small country via the scopus_search API.

For now I have changed my strategy and am planning to use PubMed IDs to circumvent this issue.

muschellij2 commented 5 years ago

Here's some basic code that I think gets you most of the way:

library(rscopus)
au_ids = c(23480260200, 8708052900, 54896131300, 55570070100,
    55479219200, 7409391345, 55500593700, 39362440900)
# get all the data for the authors (including all co-authors)
res = lapply(au_ids, author_data)
names(res) = au_ids

# get co-authors
all_authors = lapply(res, function(x) {
    x$full_data$author
})

# get unique IDs for those authors
unique_authors = lapply(all_authors, function(x) {
    unique(x$authid)
})

# collapse all authors together
combined_authors = unlist(unique_authors)
combined_authors = unique(combined_authors)
# don't need the original authors in there
combined_authors = setdiff(combined_authors, au_ids)

# just doing first 5 due to API limits (but you can run these in chunks)
run_authors = combined_authors[1:5]
all_author_res = lapply(
    run_authors, 
    author_data,
    count = 200, view = "STANDARD")
names(all_author_res) = run_authors
all_author_res[[1]]$df
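One rough way to continue with the remaining authors in those chunks (a sketch, not from the original code):

# hypothetical continuation: pre-split the remaining authors into chunks of 5
author_chunks = split(combined_authors, ceiling(seq_along(combined_authors) / 5))
# in a later session, once the quota resets, run the next chunk:
# lapply(author_chunks[[2]], author_data, count = 200, view = "STANDARD")
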
Gerwi commented 5 years ago

Thanks a lot for this suggestion.

For now I am trying the following approach, starting with a CSV file containing PubMed IDs:

# Cut pubmed_list into small parts: large search requests are not handled
# by the API, and smaller ones also help maximize the weekly quota.
# For each part, a search string is created in the format
# "PMID(123456 OR 123457)", which scopus_search can handle.
# The output objects are then stored in a list.

chunk <- 5
n <- nrow(pubmed_list)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(pubmed_list, r)

res_list <- list()
# note: range(1:4) returns only c(1, 4) in R, so loop over 1:4 instead
for (number in 1:4) {
    # first column of the chunk holds the PubMed IDs
    pmids <- d[[number]][[1]]
    # collapse the IDs into a single "PMID(123456 OR 123457)" query
    string <- paste0("PMID(", paste(pmids, collapse = " OR "), ")")
    # max_count should cover the whole chunk, not just one record
    res <- scopus_search(query = string, view = "COMPLETE", max_count = chunk)
    res_list[[number]] <- res
}
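For a chunk of two IDs, the constructed query looks like this (using the placeholder IDs from the comment above):

string
#> [1] "PMID(123456 OR 123457)"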

OT: not related to this package, but nevertheless worth mentioning: some articles are included twice in Elsevier. For example: PMID(30428293)

muschellij2 commented 5 years ago

Please follow up with the duplicates with Elsevier/Scopus.

You seem to have changed your goal. I gave the solution I feel you requested and have provided the tools, but I don't have any other information on these things. You can open another issue for the quota limits, but otherwise this is a scripting question, not a development question, so I am closing.

muschellij2 commented 5 years ago

OK - where are you seeing the 80k limit?

muschellij2 commented 5 years ago

I think PubMed IDs may cause some problems: I've seen them not return results, depending on the permissions of my API key ("API key in this example was setup with authorized CORS domains."), when trying the interactive APIs: https://dev.elsevier.com/interactive.html

muschellij2 commented 5 years ago

For example, PMID 30391859 exists in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/30391859), but searching PMID(30391859) in Scopus returns nothing.
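A minimal way to reproduce that, assuming the list returned by scopus_search() includes total_results:

library(rscopus)
res = scopus_search(query = "PMID(30391859)", max_count = 1)
res$total_results  # 0 here, even though the PubMed record exists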

Gerwi commented 5 years ago

Just as clarification, since I am not encountering issues anymore (so this can remain closed): my current strategy is to search for a particular disease, for example "Heart Defects, Congenital"[Mesh], download the list of PubMed IDs, save them in a CSV, and cut them into chunks. A for loop then transforms the IDs into a search string, and the output data frames are stored in lists, which are rbind-ed into data frames. The for loop breaks automatically when the quota is reached.

pubmed_list <- read.csv("pubmed_list_diabetes.csv")

chunk <- 1
n <- nrow(pubmed_list)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(pubmed_list, r)

publications_list = list()
affiliations_list = list()
authors_list = list()
remaining = 10
for (number in 1:20) {
    if (remaining < chunk) {
        break
    }
    # first column of the chunk holds the PubMed IDs
    pmids = d[[number]][[1]]
    string = paste0("PMID(", paste(pmids, collapse = " OR "), ")")
    res = scopus_search(query = string, view = "COMPLETE", max_count = 1)
    entries = gen_entries_to_df(res$entries)
    entries$df$entry_number2 = paste0(number, ".", entries$df$entry_number)
    publications_list[[number]] = entries$df
    entries$affiliation$entry_number2 = paste0(number, ".", entries$affiliation$entry_number)
    affiliations_list[[number]] = entries$affiliation
    entries$author$entry_number2 = paste0(number, ".", entries$author$entry_number)
    authors_list[[number]] = entries$author
    # the rate-limit header is a string, so convert before comparing
    remaining = as.numeric(res$get_statements$headers$`x-ratelimit-remaining`)
}
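The rbind step mentioned above could then look roughly like this; a sketch, and if columns differ across chunks, dplyr::bind_rows() is more forgiving than rbind():

# combine the per-chunk results into single data frames
publications = do.call(rbind, publications_list)
affiliations = do.call(rbind, affiliations_list)
authors = do.call(rbind, authors_list)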

> For example, PMID 30391859 exists in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/30391859), but searching PMID(30391859) in Scopus returns nothing.

The searches returning no articles can be due to two reasons: either the article is not indexed in Scopus at all, or it is indexed but its PubMed ID is not linked to the Scopus record.

The first reason is not solvable, but the second one can be quite easily corrected by downloading from PubMed a CSV linking titles to PubMed IDs, which can then be used to search by title for the articles that return no result when searched by PubMed ID.
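A minimal sketch of that title fallback, assuming a hypothetical data frame pmid_titles (columns pmid and title) from the PubMed export; TITLE() is the Scopus field for title searches:

library(rscopus)
# look up one failing PMID by its title instead of its ID
title_query = paste0('TITLE("', pmid_titles$title[1], '")')
res = scopus_search(query = title_query, view = "COMPLETE", max_count = 1)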