Closed gorkang closed 6 years ago
thanks @gorkang for this.
Good question. I'm not exactly sure how to get it.
I've asked the ORCID devs and will get back to you.
@gorkang got a reply - will reply back soon with a solution that works here
@gorkang this is a solution. thoughts?
id <- "0000-0001-7678-8656"
x <- orcid_works(id)
vapply(x[[1]]$group$`work-summary`, function(z) {
orcid_works(id, put_code = z$`put-code`)[[1]]$citation$`citation-value`
}, "")
[1] "@article{Navarrete2017,title = {Editorial: The reasoning brain: The interplay between cognitive neuroscience and theories of reasoning},journal = {Frontiers in Human Neuroscience},year = {2017},volume = {10},author = {Goel, V. and Navarrete, G. and Noveck, I.A. and Prado, J.}}"
[2] "@article{Navarrete2017,title = {Preference for Curvilinear Contour in Interior Architectural Spaces: Evidence From Experts and Nonexperts},journal = {Psychology of Aesthetics, Creativity, and the Arts},year = {2017},author = {Vartanian, O. and Navarrete, G. and Chatterjee, A. and Fich, L.B. and Leder, H. and Cristi{\\'a}n, M. and Rostrup, N. and Skov, M. and Corradi, G. and Nadal, M.}}"
[3] "@article{Navarrete2017,title = {Social cognition and executive functions as key factors for effective pedagogy in higher education},journal = {Frontiers in Psychology},year = {2017},volume = {8},number = {NOV},author = {Correia, R. and Navarrete, G.}}"
... etc.
Thanks for the prompt response @sckott !
OK, so as far as I understand, when you call orcid_works() for a single record and include the put-code, it returns the citation information. Awesome!
My worry with the new system is that now I would need two calls to orcid to accomplish what I was doing with only one with the previous version. I did a very basic benchmarking to see the potential consequences.
I used two different profiles:
I switched to purrr::map because the vapply() code was giving me trouble for the complex profile (the time difference between the vapply() and the map() version for the simple profile seems negligible). The code is:
# Orcid 0.3.0 -------------------------------------------------------------
# require(devtools)
# install_version("rorcid", version = "0.3.0", repos = "http://cran.us.r-project.org")
library(rorcid)
library(dplyr)
library(purrr)
ids = c("0000-0001-7678-8656", "0000-0001-6758-5101")
benchmark_orcid <- function(id) {
tictoc::tic()
df_3[[id]] <<- works(orcid_id(id))$data
cat(id, " - ")
tictoc::toc()
}
df_3 = list()
ids %>% map(~ benchmark_orcid(.x)) %>% invisible()
# Orcid 0.4.0 -------------------------------------------------------------
# require(devtools)
# install_version("rorcid", version = "0.4.0", repos = "http://cran.us.r-project.org")
library(rorcid)
library(dplyr)
library(purrr)
ids = c("0000-0001-7678-8656", "0000-0001-6758-5101")
benchmark_orcid <- function(id) {
tictoc::tic()
df_4[[id]] <<- rorcid::orcid_works(id)[[1]]$group$`work-summary` %>%
purrr::map(~ orcid_works(id, put_code = .x$`put-code`))
cat(id, " - ")
tictoc::toc()
}
df_4 = list()
ids %>% map(~ benchmark_orcid(.x)) %>% invisible()
Leaving aside the inconsistency with # of records in the second profile (probably not important for the purposes of this issue), the results are:
| ORCID_ID | Works | Time (s) | Version |
|---|---|---|---|
| 0000-0001-7678-8656 | 24 | 1.1 | 0.3.0 |
| 0000-0001-6758-5101 | 326 | 1.9 | 0.3.0 |
| 0000-0001-7678-8656 | 24 | 5 | 0.4.0 |
| 0000-0001-6758-5101 | 225 | 58 | 0.4.0 |
As you can see, there is a huge penalty with the new code/version, particularly for large profiles.
Not sure how hard it would be to add the detailed info (citation, etc.) as a nested list inside the group list in the orcid_works() call (alongside the work-summary list), or if there is any other potential solution you can think of.
Thanks again for the help!
thanks for the detailed look at this.
short answer: ORCID will retire the older API soon, so there's no option of going back to it. They decided for whatever reason to move more detailed citations out of responses for a given ORCID ID, and require one to request that information for each individual work - so we're sort of stuck with this. We can of course try to optimize the code to make it as fast as possible.
Thanks for letting me know.
Do you know if we(I) can pass the above info along to them? Anyone specifically? Here? I would love to try to understand the logic and see if there is any way to get the details for all the works of a researcher in a single call.
I am implementing a pilot system to automatically check the publications of a small group of researchers using ORCID. With the previous system, I was making at most 12 calls each time, and now I have to go up to >300. If all goes well, we might try it university-wide, but with the new API it looks less feasible.
Cheers!
This is where I asked about it https://groups.google.com/forum/#!topic/orcid-api-users/xf06IxQqSbY so you can maybe comment there to ask them.
I want to help you make it feasible to use this, we'll try to make it happen.
Hi Guys, It's just not possible with standard HTTP calls when you factor in hyper-authors (defined as having 1000s of works with 1000s of contributors).
Here are the counts of works grouped by ORCID: 48593 21400 19454 6978 6348 4962 4367 4202 4107 3772 3350 3323 3308 3168 3149 3085 3081 2876 2728 2722 2461 2457 2373 2366 2336 2296 2275 2261 2221 2212 2206 2196 2111 2049 2047 2036 2029 2019 2015 2014
As the thread points out, you can resolve 50 works at a time. https://groups.google.com/forum/#!topic/orcid-api-users/xf06IxQqSbY
Thanks, Rob
thanks @rcpeters - forgot about that already
@gorkang i'll try the up to 50 put codes and see how that affects timing.
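A minimal sketch of the batching idea, in base R only: split a vector of put-codes into groups of at most 50, so that each group could be sent in one request. The helper name `chunk_put_codes` is hypothetical and is not part of rorcid; the sketch says nothing about how `orcid_works()` handles this internally.

```r
# Hypothetical helper (not part of rorcid): split put-codes into
# batches of at most `size` elements, preserving their order.
chunk_put_codes <- function(put_codes, size = 50) {
  if (length(put_codes) == 0) return(list())
  unname(split(put_codes, ceiling(seq_along(put_codes) / size)))
}

# Example with made-up put-codes for a profile with 326 works
chunks <- chunk_put_codes(seq_len(326))
length(chunks)   # number of requests needed
lengths(chunks)  # size of each batch
```

Each element of `chunks` could then be passed as the `put_code` argument of a separate `orcid_works(id, put_code = ...)` call, so a 326-work profile would need 7 requests instead of 326.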
@gorkang made a change in orcid_works - reinstall with devtools::install_github("ropensci/rorcid") and try e.g., https://github.com/ropensci/rorcid/blob/master/R/orcid_works.R#L44-L47
Thanks @sckott and @rcpeters ! I really appreciate your help.
If it is possible to get the detailed works (including citations) of a researcher 50 at a time, that would be awesome! 1-3 calls should be enough for most researchers.
Not sure if there is a filter parameter we could use (e.g. filter = c(publication-date.year.value > 2014)) directly in the orcid_works() call.
Alternatively, we can filter the first orcid_works(id) call to minimize the number of records we ask for. That pulls down the two examples above to 2.5 and 18 seconds (see code). Still a lot, but with the 50/call implementation, time should go down dramatically.
# ORCID 0.4.0 Filtered ----------------------------------------------------
# require(devtools)
# install_version("rorcid", version = "0.4.0", repos = "http://cran.us.r-project.org")
library(rorcid)
library(dplyr)
library(purrr)
ids = c("0000-0001-7678-8656", "0000-0001-6758-5101")
id = c("0000-0001-7678-8656")
benchmark_orcid <- function(id) {
tictoc::tic()
df_4[[id]] <<- rorcid::orcid_works(id)[[1]]$group$`work-summary` %>%
bind_rows() %>%
filter(`publication-date.year.value` > 2014) %>%
select(`put-code`) %>% # We only ask for the records we need to minimize # of calls.
purrr::map(~ orcid_works(id, put_code = .x)) #print(.x))#
cat(id, " - ")
tictoc::toc()
}
df_4 = list()
ids %>% map(~ benchmark_orcid(.x)) %>% invisible()
The only filtering that might be possible is with the orcid() function - e.g., you can do orcid('work-titles:Modern developments in holography and its materials'), but I don't think they expose dates of works in that API route. @rcpeters ? But even if you could search with a date filter like that, orcid() only returns the ORCID IDs, and not the works themselves.
The solution that seems best is to filter as you do above in your code, with dplyr or similar.
@sckott, that improves things a lot. With the short profile time goes down to < 1s.
But, with the long profile, for some reason, your code fails... (see code below)
benchmark_orcid <- function(id) {
tictoc::tic()
x <- orcid_works(id)
pcodes <- vapply(x[[1]]$group$`work-summary`, "[[", 1, "put-code")
length(pcodes)
res <- orcid_works(id, put_code = pcodes)
res[[1]]$bulk$`work.citation.citation-value`
cat(id, " - ")
tictoc::toc()
}
# Down to <1s!
id = "0000-0001-7678-8656"
benchmark_orcid(id)
# ERROR:
# Error in vapply(x[[1]]$group$`work-summary`, "[[", 1, "put-code") :
# values must be length 1,
# but FUN(X[[67]]) result is length 2
id = "0000-0001-6758-5101"
benchmark_orcid(id)
Do you think splitting more than 50 put-codes into 50-item chunks could be implemented directly in the orcid_works() function?
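The vapply() error above can be reproduced offline with mock data. This is only a sketch: it assumes that a `work-summary` element can hold more than one row (which the "result is length 2" message suggests), and the put-code values are made up.

```r
# Mock of x[[1]]$group$`work-summary`: the second element has two rows,
# as assumed for a work that exists in more than one version.
work_summaries <- list(
  data.frame(`put-code` = 101L, check.names = FALSE),
  data.frame(`put-code` = c(102L, 103L), check.names = FALSE)
)

# vapply() insists on exactly one value per element, so it stops here.
strict <- tryCatch(
  vapply(work_summaries, "[[", 1, "put-code"),
  error = function(e) conditionMessage(e)
)

# lapply() + unlist() accepts any number of values per element.
flexible <- unlist(lapply(work_summaries, "[[", "put-code"))
flexible
```

`strict` ends up holding the error message, while `flexible` is a plain vector with all three put-codes.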
okay, reinstall again
Awesome!
0000-0001-7678-8656 - 1.096 sec elapsed
0000-0001-6758-5101 - 1.57 sec elapsed
Sorry for nitpicking... and please, do ask me to go to hell if you feel like it. It works great, the only problem being that the structure of the list changes depending on the number of put-codes blocks:
Would it be possible for the bulk call to have the same structure, combining all the chunks in a single bulk list (as in the 0000-0001-7678-8656 list above)?
ORCID-ID --- put-code --- ORCID-ID --- bulk
Thanks again!
does that work better?
id <- "0000-0001-6758-5101"
x <- orcid_works(id)
pcodes <- unlist(lapply(x[[1]]$group$`work-summary`, "[[", "put-code"))
res <- orcid_works(id, put_code = pcodes)
df <- tibble::as_tibble(
data.table::setDF(data.table::rbindlist(res, use.names = TRUE, fill = TRUE)))
#> # A tibble: 326 x 29
#> `work.put-code` work.path `work.short-descri… work.type `work.language-c… work.country work.visibility
#> <int> <chr> <lgl> <chr> <lgl> <lgl> <chr>
#> 1 40305375 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 2 39332636 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 3 39349243 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 4 37133164 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 5 35201165 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 6 33165074 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 7 35000831 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 8 34115375 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 9 33174310 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> 10 29117679 /0000-0001-6758-5… NA JOURNAL_AR… NA NA PUBLIC
#> # ... with 316 more rows, and 22 more variables: `work.created-date.value` <dbl>,
#> # `work.last-modified-date.value` <dbl>, `work.source.source-orcid` <lgl>,
#> # `work.source.source-client-id.uri` <chr>, `work.source.source-client-id.path` <chr>,
#> # `work.source.source-client-id.host` <chr>, `work.source.source-name.value` <chr>, work.title.subtitle <lgl>,
#> # `work.title.translated-title` <lgl>, work.title.title.value <chr>, `work.journal-title.value` <chr>,
#> # `work.citation.citation-type` <chr>, `work.citation.citation-value` <chr>,
#> # `work.publication-date.media-type` <lgl>, `work.publication-date.year.value` <chr>,
#> # `work.publication-date.month.value` <chr>, `work.publication-date.day.value` <chr>,
#> # `work.external-ids.external-id` <list>, work.url.value <chr>, work.contributors.contributor <list>,
#> # `work.publication-date.day` <lgl>, work.title.subtitle.value <chr>
then df$`work.citation.citation-value` to get all citations
Mmmm... weird. That last line gives me an error:
Error: Column `bulk` must be a 1d atomic vector or a list
what line?
and did you reinstall?
I reinstalled, but it seems I didn't properly restart the R session :S
Anyway, yes! It works! I will try to check everything with a bit more time tomorrow and let you know how it goes.
Thanks again for all the help.
Great, glad it works.
I had some time this morning to take a closer look.
Timewise things are really awesome now.
0000-0001-7678-8656 - 0.961 sec elapsed
0000-0001-6758-5101 - 1.209 sec elapsed
The only remaining issue is that orcid_works() returns lists with different structures depending on the number of records (<= 50 vs > 50). See code below (# 3. Both simultaneously).
As you can see, the simple profile list ("0000-0001-7678-8656") has a different structure to the complex profile list ("0000-0001-6758-5101"). Turning any of those to a df is trivial (see # 1. Simple profile and # 2. Complex profile below), but turning both at the same time not so much (see # 3. Both simultaneously).
orcid_works() could return a list with the same structure regardless of the number of works, something like [[ORCID-ID]][[works]] or similar (?). That would make it much easier to then combine multiple profiles in a single DF. This is the code I am using:
# devtools::install_github("ropensci/rorcid")
library(rorcid)
library(dplyr)
library(purrr)
# Function ----------------------------------------------------------------
benchmark_orcid <- function(id) {
list_orcid <- list()
tictoc::tic()
# Get put-codes
put_codes = rorcid::orcid_works(id)[[1]]$group$`work-summary` %>%
bind_rows() %>%
filter(`publication-date.year.value` > 2014) %>% # we only ask for the records we need to minimize # of calls.
pull(`put-code`)
# Get info of those put codes
list_orcid[[id]] <- orcid_works(id, put_code = put_codes) #%>% bind_rows()
cat(id, " - ")
tictoc::toc()
list_orcid
}
# Fetching works ----------------------------------------------------------
# 1. Simple profile (< 50 works)
list_simple = benchmark_orcid("0000-0001-7678-8656")
df_simple = list_simple[[1]][[1]][[1]]
# 2. Complex profile ( > 50 works)
list_complex = benchmark_orcid("0000-0001-6758-5101")
df_complex = list_complex[[1]] %>% bind_rows()
# 3. Both simultaneously
ids = c("0000-0001-7678-8656", "0000-0001-6758-5101")
list_multiple = ids %>% map(~ benchmark_orcid(.x)) %>% invisible()
# df_multiple = ?
Thanks!
thx, will have a look
@gorkang I think it's improved now. can you reinstall and try again
the structure has changed a bit, so here's your modified function:
benchmark_orcid <- function(id) {
tictoc::tic()
# Get put-codes
put_codes = orcid_works(id)[[1]] %>%
bind_rows() %>%
filter(`publication-date.year.value` > 2014) %>% # we only ask for the records we need to minimize # of calls.
pull(`put-code`)
# Get info of those put codes
list_orcid <- orcid_works(id, put_code = put_codes) #%>% bind_rows()
cat(id, " - ")
tictoc::toc()
list_orcid
}
Thanks for taking a look @sckott . Things have indeed improved.
Now we have a common structure in the first two levels:
Profiles with < 51 publications: [[1]][[ORCID-ID]][[1]]
Profiles with > 50 publications: [[1]][[ORCID-ID]][[1]] and [[1]][[ORCID-ID]][[2]], etc. (one extra list for each 50 entries, so, a profile with 140 pubs would have [[1]] & [[2]] & [[3]] at the third level)
With this structure, after fetching the data of multiple researchers:
ids = c("0000-0001-7678-8656", "0000-0001-6758-5101")
list_multiple = ids %>% map(~ benchmark_orcid(.x)) %>% invisible()
I can combine all the researchers in a single DF, independently of the number of publications they have:
df_combined = 1:length(list_multiple) %>% # For each researcher
map( ~ list_multiple[[.x]][[1]] %>% bind_rows()) %>% # Create one DF for each researcher
bind_rows() # Bind all researchers
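The same two-step combination can be tried offline on mock data. This is a base R sketch under stated assumptions: the nested shape mirrors the [[1]][[ORCID-ID]][[chunk]] layout described above, the data frames and their columns are invented, and `combine_profiles` is a hypothetical helper. Real results may have columns that differ between chunks, in which case dplyr::bind_rows() is the safer choice than rbind().

```r
# Mock results: one list per researcher, named by ORCID iD,
# holding one data frame per 50-work chunk (values are made up).
list_multiple <- list(
  list("0000-0001-7678-8656" = list(
    data.frame(put_code = 1:2, title = c("A", "B"))
  )),
  list("0000-0001-6758-5101" = list(
    data.frame(put_code = 3:4, title = c("C", "D")),
    data.frame(put_code = 5L, title = "E")
  ))
)

# Hypothetical helper: bind the chunks of each researcher, tag the rows
# with the ORCID iD, then bind all researchers together.
combine_profiles <- function(results) {
  per_researcher <- lapply(results, function(res) {
    works <- do.call(rbind, res[[1]])
    cbind(orcid = names(res)[1], works)
  })
  do.call(rbind, per_researcher)
}

df_combined <- combine_profiles(list_multiple)
nrow(df_combined)
```

Keeping the ORCID iD as a column makes it possible to tell researchers apart after everything is stacked into one data frame.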
Not sure if I am nitpicking, or how easy it would be to go one step further (please, feel free to close the issue if you like, as things are good enough now). But, ideally, it would be nice if the structure was exactly the same independently of the number of publications of the profile. Something like:
Thanks again for all the help.
try again after reinstalling @gorkang
This is great. It works now as expected. Thanks!
Now I can combine the lists using:
df_combined = do.call(rbind, list_multiple %>% flatten()) %>% bind_rows()
With the rorcid 0.3.0 version I could run:
And get the citation information for all the works from the column work-citation.citation.
In the 0.4.0 version, when I get the works, lots of the old columns are gone, including the citation:
Is there any way to get the complete ORCID works info with the new rorcid package?
PS: Thanks for bringing up to speed the rorcid package with the orcid_fundings() and all the other new functions!