ropensci-archive / rorcid

:warning: ARCHIVED :warning: A programmatic interface to the Orcid.org API

orcid_peer_reviews() not getting journal name? #52

Closed gorkang closed 5 years ago

gorkang commented 6 years ago

I am trying to get the peer review activity from ORCID profiles using orcid_peer_reviews(). Everything seems to work fine, but I cannot find the journal names of the reviews.

For example, to get the following review from an ORCID profile... [screenshot from 2018-04-08 10-54-28]

I use the code below, but the closest I can get to the journal name is through the Publons website URL. I can't see it in the general orcid_peer_reviews(id) or the orcid_peer_reviews(id, put_code) calls.

id = "0000-0001-7678-8656"

# Get reviews  
temp_reviews = orcid_peer_reviews(id)[[1]]

# Get details of specific review
temp_reviews_2 = orcid_peer_reviews(id, put_code = "220419")[[1]]

# Using the Publons website I can get to the journal name, but I'd need to scrape it or similar...
temp_reviews_2$`review-identifiers`$`external-id`$`external-id-url.value`

Below the session details.

Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.442)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Santiago            
 date     2018-04-08                  

Packages -----------------------------------------------------------------------
 package    * version    date       source                          
 assertthat   0.2.0      2017-04-11 CRAN (R 3.4.1)                  
 backports    1.1.2      2017-12-13 CRAN (R 3.4.3)                  
 base       * 3.4.4      2018-03-16 local                           
 bindr        0.1.1      2018-03-13 CRAN (R 3.4.3)                  
 bindrcpp     0.2        2017-06-17 CRAN (R 3.4.1)                  
 bookdown     0.7        2018-02-18 CRAN (R 3.4.3)                  
 compiler     3.4.4      2018-03-16 local                           
 crul         0.5.2      2018-02-24 CRAN (R 3.4.3)                  
 curl         3.1        2017-12-12 CRAN (R 3.4.3)                  
 datasets   * 3.4.4      2018-03-16 local                           
 devtools     1.13.5     2018-02-18 CRAN (R 3.4.3)                  
 digest       0.6.15     2018-01-28 CRAN (R 3.4.3)                  
 dplyr      * 0.7.4      2017-09-28 CRAN (R 3.4.2)                  
 evaluate     0.10.1     2017-06-24 CRAN (R 3.4.1)                  
 glue         1.2.0      2017-10-29 CRAN (R 3.4.2)                  
 graphics   * 3.4.4      2018-03-16 local                           
 grDevices  * 3.4.4      2018-03-16 local                           
 htmltools    0.3.6      2017-04-28 CRAN (R 3.4.1)                  
 httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                  
 jsonlite     1.5        2017-06-01 CRAN (R 3.4.1)                  
 knitr        1.20       2018-02-20 CRAN (R 3.4.3)                  
 magrittr     1.5        2014-11-22 CRAN (R 3.4.1)                  
 memoise      1.1.0      2017-04-21 CRAN (R 3.4.1)                  
 methods    * 3.4.4      2018-03-16 local                           
 openssl      1.0.1      2018-03-03 CRAN (R 3.4.3)                  
 pacman     * 0.4.6      2017-05-14 CRAN (R 3.4.1)                  
 pillar       1.2.1      2018-02-27 CRAN (R 3.4.3)                  
 pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.1)                  
 plyr         1.8.4      2016-06-08 CRAN (R 3.4.1)                  
 R6           2.2.2      2017-06-17 CRAN (R 3.4.1)                  
 Rcpp         0.12.16    2018-03-13 CRAN (R 3.4.3)                  
 rlang        0.2.0.9000 2018-03-19 Github (tidyverse/rlang@1b81816)
 rmarkdown    1.9        2018-03-01 CRAN (R 3.4.3)                  
 rorcid     * 0.4.1.9210 2018-04-05 Github (ropensci/rorcid@c393ad0)
 rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                  
 rscopus    * 0.5.3      2017-10-11 CRAN (R 3.4.2)                  
 stats      * 3.4.4      2018-03-16 local                           
 stringi      1.1.7      2018-03-12 CRAN (R 3.4.3)                  
 stringr      1.3.0      2018-02-19 CRAN (R 3.4.3)                  
 tibble       1.4.2      2018-01-22 CRAN (R 3.4.3)                  
 tools        3.4.4      2018-03-16 local                           
 triebeard    0.3.0      2016-08-04 CRAN (R 3.4.1)                  
 urltools     1.7.0      2018-01-20 CRAN (R 3.4.3)                  
 utils      * 3.4.4      2018-03-16 local                           
 withr        2.1.2      2018-03-19 Github (jimhester/withr@79d7b0d)
 xfun         0.1        2018-01-22 CRAN (R 3.4.3)                  
 yaml         2.1.18     2018-03-08 CRAN (R 3.4.3)  
sckott commented 6 years ago

thanks for the question @gorkang

@rcpeters another question for you. Seems like the journal name is shown in the ORCID UI for peer reviews, but I can't seem to find it in the API response either. Any guidance?

alainna commented 6 years ago

For peer reviews, the journal name (or publisher name, or organisation, etc) is going to be found in the group data.

https://pub.orcid.org/v2.1/0000-0001-7678-8656/peer-review/220419 ->

**issn:1939-2222**

The peer review won't necessarily be grouped by the publisher or journal name -- review groups can be as specific or general as the review posting party desires. Generally the convening organisation will also be the party which has organised the review, which could be e.g. the journal or publisher. However I notice for Publons that they list this as Publons -- we'll be following up with them on that.

Another example posted by AGU (GEMS) which has AGU listed as the convening party: https://pub.orcid.org/v2.1/0000-0002-7363-4552/peer-review/146242
sckott commented 6 years ago

thanks @alainna for that, that's what we needed: review-group-id. sorry i missed that

id = "0000-0001-7678-8656"
x = orcid_peer_reviews(id, put_code = "220419")[[1]]
rcrossref::cr_journals(strsplit(x$`review-group-id`, ":")[[1]][[2]])$data$title
#> [1] "Journal of Experimental Psychology General"
rcpeters commented 6 years ago

Just a note: group IDs are not required to be ISSNs.

  select split_part(group_id,':',1) as prefix, count(*) from group_id_record group by prefix;
       prefix      | count 
  -----------------+-------
   publons         |  1297
   orcid-generated |   100
   ringgold        |     1
   issn            | 13181
  (4 rows)
sckott commented 6 years ago

Thanks @rcpeters - well i guess we can try to detect if it's an ISSN, and if so, we can try to grab the journal name
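A minimal sketch of that detection step, assuming group IDs always look like `prefix:value` as in the counts above. `parse_group_id()` is a hypothetical helper, not part of rorcid, and the non-ISSN example values are made up; only the prefixes come from the query output.

```r
# Split an ORCID review-group-id such as "issn:1939-2222" into prefix and value,
# and flag the ones that look like ISSNs.
# parse_group_id() is a hypothetical helper, not a rorcid function.
parse_group_id <- function(group_id) {
  # Position of the first colon (-1 when there is none); as.integer() drops
  # the extra attributes that regexpr() attaches to its result.
  pos <- as.integer(regexpr(":", group_id, fixed = TRUE))
  has_colon <- pos > 0
  prefix <- ifelse(has_colon, substr(group_id, 1, pos - 1), NA_character_)
  value <- ifelse(has_colon, substr(group_id, pos + 1, nchar(group_id)), group_id)
  # Treat the value as an ISSN only when the prefix says so and the value has
  # the shape NNNN-NNNC, where C is a digit or X.
  is_issn <- !is.na(prefix) & tolower(prefix) == "issn" &
    grepl("^[0-9]{4}-[0-9]{3}[0-9Xx]$", value)
  data.frame(prefix = prefix, value = value, is_issn = is_issn,
             stringsAsFactors = FALSE)
}

# Illustrative IDs: the first is from this thread, the others are invented.
ids <- c("issn:1939-2222", "publons:12345", "orcid-generated:abc", "ringgold:1")
parsed <- parse_group_id(ids)
parsed$value[parsed$is_issn]
#> [1] "1939-2222"
```

Only the values flagged `is_issn` would then be passed on to a title lookup; the other group types need a different source for their names.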

sckott commented 6 years ago

@gorkang does this solution https://github.com/ropensci/rorcid/issues/52#issuecomment-379971887 work for you? I don't think we want to integrate rcrossref into this pkg, but we could document how to work with it to get publication title names. Thoughts?

gorkang commented 6 years ago

Thanks @sckott for checking back on this.

Yes, looking up the journal name by ISSN works, but it is very slow: it adds ~15s for each researcher I have (see code below).

get_orcid_reviews <- function(id) {

  # id = "0000-0001-7678-8656" #

  library(pacman)
  p_load(dplyr, rorcid)

  tictoc::tic()

  # Get reviews ---------------------------------------------------------------
  temp_reviews = orcid_peer_reviews(id)[[1]]$group$`peer-review-summary` %>%
    bind_rows()

  years_reviews = temp_reviews %>%
    # filter(`completion-date.year.value` >= from_year) %>% # we only ask for the records we need to minimize # of calls.
    pull(`completion-date.year.value`) #`put-code`

  # Get journal titles ------------------------------------------------------

  # Get put-codes
  put_codes = temp_reviews %>% pull(`put-code`)

  # Get details for reviews using put-codes
  list_orcid_reviews <- orcid_peer_reviews(id, put_code = put_codes)

  # Get issn
  issn_reviews = 1:length(list_orcid_reviews) %>% purrr::map(~strsplit(list_orcid_reviews[[.x]]$`review-group-id`, ":")[[1]][[2]]) %>% unlist()

  # Get journal name using issn
  journal_names = rcrossref::cr_journals(issn_reviews)$data$title

  # Tidy data ---------------------------------------------------------------

  df_reviews = years_reviews %>% as_tibble() %>%
    mutate(orcid_id = id) %>%
    left_join(df_orcid_names, by = "orcid_id") %>%
    rename(year = value) %>%
    mutate(journal_name = journal_names) %>%
    select(-other_names)

  tictoc::toc()
  df_reviews

}  

get_orcid_reviews( id = "0000-0001-7678-8656")

Taking those extra 15s for each researcher feels particularly wasteful as the journal name is on the ORCID website (but for some reason not in the ORCID data):

[screenshot from 2018-05-12 06-52-36]

Any idea to make it faster would be greatly appreciated.

Thanks!

sckott commented 6 years ago

@gorkang just took another look at this.

i can't replicate your function above because the object df_orcid_names is missing, but I think i have a solution.

I just added a dataset of ISSNs and journal titles gathered from Crossref. I still need to work out a process for updating it, or letting users do so, but it is much faster, e.g.,

system.time({
  id = "0000-0001-7678-8656"
  x = orcid_peer_reviews(id, put_code = "220419")[[1]]
  issn <- strsplit(x$`review-group-id`, ":")[[1]][[2]]
  rcrossref::cr_journals(issn)$data$title
})

 user  system elapsed
0.071   0.003   0.774

system.time({
  id = "0000-0001-7678-8656"
  x = orcid_peer_reviews(id, put_code = "220419")[[1]]
  issn <- strsplit(x$`review-group-id`, ":")[[1]][[2]]
  issn_title[[issn]]
})
 user  system elapsed
0.010   0.001   0.102
gorkang commented 6 years ago

Thanks @sckott for taking another look at this.

The new method does work better, but fails when the issn is not in issn_title.rda (btw, I had to download it manually. Maybe it does not load with the package?)

So, to solve the first point, I created a function to get the title with the best available method:

 get_title_from_issn <- function(issn) {
    load("issn_title.rda") # CHANGE PATH AS NEEDED
    tryCatch(issn_title[[issn]], error = function(e) {rcrossref::cr_journals(issn)$data$title})
  }
  journal_names = issn_reviews %>% purrr::map( ~ get_title_from_issn(.x)) %>% unlist()

In the specific case I am trying, there are 6 out of 20 issn not present in issn_title.rda. The time it takes goes down from ~31 to ~17 seconds.

Please, see the full code below. I adapted the get_orcid_reviews() function so you can select the "method" (new or old). Sorry for leaving df_orcid_names in the previous code. Now it should work.

get_orcid_reviews <- function(id, method = "new") {

  library(pacman)
  p_load(dplyr, rorcid)

  tictoc::tic()

  # Get reviews ---------------------------------------------------------------
  temp_reviews = orcid_peer_reviews(id)[[1]]$group$`peer-review-summary` %>%
    bind_rows() 

  years_reviews = temp_reviews %>% 
    # filter(`completion-date.year.value` >= from_year) %>% # we only ask for the records we need to minimize # of calls.
    pull(`completion-date.year.value`) #`put-code`

  # Get journal titles ------------------------------------------------------

  # Get put-codes
  put_codes = temp_reviews %>% pull(`put-code`)

  # Get details for reviews using put-codes
  list_orcid_reviews <- orcid_peer_reviews(id, put_code = put_codes)

  # Get issn
  issn_reviews = 1:length(list_orcid_reviews) %>% purrr::map(~strsplit(list_orcid_reviews[[.x]]$`review-group-id`, ":")[[1]][[2]]) %>% unlist()

  # GET JOURNAL NAMES -------------------

  # METHOD A (slow) Get journal name using issn
  if (method == "old") {
    journal_names = rcrossref::cr_journals(issn_reviews)$data$title

  # METHOD B (new) Get journal name using issn
  } else if (method == "new") {
      get_title_from_issn <- function(issn) {
        load("dev/BUGS/BUG - reviews slow/issn_title.rda")
        tryCatch(issn_title[[issn]], error = function(e) {rcrossref::cr_journals(issn)$data$title})
      }
      journal_names = issn_reviews %>% purrr::map( ~ get_title_from_issn(.x)) %>% unlist()
  }

  # Tidy data ---------------------------------------------------------------

  df_reviews = years_reviews %>% as_tibble() %>% 
    mutate(orcid_id = id) %>% 
    # left_join(df_orcid_names, by = "orcid_id") %>% 
    rename(year = value) %>% 
    mutate(journal_name = journal_names) #%>% select(-other_names)

  tictoc::toc()
  df_reviews

}  

get_orcid_reviews(id = "0000-0001-7678-8656", method = "old")

get_orcid_reviews(id = "0000-0001-7678-8656", method = "new")

Thanks!

sckott commented 6 years ago

sorry for the long delay in responding @gorkang - it's not clear from your last reply if you are happy with the changes, or if there are still some improvements we can make?

gorkang commented 6 years ago

No problem @sckott . Last time I checked, there were two problems:

1) The function failed when the issn was not in issn_title.rda
2) I had to download issn_title.rda manually

Cheers.

sckott commented 6 years ago

I'm not having that problem. just removed rorcid then reinstalled from github, loaded rorcid and issn_title is there in the session. will keep thinking about what the problem could be

gorkang commented 6 years ago

Regarding the first issue:

If an ISSN exists in issn_title, it works great. If it does not exist, it gives an error:

issn_title[["1939-2222"]]
[1] "Journal of Experimental Psychology General"

issn_title[["0000-2222"]]
Error in issn_title[["0000-2222"]] : subscript out of bounds

With a function such as the following, we can avoid the error:

  get_title_from_issn <- function(issn) {
    tryCatch(issn_title[[issn]], error = function(e) {rcrossref::cr_journals(issn)$data$title})
  }
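A possible variant of the same idea that avoids `tryCatch()`, sketched under the assumption that `issn_title` is a named character vector keyed by ISSN (the "subscript out of bounds" error above is what `[[ ]]` gives for an unknown name on such a vector). `titles_from_issn()` and `slow_lookup()` are hypothetical names, and the small `issn_title` here is a stand-in so the example runs without the package.

```r
# Stand-in for the issn_title dataset; the second entry is made up.
issn_title <- c(
  "1939-2222" = "Journal of Experimental Psychology General",
  "0000-0001" = "Hypothetical Journal A"
)

# titles_from_issn() is a hypothetical helper. `fallback` is any function that
# takes a character vector of ISSNs and returns one title per ISSN, for
# example a wrapper around rcrossref::cr_journals(). It is only called for
# ISSNs that are missing from `lookup`.
titles_from_issn <- function(issns, lookup, fallback = NULL) {
  # Single-bracket indexing by name returns NA for unknown names instead of
  # raising "subscript out of bounds" as [[ ]] does.
  titles <- unname(lookup[issns])
  missing <- is.na(titles)
  if (any(missing) && !is.null(fallback)) {
    titles[missing] <- fallback(issns[missing])
  }
  titles
}

# Placeholder for the slow network lookup, so the example stays offline.
slow_lookup <- function(issns) paste("looked up", issns)

titles_from_issn(c("1939-2222", "0000-2222"), issn_title, slow_lookup)
#> [1] "Journal of Experimental Psychology General"
#> [2] "looked up 0000-2222"
```

Because the local lookup is vectorised, the fallback sees all missing ISSNs in one call rather than one call per ISSN.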

Regarding the second issue: after uninstalling via the GUI it wasn't working, but using the remove.packages() function worked:

remove.packages("rorcid")
devtools::install_github("ropensci/rorcid")
library('rorcid')

Also, a final comment: for a single researcher with 20 review records (6 not in the issn_title file), it takes about 10s to fetch the journal titles. It is much better than the ~30s it used to take, but hopefully, there is still some room for improvement.
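One thing that might help, sketched here as an assumption rather than a measured fix: if several of the review records share a journal, each distinct ISSN only needs to be looked up once. `lookup_unique()` and `counting_lookup()` are hypothetical names, and the counter stands in for a slow network call.

```r
# lookup_unique() is a hypothetical wrapper: it calls `lookup_fun` once per
# distinct ISSN and maps the results back onto the original order.
lookup_unique <- function(issns, lookup_fun) {
  distinct <- unique(issns)
  titles <- vapply(distinct, lookup_fun, character(1), USE.NAMES = FALSE)
  titles[match(issns, distinct)]
}

# Placeholder for a slow lookup; it counts how often it is called.
calls <- 0
counting_lookup <- function(issn) {
  calls <<- calls + 1
  paste("title for", issn)
}

issns <- c("1939-2222", "0000-2222", "1939-2222", "1939-2222")
titles <- lookup_unique(issns, counting_lookup)
calls
#> [1] 2
```

How much time this saves depends entirely on how many ISSNs repeat within a researcher's records; with no repeats it makes no difference.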

Thanks!

sckott commented 6 years ago

thanks - i'll take another look at the issn issue.

> hopefully, there is still some room for improvement.

we'll continue to look for performance improvements 👍

sckott commented 6 years ago

note: still no ISSNs in the Crossref API /journals route, so can't work on update flow for the issn titles dataset

sckott commented 5 years ago

closing for now - added the script for updating the issn_title dataset in inst/ignore/issn_title_collect.R