ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Refactor and optimization #132

Closed by trangdata 10 months ago

trangdata commented 1 year ago

This will need more extensive testing...

Some cleanup and optimization so far:

Related: #129

trangdata commented 10 months ago

Re #127: the user can now shorten very long URLs in the verbose output by setting the environment variable openalexR.print to the maximum number of characters of the query URL to print:

library(openalexR)

w <- function() {
  oa_fetch(
    entity = "works",
    title.search = c("bibliometric analysis", "science mapping"),
    cited_by_count = ">50",
    options = list(select = "id"),
    from_publication_date = "2021-01-01",
    to_publication_date = "2021-12-31",
    verbose = TRUE
  )
}

w0 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2021-01-01%2Cto_publication_date%3A2021-12-31&select=id
#> Getting 1 page of results with a total of 63 records...
Sys.setenv(openalexR.print = 70)
w1 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20an...
#> Getting 1 page of results with a total of 63 records...
Sys.unsetenv("openalexR.print")
w2 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2021-01-01%2Cto_publication_date%3A2021-12-31&select=id
#> Getting 1 page of results with a total of 63 records...

Created on 2023-10-24 with reprex v2.0.2
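For illustration, the truncation behaviour shown above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `truncate_for_print` is a hypothetical helper name, and the only thing taken from the package is the `openalexR.print` environment variable.

```r
# Hypothetical helper (not the package's internal code): shorten a URL for
# printing when the `openalexR.print` environment variable is set.
truncate_for_print <- function(url) {
  # Read the limit; an unset or non-numeric value yields NA, meaning "no limit".
  limit <- suppressWarnings(
    as.integer(Sys.getenv("openalexR.print", unset = NA))
  )
  if (is.na(limit) || nchar(url) <= limit) {
    return(url)
  }
  # Keep the first `limit` characters and mark the cut with an ellipsis.
  paste0(substr(url, 1, limit), "...")
}

url <- "https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis"
Sys.setenv(openalexR.print = 70)
truncate_for_print(url) # truncated to 70 characters plus "..."
Sys.unsetenv("openalexR.print")
truncate_for_print(url) # printed in full
```

Reading the limit at print time, rather than once at load time, is what lets `Sys.setenv()` and `Sys.unsetenv()` take effect between calls, as in the reprex above.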

trangdata commented 10 months ago

Re #129: Previously, oa_snowball could take a long time. This refactor removes the use of simple_rapply and improves speed.

Previously:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2'
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW2057540892%7CW2071641039%7CW2528237503%7CW4255644834%7CW2039776320%7CW1998173837%7CW2894916677%7CW4205808956%7CW4292916519%7CW2210922255%7CW2123690481%7CW2074469351%7CW4378553964%7CW2321856033%7CW2439084087%7CW2294799430%7CW2966056779%7CW1424334985%7CW2425037722
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW1998173837%7CW2057540892%7CW2321856033%7CW4205808956%7CW2210922255%7CW2294799430%7CW2425037722%7CW2123690481%7CW1424334985%7CW2039776320%7CW2074469351%7CW2439084087%7CW2528237503%7CW2752822653%7CW2894916677%7CW2966056779%7CW4255644834%7CW4292916519%7CW4378553964
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW1998173837%7CW2057540892%7CW2321856033%7CW4205808956%7CW2210922255%7CW2294799430%7CW2425037722%7CW2123690481%7CW1424334985%7CW2039776320%7CW2074469351%7CW2439084087%7CW2528237503%7CW2752822653%7CW2894916677%7CW2966056779%7CW4255644834%7CW4292916519%7CW4378553964
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   3.672   0.060  11.042

Now:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2.9999'
Sys.setenv(openalexR.print = 70)
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW205754...
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW199817383...
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW199817...
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   2.089   0.049   4.103

We can make it faster still by requesting only the fields we need in oa_snowball with options = list(select = c("id", "display_name", "authorships", "referenced_works")). Note that the newest implementation allows different options for the core papers, the citing papers, and the cited_by papers, so these options must be specified separately, like so:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2.9999'
Sys.setenv(openalexR.print = 70)
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
my_opts <- list(select = c("id", "display_name", "authorships", "referenced_works"))
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    options = my_opts,
    citing_params = list(options = my_opts),
    cited_by_params = list(options = my_opts),
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW205754...
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW199817383...
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW199817...
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   0.898   0.016   2.075

Created on 2023-10-24 with reprex v2.0.2
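To put the three elapsed times above side by side, here is a quick back-of-the-envelope calculation. The numbers are copied from the reprex output; each is a single run that depends on network conditions, so the ratios are only indicative. The vector names are labels chosen for this sketch.

```r
# Elapsed seconds copied from the three system.time() outputs above.
elapsed <- c(
  v1.2.2          = 11.042, # before the refactor
  refactor        = 4.103,  # after the refactor, default fields
  refactor_select = 2.075   # after the refactor, with `select`
)

# Speedup of each run relative to the pre-refactor timing.
speedup <- round(elapsed[["v1.2.2"]] / elapsed, 2)
speedup

# Extra gain from `select` alone, on top of the refactor.
select_gain <- round(elapsed[["refactor"]] / elapsed[["refactor_select"]], 2)
select_gain
```

On these single runs, the refactor alone is roughly 2.7 times faster, and adding `select` brings it to roughly 5.3 times faster than 1.2.2.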

rkrug commented 10 months ago

The specification of the fields seems to make a huge difference. Great.

trangdata commented 10 months ago

one test threw a warning

~Oh man, did OpenAlex change its author IDs again? I'll check. All tests ran fine two days ago, so I'm not sure why A2208157607 and A923435168 are no longer valid author IDs.~ Hmm... so I think what happened is that I wasn't thorough enough in my update of author IDs in #167. Will update these IDs now.