ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Usage of `oa_generate()` #223

Closed rkrug closed 3 months ago

rkrug commented 3 months ago

I have a question about the usage of the oa_generate() function.

I will use your example:

query_url <- "https://api.openalex.org/works?filter=cites%3AW2755950973"
oar <- oa_generate(query_url, verbose = TRUE)

I have two questions:

  1. How can I find out how many elements / records there are to iterate over them?
  2. This is related - how can I do something after each block of 1000 references? I want to save each block of 1000 references to a file.

Thanks for any hints (I did not see anything in the coro documentation).

rkrug commented 3 months ago

I think there is a missing implementation of the exhaustion sentinel in the oa_generate function.

When I run the attached example, I would expect it, as shown in the coro example at https://coro.r-lib.org/reference/collect.html#ref-examples, to loop over all existing records and return gracefully once the end is reached. This is not the case with oa_generate(), which raises an error at the end.

The error is raised here: https://github.com/ropensci/openalexR/blob/558581c6dbb43c65cd2003be8545e88fd4ed4ef7/R/coro.R#L78

library(openalexR)
#> Thank you for using openalexR!
#> To acknowledge our work, please cite the package by calling `citation("openalexR")`.
#> To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
library(coro)
query_url <- "https://api.openalex.org/works?page=1&filter=authorships.author.id:a5056969703"
oar <- oa_generate(query_url, verbose = TRUE)
loop(for (x in oar) print(x$id))
#> Getting record 1 of 39 records...
#> [1] "https://openalex.org/W2106507833"
#> Getting record 2 of 39 records...
#> [1] "https://openalex.org/W2199677616"
#> Getting record 3 of 39 records...
#> [1] "https://openalex.org/W2162602285"
#> Getting record 4 of 39 records...
#> [1] "https://openalex.org/W2159666282"
#> Getting record 5 of 39 records...
#> [1] "https://openalex.org/W2976645051"
#> Getting record 6 of 39 records...
#> [1] "https://openalex.org/W1983401307"
#> Getting record 7 of 39 records...
#> [1] "https://openalex.org/W2157059322"
#> Getting record 8 of 39 records...
#> [1] "https://openalex.org/W2038407429"
#> Getting record 9 of 39 records...
#> [1] "https://openalex.org/W4365503086"
#> Getting record 10 of 39 records...
#> [1] "https://openalex.org/W1969325317"
#> Getting record 11 of 39 records...
#> [1] "https://openalex.org/W2882976929"
#> Getting record 12 of 39 records...
#> [1] "https://openalex.org/W2123594916"
#> Getting record 13 of 39 records...
#> [1] "https://openalex.org/W2046185497"
#> Getting record 14 of 39 records...
#> [1] "https://openalex.org/W2024784917"
#> Getting record 15 of 39 records...
#> [1] "https://openalex.org/W2023427813"
#> Getting record 16 of 39 records...
#> [1] "https://openalex.org/W1537966574"
#> Getting record 17 of 39 records...
#> [1] "https://openalex.org/W4319299492"
#> Getting record 18 of 39 records...
#> [1] "https://openalex.org/W4327808593"
#> Getting record 19 of 39 records...
#> [1] "https://openalex.org/W4387613452"
#> Getting record 20 of 39 records...
#> [1] "https://openalex.org/W20571766"
#> Getting record 21 of 39 records...
#> [1] "https://openalex.org/W4282595084"
#> Getting record 22 of 39 records...
#> [1] "https://openalex.org/W2038641408"
#> Getting record 23 of 39 records...
#> [1] "https://openalex.org/W56317627"
#> Getting record 24 of 39 records...
#> [1] "https://openalex.org/W101394771"
#> Getting record 25 of 39 records...
#> [1] "https://openalex.org/W2098352258"
#> Getting record 26 of 39 records...
#> [1] "https://openalex.org/W1677797717"
#> Getting record 27 of 39 records...
#> [1] "https://openalex.org/W2025168884"
#> Getting record 28 of 39 records...
#> [1] "https://openalex.org/W815697796"
#> Getting record 29 of 39 records...
#> [1] "https://openalex.org/W3209374073"
#> Getting record 30 of 39 records...
#> [1] "https://openalex.org/W3212083376"
#> Getting record 31 of 39 records...
#> [1] "https://openalex.org/W1516636941"
#> Getting record 32 of 39 records...
#> [1] "https://openalex.org/W1549034740"
#> Getting record 33 of 39 records...
#> [1] "https://openalex.org/W3093581611"
#> Getting record 34 of 39 records...
#> [1] "https://openalex.org/W3113384738"
#> Getting record 35 of 39 records...
#> [1] "https://openalex.org/W3211283666"
#> Getting record 36 of 39 records...
#> [1] "https://openalex.org/W3131237577"
#> Getting record 37 of 39 records...
#> [1] "https://openalex.org/W3177179003"
#> Getting record 38 of 39 records...
#> [1] "https://openalex.org/W1521474490"
#> Getting record 39 of 39 records...
#> [1] "https://openalex.org/W1524143971"
#> Getting record 40 of 39 records...
#> Error in res[[result_name]][[(i - 1)%%200 + 1]]: subscript out of bounds

Created on 2024-03-23 with reprex v2.1.0

rkrug commented 3 months ago

Even earlier, before

https://github.com/ropensci/openalexR/blob/558581c6dbb43c65cd2003be8545e88fd4ed4ef7/R/coro.R#L70C1-L71C1

one could check if i > n_times and, if TRUE, return the exhaustion sentinel:

if (i > n_times) {
  return(coro::exhausted())
}

This is untested, but I assume this is how it should be used, as it is done e.g. in https://github.com/r-lib/coro/blob/c29ba7d0145ddb20e8a8857b6591f73ac18aa8d2/R/generator.R#L187-L189
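To make the suggested pattern concrete, here is a minimal self-contained sketch (no openalexR involved): a hand-rolled generator, i.e. a plain closure like the one oa_generate() returns, that yields values and returns the exhaustion sentinel once it runs out, so coro::loop() stops gracefully instead of erroring. make_counter and n_times are illustrative names, not part of either package.

```r
library(coro)

# A hand-rolled generator: a closure that yields 1, 2, ..., n_times
# and then returns coro::exhausted(), the sentinel that tells
# coro::loop() / coro::collect() to stop gracefully.
make_counter <- function(n_times) {
  i <- 0
  function() {
    i <<- i + 1
    if (i > n_times) {
      return(coro::exhausted())
    }
    i
  }
}

g <- make_counter(3)
loop(for (x in g) print(x))
#> [1] 1
#> [1] 2
#> [1] 3
```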

trangdata commented 3 months ago

Thank you for pointing out this issue @rkrug!

I introduced a bug in #184 while trying to account for the case of group_by. #224 should resolve this issue.

how can I do something after each block of 1000 references? I want to save 1000 references into a file.

I added an example to the docs that should answer your question (for blocks of 100 records in the example).

How can I find out how many elements / records there are to iterate over them?

For the case without group_by, if you set verbose = TRUE, when you call oar() for the first time, you'll see how many records you have in total.

With group_by, unfortunately we don't have a way to know the total until we query until exhausted.
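For the group_by case, one (untested) workaround is simply to consume the generator and count as you go, assuming oar was created with oa_generate() on a group_by query and that network access is available:

```r
# Sketch: count groups by exhausting the generator. The total is only
# known once the generator is exhausted, which is exactly the limitation
# described above.
n <- 0
coro::loop(for (x in oar) n <- n + 1)
n # total number of groups returned
```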

rkrug commented 3 months ago

Thanks for responding so quickly. Now loop() from coro could be used - correct? That would also be a good example.

In the case of group_by, I assume there is no error at the end? Otherwise, using a try() block would be an option?

trangdata commented 3 months ago

In the case of group_by, I assume that there is no error at the end? Otherwise using a try() block would be an option?

oa_generate takes care of this, so we can still go until exhausted. We just don't know in advance how many groups there are in total.

Now the loop() from Coro could be used - correct?

Yes! I can add this example as well.

trangdata commented 3 months ago

@rkrug I added the loop example for your use case (Example 3). Please let me know if you run into any trouble running them in that branch.

rkrug commented 3 months ago

@trangdata Thanks a lot - looks great. A few minor points:

  1. The examples do not show when calling ?oa_generate. They are not in man/oa_generate.Rd.
  2. I have a slightly different version of example 3 now:
            title_and_abstract_search <- "biodiversity AND nuclear"
            set_size <- 100

            output_path <- tempfile()
            dir.create(output_path)

            oar <- openalexR::oa_query(
                title_and_abstract.search = title_and_abstract_search
            ) |>
                oa_generate(
                    verbose = TRUE
                )

            set <- NULL
            set_no <- 0

            coro::loop(
                for (x in oar) {
                    set <- c(set, list(x))
                    if (length(set) >= set_size || identical(x, coro::exhausted())) {
                        saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))
                        set <- NULL # reset the current set
                        set_no <- set_no + 1
                    }
                }
            )
            ### and save the last, possibly partial, set
            saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))

            list.files(output_path)

This is extracted from the function corpus_download() which is at https://github.com/IPBES-Data/IPBES.R/blob/df06efe01856e5474ff0d84cbe75b702308d29a8/R/corpus_download.r#L93-L136

trangdata commented 3 months ago

Ah, thanks, you're right! I will regenerate the docs!

I would preallocate set so you don't have to reallocate memory for the growing data structure. This will help speed up the code, especially if your set_size is large: set <- vector("list", set_size), and then within the loop, set[[i]] <- x. Also, you may want to take advantage of coro::is_exhausted() for a more canonical condition.
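A sketch of how the preallocation suggestion could look, adapted to the loop above (untested; oar and output_path are assumed to exist as in the earlier example):

```r
set_size <- 100
set <- vector("list", set_size) # preallocate once
i <- 0
set_no <- 0

coro::loop(
    for (x in oar) {
        i <- i + 1
        set[[i]] <- x # fill the preallocated slot instead of growing set
        if (i == set_size) {
            saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))
            set <- vector("list", set_size) # fresh preallocated list
            i <- 0
            set_no <- set_no + 1
        }
    }
)
### save the remainder, dropping the unused NULL slots
if (i > 0) {
    saveRDS(set[seq_len(i)], file.path(output_path, paste0("set_", set_no, ".rds")))
}
```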

rkrug commented 3 months ago

Both points are true -

But when I save the last set, I have to drop the unallocated works (the NULL entries), and the code is much simpler as it is. Also, since preallocation only reserves memory for the empty list structure, filling it with the actual works will still trigger allocations (I assume), as these are not just pointers? But anyway, given the size of set, I think the bottleneck is downloading the works, not the re-allocation.

I will look into coro::is_exhausted() - makes sense.