Closed rkrug closed 3 months ago
I think the `oa_generate()` function is missing an implementation of the exhaustion sentinel.
When I run the attached example, I would expect it, as shown in the coro example at https://coro.r-lib.org/reference/collect.html#ref-examples, to loop over all existing records and return gracefully when the end is reached. This is not the case with oa_generate(), which raises an error at the end.
The error is raised here: https://github.com/ropensci/openalexR/blob/558581c6dbb43c65cd2003be8545e88fd4ed4ef7/R/coro.R#L78
library(openalexR)
#> Thank you for using openalexR!
#> To acknowledge our work, please cite the package by calling `citation("openalexR")`.
#> To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
library(coro)
query_url <- "https://api.openalex.org/works?page=1&filter=authorships.author.id:a5056969703"
oar <- oa_generate(query_url, verbose = TRUE)
loop(for (x in oar) print(x$id))
#> Getting record 1 of 39 records...
#> [1] "https://openalex.org/W2106507833"
#> Getting record 2 of 39 records...
#> [1] "https://openalex.org/W2199677616"
#> Getting record 3 of 39 records...
#> [1] "https://openalex.org/W2162602285"
#> Getting record 4 of 39 records...
#> [1] "https://openalex.org/W2159666282"
#> Getting record 5 of 39 records...
#> [1] "https://openalex.org/W2976645051"
#> Getting record 6 of 39 records...
#> [1] "https://openalex.org/W1983401307"
#> Getting record 7 of 39 records...
#> [1] "https://openalex.org/W2157059322"
#> Getting record 8 of 39 records...
#> [1] "https://openalex.org/W2038407429"
#> Getting record 9 of 39 records...
#> [1] "https://openalex.org/W4365503086"
#> Getting record 10 of 39 records...
#> [1] "https://openalex.org/W1969325317"
#> Getting record 11 of 39 records...
#> [1] "https://openalex.org/W2882976929"
#> Getting record 12 of 39 records...
#> [1] "https://openalex.org/W2123594916"
#> Getting record 13 of 39 records...
#> [1] "https://openalex.org/W2046185497"
#> Getting record 14 of 39 records...
#> [1] "https://openalex.org/W2024784917"
#> Getting record 15 of 39 records...
#> [1] "https://openalex.org/W2023427813"
#> Getting record 16 of 39 records...
#> [1] "https://openalex.org/W1537966574"
#> Getting record 17 of 39 records...
#> [1] "https://openalex.org/W4319299492"
#> Getting record 18 of 39 records...
#> [1] "https://openalex.org/W4327808593"
#> Getting record 19 of 39 records...
#> [1] "https://openalex.org/W4387613452"
#> Getting record 20 of 39 records...
#> [1] "https://openalex.org/W20571766"
#> Getting record 21 of 39 records...
#> [1] "https://openalex.org/W4282595084"
#> Getting record 22 of 39 records...
#> [1] "https://openalex.org/W2038641408"
#> Getting record 23 of 39 records...
#> [1] "https://openalex.org/W56317627"
#> Getting record 24 of 39 records...
#> [1] "https://openalex.org/W101394771"
#> Getting record 25 of 39 records...
#> [1] "https://openalex.org/W2098352258"
#> Getting record 26 of 39 records...
#> [1] "https://openalex.org/W1677797717"
#> Getting record 27 of 39 records...
#> [1] "https://openalex.org/W2025168884"
#> Getting record 28 of 39 records...
#> [1] "https://openalex.org/W815697796"
#> Getting record 29 of 39 records...
#> [1] "https://openalex.org/W3209374073"
#> Getting record 30 of 39 records...
#> [1] "https://openalex.org/W3212083376"
#> Getting record 31 of 39 records...
#> [1] "https://openalex.org/W1516636941"
#> Getting record 32 of 39 records...
#> [1] "https://openalex.org/W1549034740"
#> Getting record 33 of 39 records...
#> [1] "https://openalex.org/W3093581611"
#> Getting record 34 of 39 records...
#> [1] "https://openalex.org/W3113384738"
#> Getting record 35 of 39 records...
#> [1] "https://openalex.org/W3211283666"
#> Getting record 36 of 39 records...
#> [1] "https://openalex.org/W3131237577"
#> Getting record 37 of 39 records...
#> [1] "https://openalex.org/W3177179003"
#> Getting record 38 of 39 records...
#> [1] "https://openalex.org/W1521474490"
#> Getting record 39 of 39 records...
#> [1] "https://openalex.org/W1524143971"
#> Getting record 40 of 39 records...
#> Error in res[[result_name]][[(i - 1)%%200 + 1]]: subscript out of bounds
Created on 2024-03-23 with reprex v2.1.0
Even earlier, one could check whether `i > n_times` and, if TRUE, return the exhaustion sentinel:
if (i > n_times) {
  return(coro::exhausted())
}
This is untested, but I assume this is how it should be used, as it is used e.g. in https://github.com/r-lib/coro/blob/c29ba7d0145ddb20e8a8857b6591f73ac18aa8d2/R/generator.R#L187-L189
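For illustration, here is a toy iterator in plain R (not openalexR code; a sketch assuming coro's iterator protocol for plain functions) showing the convention: `coro::loop()` exits cleanly as soon as the iterator function returns `coro::exhausted()`, rather than erroring past the last element.

```r
library(coro)

# Toy counterpart of oa_generate(): an iterator is just a function that
# returns one value per call and coro::exhausted() once it runs out.
make_counter <- function(n) {
  i <- 0
  function() {
    i <<- i + 1
    if (i > n) {
      return(exhausted())  # sentinel: loop()/collect() stop here gracefully
    }
    i
  }
}

it <- make_counter(3)
loop(for (x in it) print(x))  # iterates until the sentinel, no error
```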
Thank you for pointing out this issue @rkrug!
I introduced a bug in #184 while trying to account for the case of group_by. #224 should resolve this issue.
How can I do something after each block of 1000 references? I want to save 1000 references into a file.
I added an example to the doc that should answer your question (for blocks of 100 records in the example).
How can I find out how many elements / records there are to iterate over them?
For the case without group_by, if you set verbose = TRUE, you'll see how many records you have in total when you call oar() for the first time.
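As a sketch (untested, requires network access; `query_url` taken from the reprex above), the first call to the generator is where the total is printed:

```r
library(openalexR)

# With verbose = TRUE, the first call prints a message of the form
# "Getting record 1 of N records...", which reveals the total N.
query_url <- "https://api.openalex.org/works?page=1&filter=authorships.author.id:a5056969703"
oar <- oa_generate(query_url, verbose = TRUE)
first <- oar()  # side effect: the message shows the total record count
```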
With group_by, unfortunately we don't have a way to know the total until we query until exhausted.
Thanks for responding so quickly. Now the loop() from Coro could be used - correct? That would also be a good example.
In the case of group_by, I assume that there is no error at the end? Otherwise using a try() block would be an option?
oa_generate takes care of this, so we can still go until exhausted. We just don't know in advance how many groups there are in total.
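A minimal sketch (untested) of iterating a group_by generator by hand until the sentinel, using `coro::is_exhausted()`; `oar` is assumed to be a generator created as above:

```r
# Collect groups until oa_generate() signals exhaustion; only then is the
# total number of groups known.
groups <- list()
repeat {
  x <- oar()
  if (coro::is_exhausted(x)) break
  groups[[length(groups) + 1]] <- x
}
length(groups)  # the total, known only after exhaustion
```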
Now the loop() from Coro could be used - correct?
Yes! I can add this example as well.
@rkrug I added the loop example for your use case (Example 3). Please let me know if you run into any trouble running them in that branch.
@trangdata Thanks a lot - looks great. A few minor points:
The examples appear under `?oa_generate`, but they are not in man/oa_generate.Rd.
title_and_abstract_search <- "biodiversity AND nuclear"
set_size <- 100
output_path <- tempfile()
dir.create(output_path)
oar <- openalexR::oa_query(
  title_and_abstract.search = title_and_abstract_search
) |>
  oa_generate(
    verbose = TRUE
  )
set <- NULL
set_no <- 0
coro::loop(
for (x in oar) {
set <- c(set, list(x))
if ((length(set) >= set_size) | isTRUE(x == coro::exhausted())) {
saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))
set <- NULL # reset recs
set_no <- set_no + 1
}
}
)
### and save the last one
saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))
list.files(output_path)
This is extracted from the function corpus_download()
which is at https://github.com/IPBES-Data/IPBES.R/blob/df06efe01856e5474ff0d84cbe75b702308d29a8/R/corpus_download.r#L93-L136
Ah, thanks, you're right! I will regenerate the docs!
I would preallocate `set` so you don't have to reallocate memory for the growing data structure. This will help speed up the code, especially if your `set_size` is large. So `set <- vector("list", set_size)`, and then within the loop: `set[[i]] <- x`. Also, you may want to take advantage of the `coro::is_exhausted()` function for a more canonical condition.
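Putting both suggestions together, a sketch (untested; assumes `oar`, `set_size`, and `output_path` as defined in the snippet above) might look like:

```r
set <- vector("list", set_size)  # preallocate once
i <- 0
set_no <- 0
coro::loop(
  for (x in oar) {
    i <- i + 1
    set[[i]] <- x
    if (i >= set_size) {
      saveRDS(set, file.path(output_path, paste0("set_", set_no, ".rds")))
      set <- vector("list", set_size)  # fresh preallocated list
      i <- 0
      set_no <- set_no + 1
    }
  }
)
# Save the final, partially filled set, dropping the unused NULL slots.
if (i > 0) {
  saveRDS(set[seq_len(i)], file.path(output_path, paste0("set_", set_no, ".rds")))
}
```

Note that inside `coro::loop()` the sentinel never reaches the loop body, so the exhaustion check from the original snippet is only needed when calling `oar()` manually.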
Both points true. But when I save the last set, I have to delete the unallocated works (NULL), and the code is much simpler like this. Also, as the preallocation only pre-allocates the memory for the empty list structure, the memory need will lead to a re-allocation (I assume), as these are not pointers? But anyway, I think with the given size of `set`, the bottleneck is the download of the works and not the re-allocation.
I will look into coro::is_exhausted() - makes sense.
I have a question about the usage of the
oa_generate()
function. I will use your example:
I have two questions:
Thanks for any hints (I did not see anything in the coro documentation).