ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/
Other
89 stars 19 forks source link

improve paging control #183

Closed trangdata closed 9 months ago

trangdata commented 9 months ago

Resolves #166.

Hi @rkrug, could you try installing this branch and see if this works for your use case in #166?

# install.packages("remotes")
remotes::install_github("ropensci/openalexR@generalize-paging")

For example, what you can now do is specifying the pages:

chunk_size <- 5
for (i in 1:3){
  start <- chunk_size*(i-1) + 1
  end <- chunk_size*i
  oa_fetch(
    search = "transformative change", 
    pages = start:end,
    per_page = 20,
    verbose = TRUE
  ) |>
    saveRDS(file = sprintf("tfc-nature_p%s-%s.rds", start, end))
}

One thing I'd like to note is that concatenating these rds files later may still raise memory issues if you're trying to do this in R. I recommend checking out arrow::write_parquet to save these outputs as parquet files which would likely make it easier to combine later, potentially outside of R.

rkrug commented 9 months ago

@trangdata Thanks a lot - this looks good, but there is one shortcoming: I have to do it essentially manually. I have to get the number of pages (an example should show how) and then iterate through the pages. I still think, that an automatic saving (and I will come to the format in a second) would be the best.

The following should work for an automatic workflow:

  1. add argument save_pages_directory, which you can set if you want to save automatically the individual pages. There are many options one could fine tune this.
  2. add second argument save_only which if TRUE only saves and returns a character string containing the directory name whee the pages are saved - this would make it possible, to use a temporary directory, and does not try to concatenate all results.

Regarding the format of saving: I was also thinking of arrow - and I would personally use it, but I do not think it is suitable for the default. That is why I included writeRDS(). But arrow sliced by page would be perfect. One could always, if preferred, re-slice it later by e.g. year.

To offer the flexibility of saving in different formats, one could introduce in oa_request()an argumentsave_function = saveRDS()which specifies how to save. This function should have the signaturefunction(x, file, ...)where onlyxis andfile` is used. This also gives the possibility for more complex saving mechanisms.

But these are many options, and I think that where and how these parameter / options are set should be discussed in the new discussion.

One final point: I see the pages argument as a power user feature, and the oa_fetch() as a "normal" user command. I would suggest to consider leave oa_fetch() as it is, and put the new features in the function oa_request(), and mention to use oa_query() |> oa_request() instead of oa_fetch() when power user features are needed.

trangdata commented 9 months ago

I have to get the number of pages (an example should show how) and then iterate through the pages.

I agree. We should include an example to calculate the expected number of pages. Perhaps you could contribute a vignette on "Paging control" given your use case?

I would suggest to consider leave oa_fetch() as it is, and put the new features in the function oa_request(), and mention to use oa_query() |> oa_request() instead of oa_fetch() when power user features are needed.

I think we want to stay consistent with the behavior of oa_fetch in general, i.e. it's simply a wrapper around oa_query and oa_request. It inherits almost all parameters from these lower-level functions. Again, a conversation about which "power user" params should go into options should continue in #182, but I'm currently against making any parameters exclusive to oa_request.

On your larger point of providing an option for saving the individual pages, I still want to keep any serialization/IO stuff outside of the package. Especially with the flexibility that we would potentially have to support: directory, file name/type, save functions, number of pages to save in each iteration, and other options like you said, I think it's best to leave this on the user's side of things.