trangdata closed this 9 months ago
@trangdata Thanks a lot, this looks good, but there is one shortcoming: I have to do it essentially manually. I have to get the number of pages (an example should show how) and then iterate through the pages. I still think that automatic saving (I will come to the format in a second) would be best.
The following should work for an automatic workflow:

- `save_pages_directory`, which you can set if you want to save the individual pages automatically. There are many options with which one could fine-tune this.
- `save_only`, which, if `TRUE`, only saves and returns a character string containing the name of the directory where the pages are saved. This would make it possible to use a temporary directory, and it does not try to concatenate all results.

Regarding the format of saving: I was also thinking of `arrow`, and I would personally use it, but I do not think it is suitable as the default. That is why I included `saveRDS()`. But `arrow` sliced by page would be perfect. One could always, if preferred, re-slice it later by e.g. year.

To offer the flexibility of saving in different formats, one could introduce in `oa_request()` an argument `save_function = saveRDS` which specifies how to save. This function should have the signature `function(x, file, ...)`, where only `x` and `file` are used. This also gives the possibility for more complex saving mechanisms.
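A minimal sketch of what the proposed `function(x, file, ...)` contract would allow, using base R only. The helper `save_page()`, its arguments, and the file naming are hypothetical and only illustrate the idea; none of this is part of openalexR.

```r
# Hypothetical helper illustrating the proposed contract: any function with
# the signature function(x, file, ...) can be plugged in as `save_function`.
save_page <- function(x, page, dir, save_function = saveRDS, ext = "rds", ...) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  file <- file.path(dir, sprintf("page_%05d.%s", page, ext))
  save_function(x, file, ...)
  invisible(file)
}

# Stand-in for one page of results (not real OpenAlex data)
page_1 <- data.frame(id = c("W1", "W2"), year = c(2020L, 2021L))
out_dir <- file.path(tempdir(), "save_function_demo")

# Default: base R serialization via saveRDS()
rds_file <- save_page(page_1, page = 1, dir = out_dir)

# A custom saver with the same (x, file, ...) signature, here writing CSV
save_csv <- function(x, file, ...) {
  utils::write.csv(x, file, row.names = FALSE, ...)
}
csv_file <- save_page(page_1, page = 1, dir = out_dir,
                      save_function = save_csv, ext = "csv")
```

Because the saver only needs to accept `x` and `file`, a parquet or database writer could be swapped in the same way.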
But these are many options, and I think that where and how these parameters/options are set should be discussed in the new discussion.
One final point: I see the `pages` argument as a power-user feature, and `oa_fetch()` as a "normal" user command. I would suggest leaving `oa_fetch()` as it is, putting the new features in `oa_request()`, and mentioning that `oa_query() |> oa_request()` should be used instead of `oa_fetch()` when power-user features are needed.
> I have to get the number of pages (an example should show how) and then iterate through the pages.
I agree. We should include an example to calculate the expected number of pages. Perhaps you could contribute a vignette on "Paging control" given your use case?
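A minimal sketch of how the expected number of pages could be calculated, assuming the total number of matching records and the page size are already known (for example from a count-only request). The helper names `expected_pages()` and `page_batches()`, and the default of 200 records per page, are assumptions made for illustration and are not part of openalexR.

```r
# Hypothetical helper: number of pages needed to retrieve `count` records
# when each page holds at most `per_page` records.
expected_pages <- function(count, per_page = 200) {
  stopifnot(count >= 0, per_page >= 1)
  as.integer(ceiling(count / per_page))
}

# Hypothetical helper: split page numbers 1..n_pages into batches of at most
# `batch_size` pages, so each batch can be requested and saved separately.
page_batches <- function(n_pages, batch_size = 10) {
  if (n_pages == 0) return(list())
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / batch_size)))
}

n <- expected_pages(count = 4321, per_page = 200)
batches <- page_batches(n, batch_size = 10)
```

Each element of `batches` is an integer vector of page numbers that could then be passed on to a paged request.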
> I would suggest to consider leave oa_fetch() as it is, and put the new features in the function oa_request(), and mention to use oa_query() |> oa_request() instead of oa_fetch() when power user features are needed.
I think we want to stay consistent with the behavior of `oa_fetch` in general, i.e. it's simply a wrapper around `oa_query` and `oa_request`. It inherits almost all parameters from these lower-level functions. Again, a conversation about which "power user" params should go into `options` should continue in #182, but I'm currently against making any parameters exclusive to `oa_request`.
On your larger point of providing an option for saving the individual pages, I still want to keep any serialization/IO stuff outside of the package. Especially with the flexibility that we would potentially have to support: directory, file name/type, save functions, number of pages to save in each iteration, and other options like you said, I think it's best to leave this on the user's side of things.
Resolves #166.
Hi @rkrug, could you try installing this branch and see if this works for your use case in #166?
For example, what you can now do is specify the pages:
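A minimal sketch of such a page-wise workflow, saving each page to its own rds file. To keep it runnable offline, `fetch_page()` is a hypothetical stand-in for a real request (something like an `oa_fetch()` call using the `pages` argument discussed here); the directory and file naming are likewise assumptions.

```r
# Hypothetical stand-in for a real request that returns one page of results.
fetch_page <- function(page, per_page = 3) {
  ids <- (page - 1) * per_page + seq_len(per_page)
  data.frame(id = paste0("W", ids), page = page)
}

out_dir <- file.path(tempdir(), "oa_pages_demo")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

pages <- 1:4
saved <- vapply(pages, function(p) {
  res <- fetch_page(p)
  file <- file.path(out_dir, sprintf("page_%04d.rds", p))
  saveRDS(res, file)  # only one page is held in memory at a time
  file
}, character(1))
```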
One thing I'd like to note is that concatenating these rds files later may still raise memory issues if you're trying to do this in R. I recommend checking out `arrow::write_parquet` to save these outputs as parquet files, which would likely make it easier to combine them later, potentially outside of R.
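A rough sketch of the parquet route, assuming the arrow package is installed (the code is skipped otherwise). The `pages` list is made-up stand-in data, and the directory layout and file names are assumptions for illustration.

```r
# Stand-in for per-page results (not real OpenAlex data)
pages <- list(
  data.frame(id = c("W1", "W2"), year = c(2020L, 2021L)),
  data.frame(id = "W3", year = 2021L)
)

parquet_files <- character(0)

if (requireNamespace("arrow", quietly = TRUE)) {
  # Fresh directory so that only the files written here are present
  out_dir <- tempfile("oa_parquet_")
  dir.create(out_dir)

  # Write one parquet file per page
  for (i in seq_along(pages)) {
    arrow::write_parquet(
      pages[[i]],
      file.path(out_dir, sprintf("page_%04d.parquet", i))
    )
  }
  parquet_files <- list.files(out_dir, pattern = "\\.parquet$")

  # Open the whole directory as one dataset instead of reading and
  # concatenating every file in memory
  ds <- arrow::open_dataset(out_dir)
}
```

The resulting directory of parquet files can also be read by tools outside of R.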