Closed — rkrug closed this 9 months ago
From my experience with snowballing starting from more than 3000 papers, the limitation is really memory, i.e. the memory management of R.
If one were to download one page at a time and reserve up front the memory for a single object holding the data from all pages (e.g. `rep(ONE_PAGE_DATA_structure, n_pages)` as a dummy object, with each page's slot replaced in place as it is downloaded), it could work, unless the final object is too large. This approach would also speed the process up, as only one memory allocation is needed.
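A minimal base-R sketch of that pre-allocation idea (all names here, including `fetch_page()` and the page structure, are made up for illustration; the real data returned from OpenAlex would differ):

```r
# Sketch: pre-allocate one slot per page, then fill each slot as the
# page is downloaded, so the container is reserved only once up front.
n_pages <- 5

# Dummy stand-in for one downloaded page (hypothetical helper)
fetch_page <- function(i) {
  data.frame(id = paste0("W", i), page = i)
}

# Reserve a slot for every page before downloading anything
all_pages <- vector("list", n_pages)

for (i in seq_len(n_pages)) {
  all_pages[[i]] <- fetch_page(i)  # replace the i-th slot in place
}

# Combine into one object at the end
works <- do.call(rbind, all_pages)
nrow(works)  # 5
```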
But still, I guess the total memory limit kicks in sooner or later. So the fallback solution of saving each page using `saveRDS()` would be a safety net, which one could use if (a) an argument is set, or (b) the object containing the data cannot be created.
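That safety-net variant could look roughly like this (a self-contained sketch; `fetch_page()` and the output directory are placeholders, not part of openalexR):

```r
# Sketch: save each page to disk as it arrives instead of holding
# everything in memory, and return the file names.
out_dir <- tempfile("pages_")
dir.create(out_dir)

fetch_page <- function(i) data.frame(id = paste0("W", i), page = i)  # dummy

n_pages <- 3
files <- character(n_pages)
for (i in seq_len(n_pages)) {
  files[i] <- file.path(out_dir, sprintf("page_%03d.rds", i))
  saveRDS(fetch_page(i), files[i])  # only one page in memory at a time
}

files  # vector of file names, to be read back later with readRDS()
```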
Just my thoughts, and looking forward to your pull request, @trangdata
I'll wait to comment more until we resolve our discussion elsewhere on the design related issues, but I'll just add that I'm also a little wary of using saveRDS. It's a weird middle ground between where you apply some but not all processing to the data received from OA. I think we should either just dump out the JSON response exactly as we receive it, OR process it a bit further (ex: at least make it tabular) so that we can use a more efficient storage format (even csv is better than rds).
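To illustrate the two ends of that spectrum with base R (dummy data only; the real OpenAlex response is much richer):

```r
# Option A: dump the raw JSON response exactly as received
raw_json <- '{"results":[{"id":"W1","cited_by_count":10}]}'  # dummy body
json_file <- tempfile(fileext = ".json")
writeLines(raw_json, json_file)

# Option B: process it a bit further into a tabular form and use a
# plain-text storage format such as csv
tab <- data.frame(id = "W1", cited_by_count = 10)
csv_file <- tempfile(fileext = ".csv")
write.csv(tab, csv_file, row.names = FALSE)
```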
Dumping the JSON response would make perfect sense, but probably as an RDS to save space when downloading large sets. Let's discuss this further when the other discussion has reached consensus.
I'm also a little wary of using saveRDS. It's a weird middle ground between where you apply some but not all processing to the data received from OA.
@yjunechoe Not sure you saw my comment in my Improve paging control PR, but I'll reiterate it here: at the moment, I think it's out of scope for openalexR to handle any type of serialization/IO. We currently provide enough flexibility for the user to do this themselves.
In general, I think making many requests to the OpenAlex database is not a common use case, and I would generally advise against it. One can download the snapshot and proceed with the filtering/processing without needing openalexR.
Therefore, my current suggestion would be to close this PR.
Ah, I'd missed the discussion going on over there! Yes, I don't disagree with just closing this as out of scope (my comment was more of a "if we do implement this" comment).
Closing this as out of scope for the package, in favor of #183 and the continuing discussion on #182 (where we can resolve how to streamline this workflow without introducing too much maintenance burden). Apologies for my earlier confusion, and thanks for putting in the work to explore this direction!
This is inspired by #166
Due to memory limitations, it is not possible to download a large number of works (thousands or millions). This pull request introduces an argument `save_pages` to the function `oa_request()` which, if provided, saves all downloaded pages in the directory specified by that argument. No further processing of the downloaded pages is done, as this would again lead to memory problems; instead, a vector containing the file names of the pages is returned. This is at the moment a very rudimentary implementation and could be tweaked further for performance, but I don't think that is needed: a quick profiling indicated that the API calls take about 80% of the time, while `saveRDS()` takes only about 10%.
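The returned vector of file names could then be consumed page by page, keeping memory usage flat (self-contained sketch; the pages below are dummies standing in for what `oa_request()` would have saved):

```r
# Simulate a directory of saved pages, as save_pages would produce
dir <- tempfile("oa_pages_")
dir.create(dir)
for (i in 1:3) {
  saveRDS(data.frame(id = paste0("W", i)),
          file.path(dir, sprintf("page_%d.rds", i)))
}
files <- list.files(dir, pattern = "\\.rds$", full.names = TRUE)

# Process one page at a time instead of loading everything at once
counts <- vapply(files, function(f) nrow(readRDS(f)), integer(1))
total_works <- sum(counts)
total_works  # 3
```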