ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Save pages returned by `oa_request()` and do not process further #181

Status: closed (by rkrug, 9 months ago)

rkrug commented 9 months ago

This is inspired by #166

It is not possible, due to memory limitations, to download a large number of works (thousands or millions). This pull request introduces an argument `save_pages` to the function `oa_request()` which, if provided, saves all downloaded pages in the directory specified by that argument. No further processing of the downloaded pages is done, as this would again lead to memory problems; instead, a vector containing the file names of the saved pages is returned.

At the moment this is a very rudimentary implementation and could be tweaked further for performance, but I don't think that is needed: quick profiling indicated that the API calls take about 80% of the time, while `saveRDS()` takes only about 10%.
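To illustrate the pattern this PR proposes, here is a minimal, self-contained sketch of the save-and-return-paths idea. `save_pages_demo` is a hypothetical helper written for this comment, not the actual implementation in the PR; the dummy `pages` list stands in for pages returned by the API:

```r
# Hypothetical helper sketching the save_pages behaviour:
# write each page to disk as RDS, keep nothing in memory,
# and return the file paths instead of the data.
save_pages_demo <- function(pages, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  paths <- character(length(pages))
  for (i in seq_along(pages)) {
    paths[i] <- file.path(dir, sprintf("page_%04d.rds", i))
    saveRDS(pages[[i]], paths[i])  # page is written out, not accumulated
  }
  paths  # caller gets file names, to be processed later in chunks
}

# Dummy pages standing in for API responses
pages <- list(list(id = "W1"), list(id = "W2"))
paths <- save_pages_demo(pages, file.path(tempdir(), "oa_pages"))
length(paths)            # 2
readRDS(paths[[1]])$id   # "W1"
```

The caller can then `readRDS()` the pages one at a time, so peak memory stays at roughly one page.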

rkrug commented 9 months ago

From my experience with snowballing starting from more than 3000 papers, the limitation really is memory, i.e. R's memory management. One could download the first page, then reserve the memory needed for a single object holding the data from all pages (e.g. by creating a dummy object along the lines of `rep(ONE_PAGE_DATA_structure, n_pages)`) and replace the corresponding slot after each page is downloaded. This would also speed the process up, as only one memory allocation is needed, but it fails if the final object is too large, and I guess the total memory limit kicks in sooner or later anyway. So the fallback of saving each page with `saveRDS()` would be a safety net, used either when a) an argument is set, or b) the object containing the data cannot be created.
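The pre-allocation idea above can be sketched in a few lines of base R. This is a generic illustration with dummy pages, not code from the PR: the container is allocated once up front and each slot is overwritten in place, so the list never has to grow:

```r
# Pre-allocate the container once, then fill slot i after each "download".
n_pages <- 5
results <- vector("list", n_pages)           # single allocation up front
for (i in seq_len(n_pages)) {
  page <- data.frame(id = paste0("W", i))    # stand-in for a downloaded page
  results[[i]] <- page                       # overwrite slot, no growing copy
  # If even this assignment fails for lack of memory, saveRDS(page, ...)
  # would be the safety-net fallback described above.
}
combined <- do.call(rbind, results)          # assemble only at the end
nrow(combined)   # 5
```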

Just my thoughts, and looking forward to your pull request, @trangdata

yjunechoe commented 9 months ago

I'll wait to comment more until we resolve our discussion elsewhere on the design-related issues, but I'll just add that I'm also a little wary of using `saveRDS()`. It's a weird middle ground where you apply some but not all processing to the data received from OA. I think we should either dump out the JSON response exactly as we receive it, OR process it a bit further (e.g. at least make it tabular) so that we can use a more efficient storage format (even CSV is better than RDS).

rkrug commented 9 months ago

Dumping the JSON response would make perfect sense, though probably stored as an RDS to save space when downloading large sets. Let's discuss this further once the other discussion has reached consensus.
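For what it's worth, one way to reconcile "keep the raw JSON" with "save space" is to write the JSON text through a gzip connection, which base R supports directly; no RDS-specific format is needed. A minimal sketch, using a stand-in string rather than a real API response:

```r
# Write the raw JSON response through a gzip connection (base R),
# keeping the exact payload while still compressing on disk.
json_string <- '{"results":[{"id":"W123"}]}'  # stand-in for a raw API page
path <- file.path(tempdir(), "page_0001.json.gz")

con <- gzfile(path, "w")
writeLines(json_string, con)
close(con)

readLines(gzfile(path))  # round-trips to the original JSON text
```

This keeps the stored artifact readable by any JSON tooling, not just R.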

trangdata commented 9 months ago

> I'm also a little wary of using `saveRDS()`. It's a weird middle ground where you apply some but not all processing to the data received from OA.

@yjunechoe Not sure you saw my comment in my Improve paging control PR, but I'll reiterate it here: at the moment, I think it's out of scope for openalexR to handle any type of serialization/IO. We currently provide enough flexibility for the user to do this themselves.

In general, I think making many requests to the OpenAlex database is not a common use case, and I would generally advise against it. One can download the snapshot and proceed with the filtering/processing without needing openalexR.

Therefore, my current suggestion would be to close this PR.

yjunechoe commented 9 months ago

Ah, I'd missed the discussion going on over there! Yes, I don't disagree with closing this as out of scope (my comment was more of an "if we do implement this" comment).

yjunechoe commented 9 months ago

Closing this as out of scope for the package, in favor of #183 and the continuing discussion in #182 (where we can resolve how to streamline this workflow without introducing too much maintenance burden). Apologies for my earlier confusion, and thanks for putting in the work to explore this direction!