ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Download large number of works - automatic slicing? #166

Closed rkrug closed 9 months ago

rkrug commented 10 months ago

Hi

I need to download a huge number of works but, as expected, this currently fails because of memory issues (I ran into the same problem when snowballing, and solved it there by snowballing only a subset of the references).

At the moment I am using this code:

openalexR::oa_query(
    search = paste0(
        "(", params$s_tfc, ")", " AND ",
        "(", params$s_nat, ")"
    )
) |>
    openalexR::oa_request(count_only = FALSE, verbose = TRUE) |>
    saveRDS(file = file.path(".", "tfc_nature.rds"))

Is there any way to fetch the works in batches of, e.g., 5,000 references?

Thanks,

Rainer

trangdata commented 10 months ago

Hi Rainer, any chance you could give me a clearer example? Not sure what your params is...

In any case, my guess is that you can do something similar to how we wrote oa_request (https://github.com/ropensci/openalexR/blob/c56e029d43eb1c82b10893351595cdc3c806fe14/R/oa_fetch.R#L363-L372). Basically, first get the count of items you will receive back, then request one page at a time, each time setting cursor to the value that points to the next page. You may find this explanation on cursor paging very helpful.
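A minimal sketch of the cursor loop described above, in base R. The names walk_pages, fake_fetcher and handle_page are hypothetical and not part of openalexR. The fetcher is a stand-in so the sketch runs offline; a real one would presumably request the OpenAlex works endpoint with cursor=* on the first call and then pass on the next_cursor value from each response's meta field.

```r
# Hypothetical helper (not openalexR API): walk a cursor-paged source one page
# at a time. `fetch_page` is an assumed function(cursor) that returns a list
# with `results` (a list of records) and `next_cursor` (NULL on the last page).
walk_pages <- function(fetch_page, handle_page, max_pages = Inf) {
  cursor <- "*"  # OpenAlex cursor paging starts with cursor=*
  n_pages <- 0L
  while (!is.null(cursor) && n_pages < max_pages) {
    page <- fetch_page(cursor)
    if (length(page$results) == 0L) break  # nothing left to read
    n_pages <- n_pages + 1L
    handle_page(page$results, n_pages)     # process this page only
    cursor <- page$next_cursor
  }
  invisible(n_pages)
}

# Offline stand-in for the API: serves `n_records` fake works, `per_page` at a
# time. The cursor here is simply the index of the next record.
fake_fetcher <- function(n_records, per_page) {
  n_records <- as.integer(n_records)
  per_page <- as.integer(per_page)
  function(cursor) {
    start <- if (identical(cursor, "*")) 1L else as.integer(cursor)
    if (start > n_records) {
      return(list(results = list(), next_cursor = NULL))
    }
    end <- min(start + per_page - 1L, n_records)
    list(
      results = lapply(start:end, function(i) list(id = paste0("W", i))),
      next_cursor = if (end < n_records) as.character(end + 1L) else NULL
    )
  }
}

# Usage: record how many works arrived on each page
seen <- integer(0)
n <- walk_pages(
  fake_fetcher(12L, 5L),
  function(records, i) seen[i] <<- length(records)
)
```

Because handle_page only ever sees one page, the batch size is bounded by the page size rather than by the total number of works.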

rkrug commented 9 months ago

The problem is that oa_request() tries to download all pages, combines them into one object, and only afterwards converts this huge object.

What I would like to have is a kind of "safe mode" in which each page is processed and saved as soon as it has been downloaded, so that the memory requirement is dramatically reduced. This would produce thousands of files, but they can be concatenated in a second step once all pages have been processed.

I hope this clarifies my question.

There might be a speed penalty, but I could live with that.

openalexR::oa_query(search = "transformative change") |>
    openalexR::oa_request(count_only = FALSE, verbose = TRUE) |>
    saveRDS(file = file.path(".", "tfc_nature.rds"))
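
The "safe mode" described above could look roughly like the following base R sketch. All function names here (page_to_df, process_pagewise, combine_pages, fake_source) are hypothetical, the page source is a fake that runs offline, and the converter keeps only two assumed fields; this is an illustration of the idea, not what openalexR or the pull request actually implements.

```r
# Convert one page (a list of records) to a small data frame.
# Assumption: each record has `id` and `title`; a real converter would be
# the package's own conversion step.
page_to_df <- function(records) {
  data.frame(
    id = vapply(records, function(r) r$id, character(1)),
    title = vapply(records, function(r) r$title, character(1)),
    stringsAsFactors = FALSE
  )
}

# Pull pages one by one from `next_page()` (returns NULL when exhausted),
# convert each page immediately and write it to its own .rds file.
process_pagewise <- function(next_page, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  i <- 0L
  repeat {
    records <- next_page()
    if (is.null(records)) break
    i <- i + 1L
    saveRDS(
      page_to_df(records),
      file.path(out_dir, sprintf("page_%06d.rds", i))
    )
  }
  i  # number of files written
}

# Second step: concatenate the per-page files in page order.
combine_pages <- function(out_dir) {
  files <- sort(list.files(
    out_dir, pattern = "^page_[0-9]+\\.rds$", full.names = TRUE
  ))
  do.call(rbind, lapply(files, readRDS))
}

# Offline stand-in for the download step: `n_pages` pages of fake works.
fake_source <- function(n_pages, per_page) {
  i <- 0L
  function() {
    if (i >= n_pages) return(NULL)
    i <<- i + 1L
    ids <- ((i - 1L) * per_page + 1L):(i * per_page)
    lapply(ids, function(k) {
      list(id = paste0("W", k), title = paste("Work", k))
    })
  }
}

out_dir <- file.path(tempdir(), "oa_pages")
unlink(out_dir, recursive = TRUE)  # start from a clean directory
n_files <- process_pagewise(fake_source(3L, 4L), out_dir)
works <- combine_pages(out_dir)
```

Only one page is held in memory during the download phase. The final rbind still needs the full table in memory, but by then the records are already in their compact, converted form.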

rkrug commented 9 months ago

Please see pull request https://github.com/ropensci/openalexR/pull/181 for a possible solution.