@ds-jim Thanks James, I will incorporate your suggestion. How large are your datasets? Could you provide some benchmarks?
Just to reference the last time we talked seriously about performance - https://github.com/ropensci/openalexR/issues/129
@trangdata I'm looking at publication lists for entire conditions from 2000 onwards. I tried diabetes (~600k works) and it was still running 12+ hours later, so I stopped and went smaller. I next tried a title-only search for malaria, which has 76,000 publications, and after 30 minutes it hadn't completed, but I spotted the message about truncating authors so I ran the search again with `output = "list"` and it completed in 18 minutes. I suspect the memory requirement is the issue in trying to "flatten" all that data. I'm using a 16 GB M1 MacBook.
First, some general points:

- `output = "list"` is indeed the way to go if you're fetching thousands of works in a single query. You can then incrementally process the output afterwards, and at that point you have the option to, e.g., parallelize (a rough sketch below).
- `works2df()` is indeed a bit expensive to run, as it reconstructs the `output = "list"` result into a specific "tidy" data frame. We put a lot of thought into how the various pieces of information scattered around the JSON should be organized into columns, and the complexity of `works2df()` reflects this fact.

That said, I agree that things could be faster, especially for `works2df()`. See the full profiling results here: https://rpubs.com/yjunechoe/works2df_profvis1
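On the first point, here is a rough sketch of incremental, chunked conversion (here `works_list` stands in for a result from `oa_fetch(..., output = "list")`; the chunk size and the `parallel::mclapply()` alternative are only illustrative):

library(openalexR)

# `works_list` is assumed to come from oa_fetch(..., output = "list").
# Split it into chunks of ~1000 works and convert each chunk separately,
# which keeps memory pressure bounded.
chunks <- split(works_list, ceiling(seq_along(works_list) / 1000))
chunk_dfs <- lapply(chunks, works2df)

# On Unix-alikes, parallel::mclapply(chunks, works2df, mc.cores = 4)
# could replace lapply() above to parallelize.

works_df <- do.call(rbind, chunk_dfs)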
I benchmarked converting 1000 works from `output = "list"` using `works2df()`:
res1_list <- oa_fetch(identifier = "W2741809807", output = "list")
res1000_list <- rep(list(res1_list), 1000)
profvis::profvis({
works2df(res1000_list)
})
Overall, it takes 4.5 seconds to process 1000 works, which isn't too bad, but note that if you're doing tens or hundreds of thousands at a time, performance will be compounded by pressure on memory.
Digging into the profvis results, I see:
The top three most expensive calls are `process_topics()`, `mapply(..., subs_na)`, and `subs_na()`. The performance costs of the 2nd and 3rd (accounting for ~25% of total execution time) are both driven by `subs_na()` and, in turn, by `replace_w_na()`. This was already a target of optimization back in #132, but I think we're seeing problems again here as OpenAlex is returning larger and more nested JSONs (this matters because, IIRC, we're visiting every node to replace `NULL` with `NA`). I feel like this is a very low-level optimization problem for which base R tools are just inadequate - I'll just float the idea of using {rrapply} for this again: https://github.com/ropensci/openalexR/issues/129#issuecomment-1642114224
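For context, a conceptual sketch of what that recursive NULL-to-NA replacement looks like in base R (this is not the package's actual `subs_na()`/`replace_w_na()` code, just an illustration of why every node gets visited):

# Illustration only, not openalexR's implementation: walk the parsed JSON
# and substitute NA for every NULL leaf.
null_to_na <- function(x) {
  if (is.list(x)) {
    lapply(x, null_to_na)  # recurse into every element
  } else if (is.null(x)) {
    NA
  } else {
    x
  }
}

str(null_to_na(list(name = "Tom", age = NULL)))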
But the greatest offender is `process_topics()` (~half of total execution time), which I'm kind of surprised by. Zooming into `process_topics()`, we actually see `subs_na()` show up again, taking up ~10% of execution time. `tibble::rownames_to_column()` stands out to me as the obvious low-hanging fruit here (~25% of total execution time) - we can just write our own util that does the same in base R (a rough sketch below).
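A possible base-R stand-in (just a sketch of the idea, not necessarily what the package would end up with):

# Hypothetical helper with the same behavior as tibble::rownames_to_column(),
# using base R only.
rownames_to_column_base <- function(df, var = "rowname") {
  out <- cbind(
    stats::setNames(data.frame(rownames(df), stringsAsFactors = FALSE), var),
    df
  )
  rownames(out) <- NULL
  out
}

rownames_to_column_base(mtcars[1:3, 1:2], var = "model")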
The rest seem too trivial or scattered to act on. The next low-hanging fruit would probably be optimizing this line of `cbind()` code in `process_topics()`, but it already only takes up <1% of execution time.
https://github.com/ropensci/openalexR/blob/774aff7c6160163bd7b28960e864425079b8d5a2/R/utils.R#L166
I think a good first step would be factoring out `tibble::rownames_to_column()`. Then we can see if we can squeeze more performance out of `subs_na()` (though I'm kind of out of ideas there).
Thank you, June, for the thorough analysis! You're right - `process_topics()` was a function we recently added to accommodate the new `topics` entity in OpenAlex, and I haven't profiled it. So this is super helpful! Do you want to submit a PR to grab these low-hanging fruits first? 🙏🏽
I just saw https://www.r-bloggers.com/2024/09/json-null-values-and-as_tibble/ - possibly you saw it as well. I do not know if it helps, just wanted to drop it here.
Thanks @rkrug for the reference - the post you shared is a good approach in principle, but from a quick glance I see that it involves a nested iteration of `purrr::map_dfr()`, which in turn calls `purrr::map()`. I'm cautious of solutions which don't explicitly and intentionally optimize for performance - a quick benchmark shows that it's about 10 times slower than what we have currently via `subs_na()`:
data_json <- '
[
{
"name": "Tim",
"age": 34,
"hobby": "footbal"
},
{
"name": "Tom",
"age": null,
"hobby": "baseball"
},
{
"name": "Shelly",
"age": 21,
"hobby": null
}
]
'
parsed_json <- rjson::fromJSON(data_json)
library(purrr)
fix_NULL <- function(variables) {
variables <- variables %>%
map(~ ifelse(is.null(.x), NA, .x))
as.data.frame(variables)
}
bench::mark(
current = openalexR:::subs_na(parsed_json, type = "rbind_df")[[1]],
purrr = map_dfr(parsed_json, ~ fix_NULL(.x)),
)
#> # A tibble: 2 Ă— 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current 227.4µs 318.5µs 2750. 93.26KB 12.8
#> 2 purrr 2.9ms 3.82ms 247. 7.28MB 6.28
I think addressing the performance of `subs_na()` is going to be difficult and we'd need to think very seriously about this, as what we have is already very optimized as far as base R goes. We may have to reach into C code, but I don't have good intuitions for that at the moment.
I did not read through it in detail - so this is likely the case. I remember seeing a JSON package that uses a C library for handling JSON - maybe that would be an approach, starting at the returned JSON level? Again, I haven't looked at it in detail.
Yes, there are some other packages for faster reading of JSON based on C/C++, like yyjsonr and rcppsimdjson. I'm a little hesitant to jump straight into integrating these tools, though - I think we can squeeze a lot more mileage in performance from just promoting better practices for making large queries (e.g., chunk them up, download as results are returned, separate the task of getting query results from converting them into data frames, etc.).
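For example, chunking could look something like this (a rough sketch; the filter names and file naming are only illustrative):

# Illustrative chunking of a big query by publication year; each year's
# results are saved to disk before moving on, so little is held in memory.
library(openalexR)

for (y in 2000:2024) {
  res_y <- oa_fetch(
    entity = "works",
    title.search = "malaria",
    publication_year = y,
    output = "list",
    verbose = TRUE
  )
  saveRDS(res_y, sprintf("malaria_works_%d.rds", y))
}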
And just for completeness, back on the note of `subs_na()` - previously I'd floated a suggestion for using {rrapply}, which is implemented in C. I'm not yet 100% sold on adding this extra dependency for maintenance reasons, but an obvious performance improvement like this would be ideal:
library(rrapply)
parsed_json_many <- rep(parsed_json, 1000)
bench::mark(
cur = openalexR:::subs_na(parsed_json_many, type = "rbind_df")[[1]],
purrr = map_dfr(parsed_json_many, ~ fix_NULL(.x)),
rrapply = rrapply(rrapply(parsed_json_many, is.null, \(...) NA), how = "bind")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 Ă— 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 cur 57.26ms 85.26ms 13.2 353KB 7.54
#> 2 purrr 1.27s 1.27s 0.790 484KB 3.16
#> 3 rrapply 6.12ms 6.71ms 131. 486KB 3.98
> @trangdata I'm looking at publication lists for entire conditions from 2000 onwards. I tried diabetes (~600k works) and it was still running 12+ hours later, so I stopped and went smaller. I next tried a title-only search for malaria, which has 76,000 publications, and after 30 minutes it hadn't completed, but I spotted the message about truncating authors so I ran the search again with `output = "list"` and it completed in 18 minutes. I suspect the memory requirement is the issue in trying to "flatten" all that data. I'm using a 16 GB M1 MacBook.
@ds-jim What do you plan to analyze with these records? A few things that might help speed up your call:

- The `options = list(select = ...)` argument of `oa_fetch()` allows you to specify the fields you're interested in getting. For example, if you only need `publication_year` and `publication_date`, you can specify that in your call. (my example)
- `output = "list"` might be the way to go if you don't need the data.frame output or plan to convert it to a data.frame afterwards.
- `oa_generate()` allows you to build the query and get one record at a time, if you worry about memory (a rough sketch below).
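A rough sketch of the `oa_generate()` route (see `?oa_generate` for the exact interface; the query here is only an example):

# oa_generate() takes a query URL (e.g., from oa_query()) and returns a
# generator that yields one record per call, which keeps memory usage small.
library(openalexR)

query_url <- oa_query(entity = "works", search = "malaria")
get_next <- oa_generate(query_url, verbose = TRUE)

first_record <- get_next()   # one work at a time
second_record <- get_next()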
I have modified `oa_request()` so that it saves each page after it is retrieved. This accomplishes the same thing as the `oa_generate()` approach @trangdata suggested, but in an, imho, easier way. I have used it several times, downloading up to 4.5 million records. You can find it at https://github.com/IPBES-Data/IPBES.R/blob/dev/R/oa_request.R. Please ask if you have questions or want to follow that approach.
I wanna emphasize the `options = list(select = ...)` + `output = "list"` combo even further. The performance of large queries benefits greatly from being careful and intentional about what kinds of information you care about, and (safely) making assumptions that let you take shortcuts. The lesson is: optimizations in code (like `gsub` with `fixed` vs. `perl`) can only get you so far - more mileage can be gained from planning ahead.
For example, if you only care about scalar fields like `id`, `cited_by_count`, etc., it suffices to specify them in `select`, fetch using `output = "list"`, and combine into a data frame afterwards with something more performant, like `data.table::rbindlist()`.
The following code fetches and processes 10,000 works objects in 40 seconds.
library(dplyr)
malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>%
filter(display_name == "Malaria") %>%
pull(id)
malaria_topic
#> [1] "https://openalex.org/T10091"
# Scalar fields
select_fields <- c("id", "cited_by_count", "display_name")
system.time({
res <- oa_fetch(
topics.id = malaria_topic,
entity = "works",
verbose = TRUE,
options = list(sample = 10000, seed = 1,
select = select_fields),
output = "list"
)
})
#> Requesting url: https://api.openalex.org/works?filter=topics.id%3Ahttps%3A%2F%2Fopenalex.org%2FT10091&sample=10000&seed=1&select=id%2Ccited_by_count%2Cdisplay_name
#> Getting 50 pages of results with a total of 10000 records...
#> OpenAlex downloading [=====================] 100% eta: 0s
#> user system elapsed
#> 0.45 0.04 41.39
system.time({
res_df <- data.table::rbindlist(res)
})
#> user system elapsed
#> 0.00 0.00 0.02
res_df
#> Rows: 10,000
#> Columns: 3
#> $ id <chr> "https://openalex.org/W2331399312", "https://openalex.org/W3213308635", …
#> $ cited_by_count <int> 1, 0, 3, 28, 0, 155, 0, 15, 1, 2, 15, 76, 8, 1, 5, 0, 68, 1, 3, 0, 57, 4…
#> $ display_name <chr> "Antimalarial Activity of Zincke’s Salts", "1104 Update on the study of …
I completely agree.
One question: to make the whole process even more flexible, it could be an option to return not the list but the JSON. This could be stored and would open many possibilities, even accessing it via DuckDB, which can read collections of JSON files as a dataset. This would also make the conversion easier, as DuckDB could do the job.
Easiest option: specifying json as the output format saves each page into a directory. This would catch many use cases, increase speed, and be a useful feature for power users.
Hi @trangdata,
That's super helpful, thank you. I'd completely missed the `select` option within the function (sorry all) - there's tons of data coming back from OpenAlex that I don't need. As @yjunechoe points out, `select` with `output = "list"` is the ticket for larger data sources.
@rkrug I think your solution might help for intermediate sizes, but I have quite large areas I'm wanting to assess. I have all the results I need saved in list format as RDS. Even still, extracting these is proving difficult, so my next stop was going to be converting the list back to JSON and loading it into Postgres. I've previously found the JSON field type there is handled extremely swiftly, although I'll give DuckDB a little try as there's less overhead required!
I think an option for `output = "json"` would be helpful for those of us one step shy of importing the whole snapshot into BigQuery. However, the conversion from list to JSON should be fairly trivial.
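For example, something like this (a sketch only, assuming `res` came from `oa_fetch(..., output = "list")`; the file name is arbitrary):

# Serialize the list back to JSON and write it out, ready for Postgres or
# DuckDB to ingest.
library(jsonlite)

json_txt <- toJSON(res, auto_unbox = TRUE, null = "null")
writeLines(json_txt, "works_page.json")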
@rkrug ~Is there a context where you'd want a raw json as opposed to a list? I can't imagine wanting a json instead of an R list unless you want to consume the data elsewhere like Javascript.~ Sorry - I somehow completely missed the point about databases. I think that's reasonable but more thoughts from me below:
Also, I raised this before https://github.com/ropensci/openalexR/issues/205#issuecomment-1938781679, but one thing to be careful of is that for large records that come in batches via pages, a raw JSON output will actually not be a single JSON but a character vector of chunked JSON strings, and there's no way to merge them without parsing them first. Given that, I think `output = "json"` would be at best misleading. IMO `output = "list"` is as "raw" as we can get while returning a single query result.
Lastly, for parsing we use `{jsonlite}`, which is round-trippable, so if the JSON format is important users can always recover it post-hoc (or just do a simple GET on the output of `oa_query()`).
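E.g., something along these lines (a sketch; note it fetches only a single page, so paging would still need to be handled manually):

# Grab the raw JSON for a query yourself via {httr}.
library(openalexR)
library(httr)

url <- oa_query(entity = "works", search = "malaria")
resp <- GET(url)
stop_for_status(resp)
raw_json <- content(resp, as = "text", encoding = "UTF-8")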
> Also, I raised this before https://github.com/ropensci/openalexR/issues/205#issuecomment-1938781679, but one thing to be careful of is that for large records that come in batches via pages, a raw JSON output will actually not be a single JSON but a character vector of chunked JSON strings, and there's no way to merge them without parsing them first. Given that, I think `output = "json"` would be at best misleading. IMO `output = "list"` is as "raw" as we can get while returning a single query result.
Just for clarification: I would expect OpenAlex to return valid JSON when called to return a certain page. In my modified implementation of `oa_request()` I save each page as a list (https://github.com/IPBES-Data/IPBES.R/blob/72d354e348ad225a7ddffe7a583ba708624afa0c/R/oa_request.R#L120-L130), which works. In a second step I convert these to parquet format (with several data standardisations in between). So I assume that the returned JSON must be valid, and saving it as JSON should also produce valid JSON.
If you are referring to individual records - isn't the paging done per record by OpenAlex? And when looking at the code, I would say each page is returned as an individual entry in `data` (https://github.com/IPBES-Data/IPBES.R/blob/72d354e348ad225a7ddffe7a583ba708624afa0c/R/oa_request.R#L128)?
The nice thing about openalexR is the implementation of the paging - by using `oa_query()` one has to do this manually. So quite a bit of additional work which can be avoided.
Concerning databases: offering the option of saving the JSON per page, plus a function which uses DuckDB to read the JSON files (returning a database object which can be fed into a dplyr pipeline and read by calling `collect()` at the end), should really be the easiest and relatively straightforward way to make all conversions unnecessary, as they are done by DuckDB (in theory...).
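Roughly what I have in mind (an untested sketch; assumes one JSON file per page in a "pages/" directory and a recent DuckDB with the JSON extension):

# Let DuckDB read the per-page JSON files directly and query them;
# the resulting columns depend on how each page was written out.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
pages <- dbGetQuery(con, "SELECT * FROM read_json_auto('pages/*.json')")
dbDisconnect(con, shutdown = TRUE)

# (dplyr users could instead wrap the connection with dplyr::tbl() via
# {dbplyr} and call collect() at the end of the pipeline.)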
We should really move this to a separate issue so our discussion on this doesn't get more scattered than it already is, but just to quickly respond:
I'm not opposed to an `output = "raw/json"` option in theory, and I don't have any concerns about OpenAlex returning "invalid" JSON. It's just that the exact structure of the output and how best to ingest the JSON(s) depends on how the query is constructed, which is an implementation detail that's better hidden from users entirely - this is the primary goal of `{openalexR}` as an API package that returns R data frames.
My main thought is that between `output = "list"` and `oa_query()`, `{openalexR}` already provides enough of a lower-level interface for power users (amongst API packages, we already go above and beyond in providing these interfaces). The power power-users who'd like the raw JSON can simply recruit an additional, more powerful tool like `{httr}` to hit the output of `oa_query()` directly and grab its JSON contents from there - and then these power power-users can also add their other preferred strategies on top, like retrying, or maybe the database language provides remote streaming of JSON and you'd prefer that instead. All in all, we would rather encourage power users to take control of their own advanced strategies using what we already offer (namely, via `oa_query()`), rather than bake the individually preferred strategies into the package itself. As I've said before, I prefer not to chase after ...

> ... a small net-positive for power users (who, in their diversity, may not even agree with the kind of [low-level features we implement] requiring them to write their own custom behaviors anyways)
I have to admit that the database feature is new to me, so maybe I could be convinced that the JSON file format could be of interest to a broader audience. But I wonder - how useful is the data after such an automatic ingestion into databases? The best parts of `{openalexR}` are data-processing functions like `works2df()` - why would one opt out of this (or of data processing in R entirely)? And if a user wants to opt out of JSON processing, why not opt out a little earlier at `oa_query()` and take control from there? As a developer, these questions about the certainty and generality of a feature's usefulness are what I'm most concerned about, not so much its technical feasibility and theoretical use cases. As I've stated before, my philosophy is always that development should be driven by the needs of the average user, especially because of our limited resources.
Thanks for your patience.
I will draw up an example of what I am thinking about, using my modified version of `oa_request()`, and also add how to process it further - but it will likely not be this week. I will post it in a new thread.
For what it's worth, I think you're doing a great service by digging into these issues very deeply! My concerns about "should the ability to do X be integrated into the package" are entirely separate from "should users get to know that they can do X" - your own implementation serves as a great example for the power users who want to get more out of `{openalexR}`. If you want to share your approach here and link to your implementation for other power users to tweak and adopt, I'd be very happy with that.
Thanks. I will keep you posted.
Hi, I ran your code to fetch all the fields.
malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>%
  filter(display_name == "Malaria") %>%
  pull(id)
malaria_topic

system.time({
  res <- oa_fetch(
    topics.id = malaria_topic,
    entity = "works",
    verbose = TRUE,
    options = list(sample = 1000, seed = 1),
    output = "list"
  )
})
It took 76 seconds to pull all the fields (versus 40 seconds to pull only the selected fields).
I used the following code to fetch 10,000 records. It took 260 seconds. The only add-on is the `batch_identifiers` array. Any thoughts?
system.time({
  batch_identifiers <- org_ref_works_cited[1:10000]
  res <- oa_fetch(
    identifier = batch_identifiers,
    entity = "works",
    options = list(sample = 10000, seed = 1),
    output = "list"
  )
})
Hi Yan — I've moved your question to #278.
Hi all — I'm closing this issue because the discussion is getting too fragmented. Please feel free to open new specific issues if you think there are points we have not fully resolved. 🙏🏽
Performance converting large datasets to data.frame is very slow. I doubt this will transform matters but every little helps.
In utils.R you have two functions which would run faster with an explicit (fixed) find-and-replace rather than a regex replace.
Original:
Suggestion:
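(The original and suggested snippets are not reproduced here; purely as a generic illustration of the fixed-vs-regex difference, with a made-up input vector:)

# Illustration only - fixed string matching skips the regex engine.
x <- rep("https://openalex.org/W2741809807", 1e5)

bench::mark(
  regex = gsub("https://openalex.org/", "", x),
  fixed = gsub("https://openalex.org/", "", x, fixed = TRUE)
)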