ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Feature Request: `oa_fetch()` multithreaded? #129

Closed rkrug closed 1 year ago

rkrug commented 1 year ago

Running snowball() on a large number of works can take quite some time. Would it be possible (I don't know the limitations of the OpenAlex API) to make this multithreaded? There could actually be more threads than cores, since the limit is likely the bandwidth and response time of the API.

yjunechoe commented 1 year ago

Maybe things could be slightly faster, but at the end of the day it is an API service, so as you guessed there's a hard limit to speed. From their website (emphasis mine):

The API is limited to 100,000 calls per day. If you need more, simply drop us a line at support@openalex.org. There is a burst rate limit of 10 requests per second. So calling multiple requests at the same time could lead to errors with code 429.
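If openalexR ever did issue concurrent requests, the client would have to throttle itself to stay under that 10-requests-per-second burst limit. A minimal sketch of such a client-side throttle in base R (the `throttle` helper here is hypothetical, not part of openalexR):

```r
# Hypothetical throttle: wrap a function so successive calls are spaced
# at least 1/max_per_sec seconds apart (OpenAlex allows 10 requests/s).
throttle <- function(fun, max_per_sec = 10) {
  min_gap <- 1 / max_per_sec
  last <- Sys.time() - min_gap
  function(...) {
    wait <- min_gap - as.numeric(difftime(Sys.time(), last, units = "secs"))
    if (wait > 0) Sys.sleep(wait)
    last <<- Sys.time()
    fun(...)
  }
}

# Stand-in for an API call; a real version would wrap httr::GET.
slow_identity <- throttle(identity, max_per_sec = 10)
res <- sapply(1:5, slow_identity)
```

Even with many worker threads, every worker would have to share one such gate, so the 10/s ceiling caps the total speedup.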

rkrug commented 1 year ago

OK. Makes sense.

Thanks.

rkrug commented 1 year ago

Oh - how many API calls are needed for a snowball() of around 2000 works?

yjunechoe commented 1 year ago

As in, if the input to oa_snowball() is 2000 OpenAlex IDs? To be honest, I'm not sure; I think it varies widely depending on how well-cited the papers in your set are.

Actually, one way to speed up the process is parallelizing the conversion from JSON to data frame. Per paper, this step is actually slower than the request itself. Maybe we'll revisit the code for this at some point.
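Since each record's conversion is independent, that idea could be sketched with parallel::mclapply and a final row-bind. The `record_to_df` converter below is a hypothetical stand-in for openalexR's internal conversion, not its actual code:

```r
library(parallel)

# Hypothetical per-record converter: turns one parsed-JSON record
# (a named list) into a one-row data frame.
record_to_df <- function(rec) {
  data.frame(id = rec$id, display_name = rec$display_name,
             stringsAsFactors = FALSE)
}

records <- list(
  list(id = "W1", display_name = "Paper one"),
  list(id = "W2", display_name = "Paper two")
)

# mclapply forks workers on Unix-alikes; mc.cores must be 1 on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
rows <- mclapply(records, record_to_df, mc.cores = n_cores)
works_df <- do.call(rbind, rows)
```

The fork-based parallelism only pays off when the per-record work dominates; for small result sets the overhead of spawning workers can erase the gain.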

rkrug commented 1 year ago

OK. Thanks a lot. Any improvement in speed would be great!

yjunechoe commented 1 year ago

Just some notes to myself (since I've actually thought about this a bit too):

trangdata commented 1 year ago

Before doing any optimization, I think we need to really pinpoint what the bottleneck is, probably with profvis. Currently, the output list is not that deep, and I think improving simple_rapply would not yield much better speed.

On the side of API calls, I have experienced a great speed improvement with OpenAlex Premium. You may want to write to the OpenAlex team to see if you could obtain an API key for a trial period @rkrug.

Still, I agree that the conversion to dataframes can be slow. @rkrug Could you share an example snippet of how you would do snowball for, say, 50 works? There may be a way to retain the output as lists until the very last step. This example would help us better diagnose where the slowness comes from.

rkrug commented 1 year ago

Nothing special, I would say: calling snowball() with around 2500 IDs.

So with Premium, I would get faster API access and more requests? Nice. I will look into this.

yjunechoe commented 1 year ago

@trangdata I'll move the performance stuff over to a new issue and do some more digging before I attempt anything, but just to note one last thing re: simple_rapply: it takes surprisingly long just in this line inside oa2df():

https://github.com/ropensci/openalexR/blob/66f07433b5efbdff16c581da9fdb754fb649fb4b/R/oa2df.R#L173

I profiled it here: https://rpubs.com/yjunechoe/oa_snowball_profvis1. I'm not sure what about that implementation specifically makes it so slow for such a trivial task, but it takes up over 10% of total run time in my toy example (oa_snowball("W2589424942")).

[screenshot: profvis output]

Update: Just ran the example again with a modified oa2df() where I keep the simple_rapply() version (and just let its output garbage collect) and also test rrapply() for a side-by-side comparison:

[screenshot: side-by-side timing comparison]

trangdata commented 1 year ago

Amazing! Thanks so much @yjunechoe. 🌻 Surprising indeed! And yes a new issue would be great!

trangdata commented 1 year ago

FYI @yjunechoe I'm revising the code and we may not need simple_rapply after all!

trangdata commented 1 year ago

Thinking through this a little more:

I'm making some significant changes in #132. We'll have to add more tests to make sure that removing simple_rapply didn't break anything.

Regarding speed, I think we should keep in mind that a good amount of time is spent waiting for a response from the API (roughly elapsed time minus user time?).

In #132, I also added an options argument to oa_snowball (similar to how you would use it in oa_fetch). This would speed up your data frame conversion a little if you can skip columns you don't need for the plot:

# myids is a character vector of work ids Rainer sent me
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    options = list(select = c("id", "display_name", "authorships", "referenced_works")),
    mailto = "Rainer@krugs.de"
  )
})
#  user  system elapsed 
# 2.795   0.043   5.157 

system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    mailto = "Rainer@krugs.de"
  )
})
#  user  system elapsed 
# 6.110   0.161  10.113

Seeing this and the result from profvis, I'm not sure the JSON conversion is the bottleneck. I'm leaning toward NOT adding RcppSimdJson as a new dependency at the moment.

Last point: be careful with memory. It's easy to run out of memory with oa_snowball. So if you can chunk your work, save the output of each step, and then bring it all back together in a new session, I would try that. I think there is some caching going on behind the scenes with httr::GET that we can't capture. Related: https://github.com/ropensci/openalexR/issues/95#issuecomment-1513964112
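The chunk-and-save workflow could look like the sketch below. The `fetch_chunk` function is a local stand-in for oa_snowball (which would hit the API); the chunk size and file names are arbitrary:

```r
# Placeholder IDs; in practice, myids would be your ~2500 OpenAlex work IDs.
myids <- sprintf("W%07d", 1:250)
chunks <- split(myids, ceiling(seq_along(myids) / 100))

# Stand-in for oa_snowball(identifier = ids, ...).
fetch_chunk <- function(ids) data.frame(id = ids, stringsAsFactors = FALSE)

# Run each chunk in its own pass and persist it to disk, so a crash or
# out-of-memory error only costs one chunk.
out_dir <- tempdir()
for (i in seq_along(chunks)) {
  out <- fetch_chunk(chunks[[i]])
  saveRDS(out, file.path(out_dir, sprintf("snowball_chunk_%03d.rds", i)))
  rm(out); gc()  # release memory before the next chunk
}

# Later (ideally in a fresh session), read the pieces back and combine.
files <- list.files(out_dir, pattern = "^snowball_chunk_", full.names = TRUE)
combined <- do.call(rbind, lapply(files, readRDS))
```

Note that rbind-ing plain data frames works here; combining actual oa_snowball outputs (which are lists of nodes and edges) would need a small merge step per component, and duplicate nodes across chunks would need de-duplicating.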