ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

openalexR and `parallel::mclapply()`: Multicore API calls fail when no single-core API call was issued before. #189

Closed rkrug closed 7 months ago

rkrug commented 11 months ago

Hi

I am using parallel::mclapply() to make parallel API calls, and these fail when no single-core call has been issued before:

library(openalexR)

## This fails:

parallel::mclapply(1:2, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE)})

## Here is the single core call

oa_request(oa_query("biodiversity"), count_only = TRUE)

## Now it works:

parallel::mclapply(1:2, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE)})

The error message is:

objc[80975]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[80976]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[80975]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[80976]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[[1]]
NULL

[[2]]
NULL

Warning message:
In parallel::mclapply(1:2, function(x) { :
  scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected

It might be necessary to have an OpenAlex Premium key for testing.

But if you have an idea, I would be happy to test.
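
For reference, the workaround described above (issue one single-core call, then fork) could be wrapped in a small helper. This is only a sketch: `warm_then_mclapply()` is a hypothetical name, not part of openalexR or {parallel}, and it papers over the crash rather than explaining it.

```r
library(parallel)

# Hypothetical helper (not part of openalexR or parallel):
# run `warmup()` once in the parent process, then fan out with mclapply().
# Note: mclapply() relies on fork(), so mc.cores > 1 is not supported on Windows.
warm_then_mclapply <- function(X, FUN, warmup, ..., mc.cores = 2L) {
  warmup()  # the single-core call issued before any fork
  mclapply(X, FUN, ..., mc.cores = mc.cores)
}

# Possible usage with the calls from this report (needs network access):
# count_biodiversity <- function(x = NULL) {
#   oa_request(oa_query("biodiversity"), count_only = TRUE)
# }
# warm_then_mclapply(1:2, count_biodiversity, warmup = count_biodiversity)
```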

yjunechoe commented 11 months ago

This is an interaction between {progress} and {parallel}. We use {progress} to print the progress bar, and the progress bar is stateful - I don't know the internals of {parallel}, but my suspicion is that you have a race condition with each thread updating the same progress state.

I think this should go away if you disable the progress bar, but now I also realize that oa_request() still creates a progress object even with verbose = FALSE. Maybe this is trivial but - @trangdata was there a reason why the progress bar's creation is outside the verbose if-clause?

https://github.com/ropensci/openalexR/blob/32855b6a4a1ca64468a7119513b0d7f275c0e24e/R/oa_fetch.R#L369-L378

trangdata commented 11 months ago

@yjunechoe you're right. oa_progress should be inside the if clause.

rkrug commented 11 months ago

Thanks for looking into this - I will try it out as soon as it is changed.

trangdata commented 10 months ago

So it looks like oa_progress is actually in some other functions outside of verbose, such as oa_ngrams. Should we wrap it in an if (verbose){} clause @yjunechoe?

yjunechoe commented 10 months ago

Yeah I think that'd be safest!
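
The change being discussed, creating the progress object only when `verbose` is TRUE, might look roughly like the sketch below. It is not openalexR's actual code: `fetch_pages()` is a hypothetical stand-in for a paging loop, and it uses base R's `utils::txtProgressBar()` instead of {progress} so that it runs without extra packages.

```r
# Hypothetical paging loop; not openalexR's actual implementation.
fetch_pages <- function(n_pages, verbose = FALSE) {
  pb <- NULL
  if (verbose) {
    # The progress object is only created when it will actually be shown.
    pb <- utils::txtProgressBar(min = 0, max = n_pages, style = 3)
    on.exit(close(pb), add = TRUE)
  }
  out <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    out[[i]] <- i  # stand-in for one page of results
    if (verbose) utils::setTxtProgressBar(pb, i)
  }
  out
}

pages <- fetch_pages(3, verbose = FALSE)  # no progress object is created here
```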

rkrug commented 10 months ago

Unfortunately, this did not solve the issue. I installed from GitHub and it still crashes:

r$> parallel::mclapply(1:10, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE, verbose = FALSE)})
objc[9825]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[9824]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[9825]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[9824]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Just to be sure, I used debugonce(openalexR:::oa_progress) before running on a single core, and execution never entered that function. So the problem must be somewhere else.

rkrug commented 10 months ago

OK - the problem is upstream in httr:

library(httr)
parallel::mclapply(1:2, function(x){httr::GET("http://google.com/", path = "search")})

It is independent of https://community.rstudio.com/t/running-parallel-on-mac/142580/6 (although I don't know if it only affects M1 Macs). I filed a bug at https://github.com/r-lib/httr/issues/749.

I do not know if the error occurs on Intel Macs, Windows or Linux - I have an M1 Mac.

It also occurs in httr2, which superseded httr:

r$> library(httr2)
    req <- httr2::request("http://google.com")
    parallel::mclapply(1:2, function(x){httr2::req_perform(req)})
objc[50637]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[50637]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[50638]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[50638]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[[1]]
NULL

[[2]]
NULL

Warning message:
In parallel::mclapply(1:2, function(x) { :
  scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected
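
Since the objc messages point at fork(), one thing that might be worth trying is a PSOCK cluster, which starts fresh R worker processes instead of forking the current one. This is an assumption, not a confirmed fix for this issue, and `psock_lapply()` is a hypothetical helper name used only for this sketch.

```r
library(parallel)

# Hypothetical helper: like lapply(), but spread over fresh R worker processes.
# makeCluster() creates a PSOCK cluster by default, so no fork() is involved.
psock_lapply <- function(X, FUN, n_workers = 2L) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
  parLapply(cl, X, FUN)
}

# Possible usage with the request from this comment (needs network access;
# httr2 must be installed, since each worker loads it via `::`):
# psock_lapply(1:2, function(x) {
#   httr2::req_perform(httr2::request("http://google.com"))
# })
```

The trade-off is that workers do not share the parent's loaded packages or variables, so anything the function needs has to be referenced with `::` or exported explicitly.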