posit-dev / positron

Positron, a next-generation data science IDE
https://positron.posit.co
Other
2.77k stars 85 forks source link

Ark: Tests involving `r_task()` run from spawned threads can sporadically fail #4222

Closed DavisVaughan closed 1 month ago

DavisVaughan commented 3 months ago

Related to https://github.com/posit-dev/positron/issues/2791

We are currently having sporadic failures in our Ark tests related to the Data Explorer https://github.com/posit-dev/ark/actions/runs/10219354910/job/28277437765#step:6:529

I believe this is only a testing problem, but it is still high priority to fix, because sporadic failures are quite bad. I've marked this as one of the few things we definitely want to look at right after conf.

The failure has been somewhat random, but always in the data explorer. The easiest way to reproduce is to run this many times locally

cargo test -p ark --test data_explorer

A failure looks something like one of these

``` ---- test_live_updates stdout ---- thread 'ark-data-viewer-obj-079049e2-5e93-456f-b3c1-8886409ee1f8' panicked at crates/harp/src/table.rs:103:14: called `Result::unwrap()` on an `Err` value: Error evaluating 'base::dim.data.frame(structure(list(y = c(0, 2, 1), z = c(4, 5, 6)), row.names = c(NA, -3L), class = "data.frame"))': R_Reprotect: only 13 protected items, can't reprotect index 13 R backtrace: 0. base::dim.data.frame(structure(list(y = c(0, 2, 1), z = c(4, 5, 6)), row.names = c(NA, -3L), class = "data.frame")) R thread backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5 1: std::backtrace_rs::backtrace::trace_unsynchronized at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 2: std::backtrace::Backtrace::create at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/backtrace.rs:331:13 3: harp::exec::try_catch::handler::{{closure}} at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:246:30 4: harp::exec::try_catch::handler at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:222:27 5: do_internal at /Volumes/Builds/R4/R-4.4.1/src/main/names.c:1409:11 6: Rf_eval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:1237:12 7: R_execClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2398:22 8: applyClosure_core at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2311:16 9: Rf_applyClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2333:16 10: bcEval_loop at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:8112:16 11: bcEval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:7524:16 12: Rf_eval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:1167:8 13: R_execClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2398:22 14: applyClosure_core at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2311:16 15: Rf_applyClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2333:16 16: Rf_eval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:1285:12 17: vsignalError at /Volumes/Builds/R4/R-4.4.1/src/main/errors.c:1852:3 18: Rf_errorcall at /Volumes/Builds/R4/R-4.4.1/src/main/errors.c:917:5 19: Rf_error at /Volumes/Builds/R4/R-4.4.1/src/main/errors.c:958:5 20: R_signal_reprotect_error at /Volumes/Builds/R4/R-4.4.1/src/main/memory.c:3482:5 21: R_LookupMethod 22: Rf_usemethod at /Volumes/Builds/R4/R-4.4.1/src/main/objects.c:500:8 23: Rf_DispatchOrEval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:4226:9 24: do_length at /Volumes/Builds/R4/R-4.4.1/src/main/array.c:464:8 25: bcEval_loop at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:8141:14 26: bcEval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:7524:16 27: Rf_eval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:1167:8 28: R_execClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2398:22 29: applyClosure_core at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2311:16 30: Rf_applyClosure at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:2333:16 31: Rf_eval at /Volumes/Builds/R4/R-4.4.1/src/main/eval.c:1285:12 32: libr::r::Rf_eval at /Users/davis/files/programming/positron/ark/crates/libr/src/functions.rs:31:21 33: harp::exec::try_eval::{{closure}} at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:105:41 34: harp::exec::try_catch::callback at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:209:31 35: R_withCallingErrorHandler at /Volumes/Builds/R4/R-4.4.1/src/main/errors.c:2579:16 36: libr::r::R_withCallingErrorHandler at /Users/davis/files/programming/positron/ark/crates/libr/src/functions.rs:31:21 37: harp::exec::try_catch::{{closure}} at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:281:9 38: harp::exec::top_level_exec::callback at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:351:31 39: R_ToplevelExec at /Volumes/Builds/R4/R-4.4.1/src/main/context.c:804:2 40: libr::r::R_ToplevelExec at /Users/davis/files/programming/positron/ark/crates/libr/src/functions.rs:31:21 41: harp::exec::top_level_exec at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:354:14 42: harp::exec::try_catch at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:280:20 43: harp::exec::try_eval at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:105:19 44: harp::exec::RFunction::call_in at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:97:9 45: harp::exec::RFunction::call at /Users/davis/files/programming/positron/ark/crates/harp/src/exec.rs:92:9 46: harp::table::df_dim at /Users/davis/files/programming/positron/ark/crates/harp/src/table.rs:100:20 47: harp::table::df_info at /Users/davis/files/programming/positron/ark/crates/harp/src/table.rs:69:20 48: harp::table::table_info at /Users/davis/files/programming/positron/ark/crates/harp/src/table.rs:30:16 49: ark::data_explorer::r_data_explorer::table_info_or_bail at ./src/data_explorer/r_data_explorer.rs:1140:5 50: ark::data_explorer::r_data_explorer::RDataExplorer::r_get_shape at ./src/data_explorer/r_data_explorer.rs:588:24 51: ark::data_explorer::r_data_explorer::RDataExplorer::update::{{closure}} at ./src/data_explorer/r_data_explorer.rs:341:41 52: ark::r_task::r_task at ./src/r_task.rs:150:16 53: ark::data_explorer::r_data_explorer::RDataExplorer::update at ./src/data_explorer/r_data_explorer.rs:341:31 54: ark::data_explorer::r_data_explorer::RDataExplorer::execution_thread at ./src/data_explorer/r_data_explorer.rs:251:31 55: ark::data_explorer::r_data_explorer::RDataExplorer::start::{{closure}} at ./src/data_explorer/r_data_explorer.rs:194:21 56: std::sys_common::backtrace::__rust_begin_short_backtrace at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/sys_common/backtrace.rs:155:18 57: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/thread/mod.rs:542:17 58: as core::ops::function::FnOnce<()>>::call_once at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/panic/unwind_safe.rs:272:9 59: std::panicking::try::do_call at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:559:40 60: ___rust_try 61: std::panicking::try at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:523:19 62: std::panic::catch_unwind at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panic.rs:149:14 63: std::thread::Builder::spawn_unchecked_::{{closure}} at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/thread/mod.rs:541:30 64: core::ops::function::FnOnce::call_once{{vtable.shim}} at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/ops/function.rs:250:5 65: as core::ops::function::FnOnce>::call_once at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/alloc/src/boxed.rs:2063:9 66: as core::ops::function::FnOnce>::call_once at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/alloc/src/boxed.rs:2063:9 67: std::sys::pal::unix::thread::Thread::new::thread_start at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/sys/pal/unix/thread.rs:108:17 68: __pthread_joiner_wake note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace thread 'test_live_updates' panicked at crates/ark/tests/data_explorer.rs:1004:90: called `Result::unwrap()` on an `Err` value: Timeout failures: test_live_updates test result: FAILED. 13 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.45s ```
``` test test_invalid_filters_preserved ... ok Error: attempt to apply non-function Warning: stack imbalance in 'vapply', 41 then 48 Warning: stack imbalance in '<-', 37 then 44 *** caught segfault *** address 0x0, cause 'invalid permissions' Traceback: 1: format_traceback(calls, rust_like = TRUE) 2: try_catch_handler(list(message = "bad value", call = NULL)) 3: h(simpleError(msg, call)) 4: .handleSimpleError(function (cond) .Internal(C_tryCatchHelper(addr, 1L, cond)), "bad value", base::quote(NULL)) Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Warning: stack imbalance in 'lazyLoadDBfetch', 47 then 46 Warning: stack imbalance in '<-', 44 then 43 Warning: stack imbalance in '{', 33 then 39 Warning: stack imbalance in '<-', 28 then 34 Warning: stack imbalance in '{', 24 then 30 Selection: Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error: no more error handlers available (recursive errors?); invoking 'abort' restart ```

Below is my best guess as to what is happening


Normally when you run an r_task() we send it to the main R thread through RMain, the task is run on the main R thread, and is then sent back to whatever thread called it.

But when we are running tests, there isn't a real RMain available, so instead we have this escape hatch that we set during tests https://github.com/posit-dev/ark/blob/35215ac5309739840e372bbd79f4662fe43517f3/crates/ark/src/r_task.rs#L149

That just runs any r_task() immediately on whatever the current thread is. This normally works fine because the thread we run the test in (i.e. the thread r_task() is called from) is typically the main R thread too.

But this isn't the case for the data explorer tests. The data explorer starts up its own data-explorer specific thread, and that thread is calling r_task()s, i.e. we aren't on the same thread as R anymore. This definitely opens the door for weirdness that would result in obscure errors like the ones we are seeing.

https://github.com/posit-dev/ark/blob/35215ac5309739840e372bbd79f4662fe43517f3/crates/ark/src/data_explorer/r_data_explorer.rs#L174-L176

The solution is to somehow create a "real enough" RMain that we can use for testing that always lives on the main R thread, and that we can send r_task()s too. This will bring us closer to what is actually happening in production too, which would be nice. It should also open the door to handle https://github.com/posit-dev/positron/issues/2791.

DavisVaughan commented 3 months ago

This theory is further supported by the fact that @dfalbel thinks the test failures started after https://github.com/posit-dev/ark/pull/374/, where the tests for Data Explorer used to run sequentially, but were then broken up into smaller chunks that run in parallel. This likely causes them to interact with each other badly.

He also notes that he was able to use this to prove that if you force it to run sequentially then all of the problems magically go away

cargo test --test data_explorer -- --test-threads=1
dfalbel commented 3 months ago

I tried re-adding all tests to a single test call in https://github.com/posit-dev/ark/tree/bugfix/execute-de-tests-serially but it didn't work. It actually just takes a little longer to crash, so probably running tests in parallel make the problem worst, but the problem is still there as the data explorer service is launching r tasks from a different thread.

jonvanausdeln commented 3 months ago

Is this issue being actively worked? If so, can someone take ownership and move state?

dfalbel commented 3 months ago

I'll let @DavisVaughan decide if https://github.com/posit-dev/ark/pull/465 is enough for this. But AFAICT the scope here is larger, so I think the current status (backlog for the 2024.09 milestone) is correct.

DavisVaughan commented 3 months ago

Current state is correct. It's a high priority post conf

DavisVaughan commented 2 months ago

Daniel's fix in https://github.com/posit-dev/ark/pull/465 seems to have been enough to stop the sporadic test failures for the most part, so this is no longer quite so high priority.

But we still are definitely executing r_task()s off the R main thread, and that isn't great. So we still would like to create an RMain that lives on the main R thread and can be used for testing purposes when we come back to this.

lionel- commented 2 months ago

Edit: Sorry @DavisVaughan, I missed the core of your diagnostic: r_task() is called from another thread than the thread where tests are executed, and there is no synchronisation at all in that specific case.

But I think we don't need RMain to fix things here, we just need a bit more synchronisation of threads.

Executing off the main thread is unsafe in some cases (code that makes assumptions about which thread is running) but it's mostly safe if accesses to the R API are synchronised by a mutex, which is the case in unit tests (if they run within a `r_test() `context). The error reported above about a bookkeeping error in the protection stack suggests a concurrency issue where we are running code out of order, perhaps because we haven't been careful about makin sure protect and unprotect are paired correctly, or because some code is accessing the R API outside of an R task. Looking at Daniel's fix, it looks like we only mitigate the problem by reducing the number of concurrent threads, so the problem must still be there. Yesterday @kevinushey encountered a sporadic crash in the data explorer thread again (https://github.com/posit-dev/ark/actions/runs/10815578420/job/30004790701), which could have the same cause as the problems discussed here (hopefully): ``` process didn't exit successfully: `/home/runner/work/ark/ark/target/debug/deps/data_explorer-9c0079d3ec061369` (signal: 11, SIGSEGV: invalid memory reference) ```
DavisVaughan commented 2 months ago

it looks like we only mitigate the problem by reducing the number of concurrent threads, so the problem must still be there.

It is definitely still there!