Closed DavisVaughan closed 1 month ago
This theory is further supported by the fact that @dfalbel thinks the test failures started after https://github.com/posit-dev/ark/pull/374/, where the tests for Data Explorer used to run sequentially, but were then broken up into smaller chunks that run in parallel. This likely causes them to interact with each other badly.
He also notes that forcing the tests to run sequentially makes all of the problems go away:
cargo test --test data_explorer -- --test-threads=1
I tried re-adding all tests to a single test call in https://github.com/posit-dev/ark/tree/bugfix/execute-de-tests-serially but it didn't work. It just takes a little longer to crash, so running tests in parallel probably makes the problem worse, but the problem is still there because the data explorer service is launching R tasks from a different thread.
Is this issue being actively worked? If so, can someone take ownership and move state?
I'll let @DavisVaughan decide if https://github.com/posit-dev/ark/pull/465 is enough for this. But AFAICT the scope here is larger, so I think the current status (backlog for the 2024.09 milestone) is correct.
Current state is correct. It's a high priority post conf
Daniel's fix in https://github.com/posit-dev/ark/pull/465 seems to have been enough to stop the sporadic test failures for the most part, so this is no longer quite so high priority.
But we are definitely still executing `r_task()`s off the main R thread, and that isn't great. So we would still like to create an `RMain` that lives on the main R thread and can be used for testing purposes when we come back to this.
Edit: Sorry @DavisVaughan, I missed the core of your diagnostic: `r_task()` is called from a thread other than the one where tests are executed, and there is no synchronisation at all in that specific case.
But I think we don't need `RMain` to fix things here, we just need a bit more synchronisation of threads.
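To make the "a bit more synchronisation" idea concrete, here is a minimal sketch, assuming the simplest possible approach: a global lock that every immediately-run task must hold. The names `R_LOCK`, `r_task_serialized` and `run_concurrent_tasks` are hypothetical and are not taken from ark, and this is not necessarily what https://github.com/posit-dev/ark/pull/465 does. The counters exist only to show that no two tasks overlap.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

// Hypothetical global lock guarding every task that touches R.
static R_LOCK: Mutex<()> = Mutex::new(());

// Instrumentation only: how many tasks are running right now, and the
// largest number ever observed running at the same time.
static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);
static MAX_IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);

// Runs the closure on the calling thread, but only while holding the lock,
// so tasks from different threads can never overlap.
fn r_task_serialized<T>(f: impl FnOnce() -> T) -> T {
    // Recover the guard even if a previous task panicked while holding it.
    let _guard = R_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let now = IN_FLIGHT.fetch_add(1, Ordering::SeqCst) + 1;
    MAX_IN_FLIGHT.fetch_max(now, Ordering::SeqCst);
    let out = f();
    IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
    out
}

// Spawns `n` threads that each run one task, and returns the sum of results.
fn run_concurrent_tasks(n: usize) -> usize {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            thread::spawn(move || {
                r_task_serialized(|| {
                    thread::yield_now(); // give other threads a chance to race
                    i * 2
                })
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let sum = run_concurrent_tasks(8);
    println!("sum = {sum}");
    println!("max tasks in flight = {}", MAX_IN_FLIGHT.load(Ordering::SeqCst));
}
```

Note that a lock like this only prevents tasks from running concurrently. They still execute on whichever thread called them, not on the main R thread, so it addresses the race but not the off-thread execution discussed in this issue.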
It looks like we only mitigate the problem by reducing the number of concurrent threads, so the problem must still be there.
It is definitely still there!
Related to https://github.com/posit-dev/positron/issues/2791
We are currently having sporadic failures in our Ark tests related to the Data Explorer https://github.com/posit-dev/ark/actions/runs/10219354910/job/28277437765#step:6:529
I believe this is only a testing problem, but it is still high priority to fix, because sporadic failures are quite bad. I've marked this as one of the few things we definitely want to look at right after conf.
The failure has been somewhat random, but always in the data explorer. The easiest way to reproduce is to run this many times locally
A failure looks something like one of these
Below is my best guess as to what is happening
Normally when you run an `r_task()` we send it to the main R thread through `RMain`, the task is run on the main R thread, and is then sent back to whatever thread called it.
But when we are running tests, there isn't a real `RMain` available, so instead we have this escape hatch that we set during tests: https://github.com/posit-dev/ark/blob/35215ac5309739840e372bbd79f4662fe43517f3/crates/ark/src/r_task.rs#L149
That just runs any `r_task()` immediately on whatever the current thread is. This normally works fine because the thread we run the test in (i.e. the thread `r_task()` is called from) is typically the main R thread too.
But this isn't the case for the data explorer tests. The data explorer starts up its own data-explorer specific thread, and that thread is calling `r_task()`s, i.e. we aren't on the same thread as R anymore. This definitely opens the door for weirdness that would result in obscure errors like the ones we are seeing.
https://github.com/posit-dev/ark/blob/35215ac5309739840e372bbd79f4662fe43517f3/crates/ark/src/data_explorer/r_data_explorer.rs#L174-L176
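The effect described above can be illustrated without R at all. This is a minimal sketch in which `r_task_immediate` is a hypothetical stand-in for the test-only escape hatch (it is not ark's actual code): it runs the closure on whatever thread calls it, and the two helpers report which thread the task really ran on.

```rust
use std::thread::{self, ThreadId};

// Hypothetical stand-in for the test escape hatch: no dispatch to a main
// thread, the closure simply runs on the calling thread.
fn r_task_immediate<T>(f: impl FnOnce() -> T) -> T {
    f()
}

// Calls the task directly from the current thread.
// Returns (caller thread, thread the task ran on).
fn observe_direct() -> (ThreadId, ThreadId) {
    let caller = thread::current().id();
    let ran_on = r_task_immediate(|| thread::current().id());
    (caller, ran_on)
}

// Calls the task from a freshly spawned worker, the way a service with its
// own thread would. Returns (caller thread, thread the task ran on).
fn observe_from_worker() -> (ThreadId, ThreadId) {
    let caller = thread::current().id();
    let ran_on = thread::spawn(|| r_task_immediate(|| thread::current().id()))
        .join()
        .unwrap();
    (caller, ran_on)
}

fn main() {
    let (caller, ran_on) = observe_direct();
    println!("direct call stays on the caller's thread: {}", caller == ran_on);

    let (caller, ran_on) = observe_from_worker();
    println!("worker call stays on the caller's thread: {}", caller == ran_on);
}
```

In the direct case the task runs on the thread that owns the test, which is the situation where the escape hatch is harmless. In the worker case the task runs on a different thread, which mirrors what the data explorer thread does in the tests.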
The solution is to somehow create a "real enough" `RMain` for testing that always lives on the main R thread and that we can send `r_task()`s to. This will also bring us closer to what actually happens in production, which would be nice. It should also open the door to handling https://github.com/posit-dev/positron/issues/2791.
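As a rough sketch of what a "real enough" `RMain` could look like, assuming a plain channel-based design: one dedicated thread owns a task queue, and any other thread sends closures to it and blocks until the result comes back. `FakeRMain`, `TaskSender` and their methods are hypothetical names for illustration only, and nothing here initialises or calls R.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle, ThreadId};

// A task is any closure that can be shipped to the owner thread.
type Task = Box<dyn FnOnce() + Send + 'static>;

// Hypothetical test double: a single thread that runs every task it is sent.
struct FakeRMain {
    tx: Sender<Task>,
    handle: JoinHandle<()>,
    thread_id: ThreadId,
}

// Cloneable handle that other threads use to submit tasks.
#[derive(Clone)]
struct TaskSender(Sender<Task>);

impl FakeRMain {
    fn start() -> Self {
        let (tx, rx) = mpsc::channel::<Task>();
        // The owner thread runs tasks one at a time until all senders are gone.
        let handle = thread::spawn(move || {
            for task in rx {
                task();
            }
        });
        let thread_id = handle.thread().id();
        FakeRMain { tx, handle, thread_id }
    }

    fn sender(&self) -> TaskSender {
        TaskSender(self.tx.clone())
    }

    // Drops our sender and waits for the owner thread to drain and exit.
    // Any other outstanding `TaskSender` must be dropped first.
    fn shutdown(self) {
        drop(self.tx);
        self.handle.join().unwrap();
    }
}

impl TaskSender {
    // Sends the closure to the owner thread and blocks for its result.
    fn r_task<T, F>(&self, f: F) -> T
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.0
            .send(Box::new(move || {
                let _ = reply_tx.send(f());
            }))
            .expect("owner thread has exited");
        reply_rx.recv().expect("task was dropped before replying")
    }
}

fn main() {
    let rmain = FakeRMain::start();
    let owner = rmain.thread_id;

    // A task submitted from a separate worker still runs on the owner thread.
    let sender = rmain.sender();
    let ran_on = thread::spawn(move || sender.r_task(|| thread::current().id()))
        .join()
        .unwrap();
    println!("ran on owner thread: {}", ran_on == owner);

    rmain.shutdown();
}
```

With this shape, the thread that calls `r_task` no longer matters: every task is funnelled onto one thread and executed in order, which is closer to the production flow described above than running tasks in place.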