Occasional VM crash due to race condition between deferred prepare and garbage collection

scouten commented 7 years ago

This is admittedly an unlikely condition in real-world usage, but I have observed an occasional BEAM crash in the esqlite NIF code when the following sequence of events occurs in quick succession:

A database is created using esqlite:open/4.
A statement is prepared using esqlite:prepare/4.
The database resource gets garbage collected.

I added some debug logging to the native code which suggests the above sequence can (though doesn't always) expand into the following sequence:

(Scheduler thread) Database created. It creates the command-processing thread.
(Scheduler thread) esqlite_prepare is called. It queues up a cmd_prepare event.
(Scheduler thread) destruct_esqlite_connection is called. It queues up a cmd_stop event and waits for the command-processing thread to stop.
(Command processor thread) Receives the cmd_prepare event and calls do_prepare.
(Command processor thread) do_prepare calls enif_keep_resource on the database connection resources and subsequently returns a reference to a statement resource.
(Command processor thread) Processes cmd_stop event and exits.
(Scheduler thread) destruct_esqlite_connection is unblocked and finishes tearing down data structures.
(Scheduler thread) destruct_esqlite_statement gets called. It calls enif_release_resource on the database resource.
(Scheduler thread) destruct_esqlite_connection gets called a second time for the same resource cited in step #3 above. The data structures were released at step 7 so the pointers are now invalid. Crash with an assertion failure or segfault.

Here's an example from a recent test run:

esqlite_start
  conn = 0000000014902e60
  queue = 0000000014d800b0

esqlite_prepare called on conn 0000000014902e60
do_prepare
  conn = 0000000014902e60
  sql = PRAGMA foreign_keys

destruct_esqlite_connection
  conn = 0000000014902e60
do_prepare calling keep_resource on conn 0000000014902e60
destruct_esqlite_statement calling release_resource on conn 0000000014902e60

destruct_esqlite_connection
  conn = 0000000014902e60
Assertion failed: (queue->message == NULL && "Attempting to send multiple messages."), function queue_send, file c_src/queue.c, line 173.
Abort trap: 6

Note that segfaults and other assertions have been observed as well, presumably because the precise failure mode can vary depending on how the data structures have been altered since being released.

Also: Line numbers may be slightly different from current master branch because I've added instrumentation.

The crash happens about 20-30% of the time if I run the test suite for https://github.com/scouten/sqlite_ecto2/pull/29. I'm attempting to derive a simpler test case for it, but haven't succeeded in stripping it down yet. If I do, I'll add that as a comment here.

I may also attempt a fix, but NIFs are relatively new to me.

scouten commented 7 years ago

Credit to http://erlang.org/pipermail/erlang-questions/2011-February/056282.html for suggestions on how this scenario might happen.

scouten commented 7 years ago

/cc @fishcakez @tchoutri: If you know of anyone who could help propose a fix, this would be a big help to the sqlite_ecto2 effort (on which I've made quite a bit of progress lately).

scouten commented 7 years ago

I think that this could be made more reproducible by introducing an artificial delay in the do_prepare function and killing the OTP thread during that delay. Out of time to experiment with this for today, but I'll try that approach when I can look into this again.

mmzeeman commented 7 years ago

Good one. I'll take a look at it.

Did you use the latest version? About a month ago I fixed another crash which happened in a similar way but without prepared statements.

scouten commented 7 years ago

Well, it turns out I wasn't using the latest version. I was using the 0.2.1 release from hex.pm, which corresponds to June 2015. I updated to your latest. On the third test run, it failed with a new error:

/include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
Abort trap: 6

scouten commented 7 years ago

https://github.com/scouten/sqlite_ecto2/pull/41 bumps the version reference to your current master branch.

Ignore the errors about Ecto.Integration.TransactionTest.UniqueError being raised. That's a known and unrelated issue.

scouten commented 7 years ago

FWIW I've run it about 20 more times and have not seen this error recur. My suspicion is that the December 2016 fix made the race condition more difficult to reach, but not impossible.

If I read that code correctly, at https://github.com/mmzeeman/esqlite/commit/3d546ff8a46164724ac3018af8ae17b9d7498a5c#diff-78ab23ac1ccd798d73af2a75e4b49fbeR227, you are waiting until the queue is empty before destroying the queue. However, there is still the possibility that the command processing thread would pop the prepare command (https://github.com/mmzeeman/esqlite/blob/master/c_src/esqlite3_nif.c#L676) and then the Erlang scheduler thread would clear the remaining commands (including cmd_stop) while the do_prepare command is still running in the command processing thread.

scouten commented 7 years ago

I've run this ~100 times since (great entertainment during a meeting!) and have only seen it crash once (the ethr_mutex_lock assertion failure I mentioned earlier). I'll merge this change since it seems to improve things, but am still concerned that it's not quite 100% stable.

Thanks for your attention!

scouten commented 7 years ago

100x success on my machine. And then two builds failed on Travis with segfault: https://travis-ci.org/scouten/sqlite_ecto2/builds/203920437

mmzeeman commented 7 years ago

Thanks for reporting back. I hope to find some time in the evening to check out the mutex problem.

mmzeeman commented 7 years ago

I think a reference to the connection should be saved before the prepare event is sent. This should prevent the destruction of the connection while there are prepare events on the queue. It could be that I need to add some extra locking to prevent this from happening.

BTW, where is the code which triggers this race-condition?

scouten commented 7 years ago

@mmzeeman The code that triggers this is in the branch for https://github.com/scouten/sqlite_ecto2/pull/41. As mentioned earlier, I've tried but not yet succeeded in developing a narrower test harness.

A reference to the connection is being saved. What I think is happening is that the OTP process that owns that reference is itself getting killed at an inopportune moment, thus allowing the destruction to happen in an undesirable sequence.

mmzeeman commented 7 years ago

The extra reference to the connection is now made in the command processor thread. It could be that the connection is garbage collected and the stop event is put on the queue... If that happens first the connection is taken down and then later the prepared statement is constructed, and destructed again. This is not how it should work.

If the reference to the connection is made before it is put on the queue, the connection can't be garbage collected while there is a prepare event on the queue. I think this should fix the problem. The prepared statements will then always be destructed before the connection is destructed.

scouten commented 7 years ago

That sounds right. I'll take a look when I can, probably this evening (US Pacific time).

scouten commented 7 years ago

Still seeing a crash, unfortunately. I think this may be a different issue that was unmasked by the previous fix.

I believe there's a problem with the call to enif_thread_join at line 220 of esqlite3_nif.c. I'm sort of baffled by this finding; I would have expected that call to be battle-tested and robust, but I'll show you how I got to that conclusion.

I built a stripped-down version of my test harness to demonstrate. For purposes of this comment, the changes I made at https://github.com/scouten/crash_esqlite_case/commit/d800062eaf7455cacd808d513c6c151e3c2d65e7?w=1 (sorry for whitespace noise; my editor strips trailing space by default and I didn't catch it) will be important. You may find those useful in debugging. Basically I've added console output to all of the calls relevant to the conn and queue data structures so I can observe how the two threads interact.

(If you don't mind installing Elixir 1.3.x or later, the try-to-crash.sh script at the root of that repo is pretty effective. It re-runs the test script 100 times, but almost always crashes within the first 10 iterations on my machine.)

What follows is a transcript of a test run on my machine that crashed. There were many other threads involved but I've stripped this down to the calls relevant to the thread pair that crashed:

esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0
esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread START
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0 command thread 000000001e8801e0 created
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_prepare calling keep_resource on conn 000000001e4077b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
do_prepare conn = 000000001e4077b0 sql = PRAGMA foreign_keys
do_prepare conn = 000000001e4077b0 DONE
destruct_esqlite_statement release_resource conn = 000000001e4077b0
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 waiting for cmd thread 000000001e8801e0 to finish
   # --- point A
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 draining remaining commands
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue is empty
queue_destroy q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue destroyed
destruct_esqlite_connection conn = 000000001e4077b0 DONE
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
   # --- point B
../include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
./try-to-crash.sh: line 8: 16231 Abort trap: 6           mix test --no-deps-check

At point A, I would have expected to see esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread EXIT but it didn't happen. Note that I do see that message, as expected, in test runs that do not crash. I also see it in the crashing test runs, but not for the thread pair that crashes.

At point B, it appears that the command processing thread apparently continued to run and called queue_pop with a now-dangling pointer and crashed.

From this report I believe that enif_thread_join incorrectly returns before the command-processing thread has actually terminated. In the documentation for erl_drv_thread_join (for which enif_thread_join is an alias), I do note the following warning ...

A thread can only be joined once. The behavior of joining more than once is undefined, an emulator crash is likely.

... but I don't have any evidence that there is an extra join call happening.

FWIW, my system information:

Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.3.4

Thanks for your attention so far and hope all of this information helps!

mmzeeman commented 7 years ago

Thanks for the detailed information. This is a really strange result indeed. I don't have any experience with elixer though. I will try to replicate the problem in plain erlang.

Could you printf to stderr instead of stdout? Sometimes stdout can be buffered, that sometimes causes printf statements to come out in the wrong order.

mmzeeman commented 7 years ago

I installed elixer and can now see crashes. I've upgraded sqlite, and inspecting core dumps. It almost looks like it is an issue inside sqlite itself.

Thread 29:: esqlite_connect
0   libsystem_kernel.dylib          0x00007fffd203ec86 __psynch_cvwait + 10
1   beam.smp                        0x00000000154698f6 ethr_cond_wait + 6
2   esqlite3_nif.so                 0x00000001b0cce2c9 esqlite_connection_run + 89 (esqlite3_nif.c:694)

Thread 30 Crashed:: esqlite_connect
0   libsystem_pthread.dylib         0x00007fffd21254fc pthread_mutex_lock + 0
1   esqlite3_nif.so                 0x00000001b0ce9875 pthreadMutexEnter + 21 (sqlite3.c:23358)
2   esqlite3_nif.so                 0x00000001b0cd0d14 sqlite3_mutex_enter + 36 (sqlite3.c:22809)
3   esqlite3_nif.so                 0x00000001b0cdbfcd columnName + 109 (sqlite3.c:76920)
4   esqlite3_nif.so                 0x00000001b0cdbf54 sqlite3_column_name + 36 (sqlite3.c:76938)
5   esqlite3_nif.so                 0x00000001b0ccef3b do_column_names + 187 (esqlite3_nif.c:581)
6   esqlite3_nif.so                 0x00000001b0cce635 evaluate_command + 341 (esqlite3_nif.c:659)
7   esqlite3_nif.so                 0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

Thread 31:: esqlite_connect
0   esqlite3_nif.so                 0x00000001b0ce4930 setupLookaside + 352 (sqlite3.c:140082)
1   esqlite3_nif.so                 0x00000001b0ce6990 openDatabase + 1552 (sqlite3.c:142436)
2   esqlite3_nif.so                 0x00000001b0ce6376 sqlite3_open + 38 (sqlite3.c:142489)
3   esqlite3_nif.so                 0x00000001b0cce832 do_open + 162 (esqlite3_nif.c:272)
4   esqlite3_nif.so                 0x00000001b0cce537 evaluate_command + 87 (esqlite3_nif.c:645)
5   esqlite3_nif.so                 0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

mmzeeman commented 7 years ago

Aha... there is a reference counting problem to the connection. Finally I have a minimal crash example.

maas$ erl -pa ebin -pa deps/*/ebin
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3  (abort with ^G)
1> {ok, Db1} = esqlite3:open(":memory:").
{ok,{connection,#Ref<0.0.4.31>,<<>>}}
2> esqlite3:q("create table test(one, two);", Db1).
[]
3> {ok, Stmt} = esqlite3:prepare("select * from test", Db1).
{ok,<<>>}
4> f(Db1).
ok
5> erlang:garbage_collect().
true
6> esqlite3:column_names(Stmt).
Segmentation fault: 11 (core dumped)

I expected that internally a reference to the connection was kep by enif_keep_resource in the prepare call. What happens in practice is that the connection is destructed while there are still prepared statements floating around.

scouten commented 7 years ago

That example sounds a lot like what I've been seeing. I'll take a look at this in the morning. Thank you again!

mmzeeman commented 7 years ago

I'm working on a fix... I want to store the connection variable together with the prepared statement in esqlite3.erl. Unfortunately I didn't have the time to finish it yesterday.

scouten commented 7 years ago

OK, no problem. Thanks for the update.

mmzeeman commented 7 years ago

I've pushed a fix (hopefully).

scouten commented 7 years ago

Fix looks great. Thank you very much!

Fixed by 3f1ef40b9011276eb8bdc366c5ef1e25d79befa5.

mmzeeman commented 7 years ago

Thanks for your patience... and detailed responses. That really helped.

scouten commented 7 years ago

I think I spoke too soon. Still seeing segfaults, albeit more rarely. Symptoms are new and different. I'm working on narrowing down a case and I'll file a new issue when I have something reproducible.

mmzeeman / esqlite

Occasional VM crash due to race condition between deferred prepare and garbage collection #33