Bug: secondary redbean SQLite WAL connections lock after first write

goldenstein64 commented 3 months ago

Contact Details

goldnstein64@gmail.com

What happened?

If two processes are connected to the same database with WAL enabled, a write operation on one connection will block all other connections until the writer disconnects.

Here is a reproduction of the issue along with its output, executed with redbean -F FILE -i

unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
  ddl_db:close()
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()
unix.exit()

child:  0
child:  not an error
parent: 15
parent: locking protocol

The parent: messages appear after ~13 seconds. Making the write explicitly finish with BEGIN TRANSACTION / END TRANSACTION or COMMIT doesn't alter the program's behavior. A more obvious repro would probably be creating a table in the ddl_db block and having both processes insert values into the table.

The expected output for this program would likely be:

child:  0
child:  not an error
parent: 0
parent: not an error

Version

redbean 2.2

What operating system are you seeing the problem on?

Windows

Relevant log output

> ./redbean.com -F test.lua -i --strace
SYS  29048            714'348 bell system five system call support 430 magnums loaded on the new technology
SYS  29048            808'858 __morph_begin()
SYS  29048          1'433'336 __morph_end()

> ./redbean.com -F test.lua -i --ftrace
error: --ftrace failed to open symbol table
child:  0
child:  not an error
parent: 15
parent: locking protocol

pkulchenko commented 3 months ago

Thank you for the test case to reproduce the issue. I ran into the same issue some time ago and even posted my example to the SQLite forum, but given that cosmo implementation on Windows is different from the "original" sqlite one (as it's using the same code that runs on unix), there wasn't much response.

I can provide an additional case that may help with troubleshooting. Try commenting out the removal of the file on line one and then the "close" call will make all the difference between successful and failing cases:

--unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
--  ddl_db:close() -- succeeds without close() and fails with close() call
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()

print(pname, "done")
if pid ~= 0 then unix.wait(pid) end -- added to reap zombies

unix.exit()

Somehow calling close() affects the structures that the process has before forking, which "breaks" the locking mechanism. I tried to troubleshoot it some time ago and even compare with the same code that succeeds on Linux, but couldn't find the culprit.

I'm open to ideas/suggestions, as it affects one of my projects.

ncruces commented 3 months ago

You're getting an SQLITE_PROTOCOL error, which points to SQLite being blocked on a WAL lock that SQLite never expects to be held for very long (so the busy handler is not involved).

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned.

If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

goldenstein64 commented 1 month ago

I think the bug got fixed in 3.0.0? The repro now prints the expected output.

Although I should note that the some.db file had to be regenerated, which may be an issue for people who are adapting a redbean 2.2 project (which wouldn't be working great anyway since this bug would manifest in them already).

goldenstein64 commented 1 month ago

Nevermind, I just realized that ddl_db:close() was commented out in my version of the repro file... oops

pkulchenko commented 1 month ago

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned. If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

@ncruces, thank you for the details. Do you have any further suggestions on what to check or how to troubleshoot this? It does work for me without closing the DB, so there is something that happens as part of this process that affects the subsequent processing, but I so far has not been able to find what that may be.

ncruces commented 1 month ago

No. I will note that what you're trying to do here is fraught with issues.

I hope you've read sqlite.org/howtocorrupt.html, especially the entire section 2; also this.

jart / cosmopolitan