Open goldenstein64 opened 3 months ago
Thank you for the test case to reproduce the issue. I ran into the same issue some time ago and even posted my example to the SQLite forum, but given that cosmo implementation on Windows is different from the "original" sqlite one (as it's using the same code that runs on unix), there wasn't much response.
I can provide an additional case that may help with troubleshooting. Try commenting out the removal of the file on line one and then the "close" call will make all the difference between successful and failing cases:
--unix.rmrf('some.db')
do
local ddl_db = assert(lsqlite3.open('some.db'))
ddl_db:exec('PRAGMA journal_mode=WAL')
-- ddl_db:close() -- succeeds without close() and fails with close() call
end
local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))
-- make the parent sleep for one second
if pid ~= 0 then
unix.nanosleep(1)
end
print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())
-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
unix.nanosleep(15)
end
db:close()
print(pname, "done")
if pid ~= 0 then unix.wait(pid) end -- added to reap zombies
unix.exit()
Somehow calling close()
affects the structures that the process has before forking, which "breaks" the locking mechanism. I tried to troubleshoot it some time ago and even compare with the same code that succeeds on Linux, but couldn't find the culprit.
I'm open to ideas/suggestions, as it affects one of my projects.
You're getting an SQLITE_PROTOCOL
error, which points to SQLite being blocked on a WAL lock that SQLite never expects to be held for very long (so the busy handler is not involved).
See this which is called in a loop here. I'd look for places where WAL_RETRY
is returned.
If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.
I think the bug got fixed in 3.0.0? The repro now prints the expected output.
Although I should note that the some.db
file had to be regenerated, which may be an issue for people who are adapting a redbean 2.2 project (which wouldn't be working great anyway since this bug would manifest in them already).
Nevermind, I just realized that ddl_db:close()
was commented out in my version of the repro file... oops
See this which is called in a loop here. I'd look for places where WAL_RETRY is returned. If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.
@ncruces, thank you for the details. Do you have any further suggestions on what to check or how to troubleshoot this? It does work for me without closing the DB, so there is something that happens as part of this process that affects the subsequent processing, but I so far has not been able to find what that may be.
No. I will note that what you're trying to do here is fraught with issues.
I hope you've read sqlite.org/howtocorrupt.html, especially the entire section 2; also this.
Contact Details
goldnstein64@gmail.com
What happened?
If two processes are connected to the same database with WAL enabled, a write operation on one connection will block all other connections until the writer disconnects.
Here is a reproduction of the issue along with its output, executed with
redbean -F FILE -i
The
parent:
messages appear after ~13 seconds. Making the write explicitly finish withBEGIN TRANSACTION
/END TRANSACTION
orCOMMIT
doesn't alter the program's behavior. A more obvious repro would probably be creating a table in theddl_db
block and having both processes insert values into the table.The expected output for this program would likely be:
Version
redbean 2.2
What operating system are you seeing the problem on?
Windows
Relevant log output