Selecting a different DB doesn't release the cache directory lock on the original DB

henrymercer commented 4 years ago

Describe the bug Selecting a new DB doesn't release the cache directory lock on the originally selected DB, causing query runs on the originally selected DB from the CLI to fail.

To Reproduce

Add two databases, database A and database B to the databases view in the CodeQL extension using the "+" button
Select database A
Run a query on database A
Select database B
Using the CodeQL CLI run a query on database A
Observe that database A is still locked:

A fatal error occurred: Error initializing the IMB disk cache: the cache directory is already locked by another running process. Only one instance of the IMB can access a cache directory at a time. The lock file is located at /Users/henry/databases/ppppbn_boardgame_ff33297/db-javascript/default/cache/.lock (eventual cause: OverlappingFileLockException)

Expected behavior Selecting database B should unlock the cache directory of database A, so that the query from the CLI succeeds.

Additional context The cache directory of database A will be unlocked when a query is run from the extension on database B.

jcreedcmu commented 4 years ago

So do we want additional functionality from the cli to fix this? Does actually running a query on the newly selected database release the old db's lock and acquire the new one?

henrymercer commented 4 years ago

So do we want additional functionality from the cli to fix this?

I'm unfamiliar so am not sure whether this would require additional queryserver functionality.

Does actually running a query on the newly selected database release the old db's lock and acquire the new one?

Once a query has been run from the newly selected database, running a query on the originally selected database from the CodeQL CLI succeeds. The lockfile seems to be preserved, however.

aeisenberg commented 4 years ago

OK. Reproduced your problem. If there is a cli or query server command to remove a process's lock file, that would be what we need. Though, I don't think there is one.

@hmakholm do you have any suggestions on what we can do here?

hmakholm commented 4 years ago

The lock ought to be dropped automatically when the MemoryBackend that refers to the old database is closed. ~If the problem can be reproduced without provoking any particular evaluator error first, it must be that somehow the MemoryBackend doesn't get closed when we change database in the query server -- which is bad for a whole range of reasons.~

hmakholm commented 4 years ago

On the other hand, it doesn't look like the code that creates MemoryBackends ought to be capable of not closing the old backend. ~The alternative would be that FileSystemLock#release() somehow doesn't work.~

Which platform are you reproducing on?

aeisenberg commented 4 years ago

I'm on mac.

aeisenberg commented 4 years ago

Would it make sense for the extension to explicitly delete the lock file when databases change? I guess it would require some sleuthing because the lock file name looks partially generated.

aeisenberg commented 4 years ago

And to be clear, when the extension changes databases, it doesn't actually do anything from the cli/queryserver perspective. It is only updating its internal state so that it can handle future queries.

hmakholm commented 4 years ago

And to be clear, when the extension changes databases, it doesn't actually do anything from the cli/queryserver perspective. It is only updating its internal state so that it can handle future queries.

Ah, that's it; I see what you mean. I'll cancel my panic over something potentially being wrong with the locking implementation. :grin:

Yes I think we should add a "select database" or "database changed" command in the query server protocol so it can close its MemoryBackend eagerly when that happens.

@alexet, you have a better knowledge of the protocol -- is there a way to add a new command in a backward-compatible way, or would we need to have the extension vary its behavior based on the CLI version?

(Note that this would not entail deleting the lock file, but simply relinquishing its OS-level exclusive access to the file. This is safer than having the mere presence of the file mean "locked", because the OS will implicitly drop the lock when the owning process terminates for any reason, so it should be impossible to end up with a stale lock after a process is killed).

alexet commented 4 years ago

The problem with my initial implementation of the command was that due to the queue it was very hard to get a visual indication of whether a database was locked. The following sequence was problematic.

User selects database
user runs a query
query starts compiling
user selects another database
message was sent to close the database but it was already closed.
query finishes compiling
run command is now sent which opens the database.
command finishes but database is kept open in anticipation of new commands.

The protocol was designed to be externally stateless as most of the bugs with the older protocol were to do with disagreement about the state. It was mostly fine when everything was synchronous and serialised but the ability to run more than one query at once caused problems. However problems like these make me wonder if we should have had a more stateful design where the onus would be on the client to ensure that we don't change database while a query is running.

Note that due to the parallelism you can end up doing things like run two queries on two different databases and they end up running in the opposite order to what you ran (because we compile queries in parallel but we avoid opening multiple databases). This makes the system much harder to design.

I am tempted for vscode to run in a mode that always closes the database instantly. This would be a major performance hit but would avoid these problems.

For adding a new command you can kind-of just run the command and get an error back for a missing command but it is probably easier to depend on the cli version.

hmakholm commented 4 years ago

Note that due to the parallelism you can end up doing things like run two queries on two different databases and they end up running in the opposite order to what you ran (because we compile queries in parallel but we avoid opening multiple databases).

Hmm, won't that mean that we have multiple active MemoryBackends each with their own MemoryManager that thinks it can use all the configured space? That doesn't sound good.

alexet commented 4 years ago

No we avoid that, we just compile the queries in parallel before obtaining the backend lock. So whichever query compiles first claims the backend lock which may not be in the order which the user ran the queries.

adityasharad commented 4 years ago

However problems like these make me wonder if we should have had a more stateful design where the onus would be on the client to ensure that we don't change database while a query is running.

A point in defence of your stateless design: we did put this onus on the client in the Visual Studio extension, and I remember it was really hard to maintain the client-side database locking, so we moved the onus to the query server instead.

hmakholm commented 4 years ago

A possible design might be to add messages that let the extension inform the query server of what the currently selected database is. Then the query server would shut down a backend when it (a) is idle, and (b) doesn't match what we have been told is the selected database. Everything else stays as it is now, so if we lose synchronization between the two ends, the behavior degrades gracefully.

aeisenberg commented 4 years ago

So, the extension just needs to message the query server any time the database changes?

And if a request comes in from the cli for a different database, then the lock file will be transparently removed if the query server is idle?

I think that makes sense from my perspective.

hmakholm commented 4 years ago

So, the extension just needs to message the query server any time the database changes?

Well yes, and the query server would need to implement its end of the design :smile:

jcreedcmu commented 4 years ago

Still blocking on lock-releasing or cache-clearing functionality in cli.

hmakholm commented 4 years ago

We should probably discuss at the team meeting tomorrow what the plan is -- I have the impression everybody is waiting for someone else to pick this up.

nickrolfe commented 4 years ago

This has been a minor frustration for me in my recent work, so I'd be delighted if you could fix it.

One additional pain point that I didn't see mentioned: for Windows users, the open lock also prevents you from deleting the database directory.

lcartey commented 4 years ago

This has also been a frustration for me recently as well, particularly with respect to tests:

1) Run a test which fails 2) Use the generated test database 3) Fix the query 4) Switch to a different database 5) Re-run the test

Which then fails.

aeisenberg commented 3 years ago

OK. The solution in #681 is a partial fix. Now when a database is removed from the workspace, the lock is removed as well.

I think we need to add a third query server command (in addition to register and deregister databases), called unlock database. It will be similar to deregister in that the database will be closed, but it will not actually remove the database from the allow-list.

aeisenberg commented 3 years ago

After discussion with @hmakholm and @alexet we've decided that we will not explicitly try to solve this workflow. The workflow that we are solving is that we want to guarantee that if you remove a database from vscode, you can always run a query on it.

If you don't remove a db from vscode, the lock may be released under certain circumstances, but this is not a workflow we are encouraging or documenting.

github / vscode-codeql

Selecting a different DB doesn't release the cache directory lock on the original DB #249