TypeDB has troubles cancelling active queries

farost commented 1 month ago

Description

While executing long queries (sometimes unexpectedly long ones) from different clients (e.g. Studio or Console), a canceled query (X/lightning bolt in Studio or Ctrl + C in Console) is expected to be canceled from the user's point of view. However, if the query takes much time to execute, it seems not to register the cancel command and continue using the system resources.

We initially faced this issue while sending several long data read queries to a server with 2 CPUs through Studio (pressing the X button for each query and thinking that these queries would be canceled), resulting in the server's unavailability due to all its threads being busy with queries and ignorant of the connect attempts. After waiting for 20 minutes, we needed to restart the server.

The same can be reproduced in Console if we run multiple queries in parallel transactions or even Ctrl + C them without closing the transactions.

Interesting points of the investigation for now:

All the queries are canceled if we switch to schema read in Studio. The resources are immediately freed.
The query is canceled if we explicitly close the transaction (and potentially a session) in Console.
TypeDB Option transaction timeout seems to help in these situations after some tests, but there are still suspicions that it sometimes may not.
It looks like it's not reproduced in Console within one transaction, but it may be a special case of Studio to create unique transactions when running queries in the same tab without changing anything.
It can be reproduced on machines with large resources by limiting the MAX_THREAD number in the code, so it's a CPU-bound issue.

Right now it looks like a combination of multiple issues from the clients and the server, but the server should be more reliable in this question for sure, so it will be the main issue of the investigation for now.

I'm going to update this issue while investigating the initial case and trying to reproduce it with different setups and clients.

Environment

TypeDB distribution: Core (same for Cloud)
TypeDB version: 2.28.0
Environment: Mac (same for Google Cloud)
Client and version: Studio 2.28.0 + Console 2.28.0

Reproducible Steps

Set up Create a database and fill it with a sufficient amount of data. Write a data read query that can run for over 10 minutes. To reproduce it more easily, you can build the server locally and limit the MAX_THREADS number in the code to emulate a machine with a lower number of CPUs.
Execute Run queries multiple times (try to exceed the number of CPUs you have on the server) and try to cancel them: a) Studio: run -> X -> run -> X -> ... b) Console: open multiple tabs, create multiple transactions, run queries in parallel, and press Ctrl + C in each tab.
Unexpected result Even if the client (Studio or Console) say that the query is canceled, it is not. The resources are still taken and the query is being executed (there is still RocksDB work if we enter the debug mode).

Expected result

All the queries are canceled, and the work is stopped.

Additional logs

Here are logs from Studio and from Server when we initially hung up the cloud server by running multiple long queries and needed to restart it (the detailed analysis is in the comments).

Studio:

2024-05-13 17:40:07,923 [Thread-646] [ERROR] c.v.t.s.s.connection.SessionState - [CNX07] TypeDB Connection: Session was closed on TypeDB Server.
2024-05-13 18:05:12,651 [Thread-654] [ERROR] c.v.t.s.s.c.TransactionState - [CNX09] TypeDB Connection: Transaction was closed, due to:
[CXN05] The transaction is closed because of the error(s):
[QRY04] Invalid Query Pattern: Invalid query containing unbounded concept variable '$sub1'.
2024-05-13 18:45:08,234 [Thread-0] [INFO ] com.vaticle.typedb.studio.Studio - Closing TypeDB Studio

Server (all three nodes combined):

17:37:33.924 [typedb-scheduled::0] WARN com.vaticle.typedb.core.server.SessionService -- Session with ID 4b8a9ad3-72a3-4f62-93a0-14d89b29bdb9 timed out due to inactivity
17:41:07.442 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
17:42:06.417 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
17:44:07.883 [typedb-scheduled::0] ERROR com.vaticle.typedb.core.server.TransactionService -- [SRV29] Invalid Server Operation: Transaction exceeded maximum configured duration of '300' seconds.
17:44:37.882 [typedb-scheduled::0] WARN com.vaticle.typedb.core.server.SessionService -- Session with ID d09a5def-9b00-47a1-bcc4-be7f031486c4 timed out due to inactivity
18:02:45.536 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
18:05:48.471 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
18:27:14.712 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- 
18:27:14.713 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- 
18:27:14.713 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- Shutting down TypeDB Cloud server...
18:27:14.713 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- Shutting down TypeDB Cloud server...
18:27:14.728 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- 
18:27:14.728 [TypeDBServer::shutdown::0] INFO com.vaticle.typedb.core.server.TypeDBServer -- Shutting down TypeDB Cloud server...
18:27:16.099 [h5r3a0-2.dep.cloud.typedb.com:1729::typedb-cloud-actor::0] INFO com.vaticle.typedb.cloud.replication.raft.role.Candidate -- [replica: _system] - Haven't received heartbeat from leader

farost commented 1 month ago

Analysing the logs and adding context based on my memory:

We accidentally ran a long query and waited for it to finish before pressing X in Studio. After about a minute of running the query, the button is pressed, and Studio logs it:
```
2024-05-13 17:40:07,923 [Thread-646] [ERROR] c.v.t.s.s.connection.SessionState - [CNX07] TypeDB Connection: Session was closed on TypeDB Server.
```
I don't remember if we did anything in Studio after it, but I'm pretty sure that we didn't leave any of queries running for more than 1.5 minutes. But looking at the Server logs:
```
17:44:07.883 [typedb-scheduled::0] ERROR com.vaticle.typedb.core.server.TransactionService -- [SRV29] Invalid Server Operation: Transaction exceeded maximum configured duration of '300' seconds.
17:44:37.882 [typedb-scheduled::0] WARN com.vaticle.typedb.core.server.SessionService -- Session with ID d09a5def-9b00-47a1-bcc4-be7f031486c4 timed out due to inactivity
```
We see that the server canceled a transaction (and the running query) after 300 seconds, so it was still running! And then it closed a session after 30 more seconds, and it could be a sign that we didn't do anything in Studio while discussing things. So it looks like the studio cancelation did not work, but the transaction timeout did work.

Then, after about 20-30 minutes, we ran one long query, canceled it, waited for about 5-10 minutes, ran another long query, canceled it, and ran a potentially quick one right after it. And the Studio froze right when we clicked the run button to send the fixed quick query.

I've already confirmed that this behavior (Studio freeze) happens when the server is too busy with the previous queries whose number is equal to the number of CPUs and we send another query (or MAX_THREADS available for the app, presumably a half of the CPUs number as I need to run 10 queries on my 10 CPU machine to have the next query being the deadliest one).

So it means that the server had 2 long queries running in the background, overtaken all the available threads for query execution, and it stopped responding to additional queries even if we sent cancel requests for the previous ones (or at least we thought we did as we don't see it in the Studio logs).

These logs:

18:02:45.536 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
18:05:48.471 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.

can signal that we sent 2 long queries at this time, but it's hard to confirm without additional Studio logs. However, the server stopped responding to connection attempts from any clients. Its non-leader replica peers still thought that the raft leader was alive, so it wasn't completely dead, however with a 170.3% CPU usage from java (two cores) and without responding to the connection commands (that were redirected to it from peers).

After waiting for more time, we decided to stop the servers.

No clear results for now, will try to reproduce it with different clients and debug the server further.

farost commented 1 month ago

Actually, a more complicated query can make the thread think so hard that it just ignores the transaction timeout...

krishnangovindraj commented 2 weeks ago

Initial findings:

The x button in studio does not close the transaction: A breakpoint in CoreTransaction::close is not triggered.
Studio has a default time-out of 60 minutes: This seems to be respected (Verified by setting a breakpoint where the timeout task is scheduled). This makes me suspect the 300s timeout in the log is from a different transaction[1]
The size of the thread-pool available to TypeDB service is limited to the number of cores on the machine (or maybe twice this number). This causes any further incoming requests to queue up: Running an n+1th query from studio won't hit a breakpoint in QueryService::execute.
The traversal engine (i.e., when inference is not required) respects the transaction close on timeout: The time spent in these threads according to VisualVM stops going up when the transaction is closed (Thread: typedb-service::<i> and typedb-async-1::<I>)
A query with reasoning enabled is not as well behaved - execution continues and the actor threads continue to be used. Since individual tasks are short lived, this does not freeze out other work completely, but can take up a lot of CPU bandwidth.

[1] Why does this timeout in the server instead of just waiting forever for a thread to be allocated (as in point 3)? The query being run from studio has a large number of answers. Since we batch transaction streaming requests, this will yield and free up a thread to service other requests, such as a transaction open from the TypeDB Diagnostics server. By the time it returns and the Diagnostics server can issue the next request, studio may well have asked for the next batch and will occupy that thread in the pool past the timeout of the Diagnostics server transaction.

vaticle / typedb