Open farost opened 1 month ago
Analysing the logs and adding context based on my memory:
X in Studio.
After about a minute of running the query, the button is pressed, and Studio logs it:
2024-05-13 17:40:07,923 [Thread-646] [ERROR] c.v.t.s.s.connection.SessionState - [CNX07] TypeDB Connection: Session was closed on TypeDB Server.
I don't remember if we did anything in Studio after that, but I'm pretty sure that we didn't leave any queries running for more than 1.5 minutes. But looking at the Server logs:
17:44:07.883 [typedb-scheduled::0] ERROR com.vaticle.typedb.core.server.TransactionService -- [SRV29] Invalid Server Operation: Transaction exceeded maximum configured duration of '300' seconds.
17:44:37.882 [typedb-scheduled::0] WARN com.vaticle.typedb.core.server.SessionService -- Session with ID d09a5def-9b00-47a1-bcc4-be7f031486c4 timed out due to inactivity
We see that the server canceled a transaction (and the running query) after 300 seconds, so it was still running! It then closed the session 30 seconds later, which could be a sign that we didn't do anything in Studio while discussing things. So it looks like the Studio cancellation did not work, but the transaction timeout did.
Then, after about 20-30 minutes, we ran one long query, canceled it, waited for about 5-10 minutes, ran another long query, canceled it, and ran a potentially quick one right after it. Studio froze right when we clicked the run button to send the fixed quick query.
I've already confirmed that this behavior (Studio freeze) happens when the number of queries the server is still busy with equals the number of CPUs (or the MAX_THREADS available to the app, presumably half the number of CPUs, as I need to run 10 queries on my 10 CPU machine for the next query to be the deadly one) and we send another query.
So it means that the server had 2 long queries running in the background, which had taken over all the threads available for query execution, and it stopped responding to additional queries even though we had sent cancel requests for the previous ones (or at least we thought we did, as we don't see it in the Studio logs).
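The suspected mechanism can be sketched outside TypeDB with a plain fixed-size thread pool. This is a toy model, not the server's code: `POOL_SIZE` is a hypothetical stand-in for the thread limit, and the busy loop stands in for a long query that never checks whether it has been canceled. The client-side `Future` reports the task as canceled, yet both workers stay occupied and a quick task submitted afterwards never gets a thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class PoolStarvationSketch {

    // Hypothetical stand-in for the server's thread limit (2 CPUs in the incident).
    static final int POOL_SIZE = 2;

    /**
     * Fills the pool with long tasks, "cancels" them from the caller's side, then
     * reports whether a quick task submitted afterwards finishes within waitMillis.
     */
    static boolean quickTaskCompletes(long waitMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        AtomicBoolean stop = new AtomicBoolean(false); // only used to clean up the demo
        CountDownLatch started = new CountDownLatch(POOL_SIZE);
        try {
            List<Future<Long>> longQueries = new ArrayList<>();
            for (int i = 0; i < POOL_SIZE; i++) {
                longQueries.add(pool.submit(() -> {
                    started.countDown();
                    long counter = 0;
                    // CPU-bound loop that never looks at the thread's interrupt flag.
                    while (!stop.get()) counter++;
                    return counter;
                }));
            }
            started.await(); // make sure every worker thread is busy

            for (Future<Long> query : longQueries) {
                // The Future is now marked as canceled and the worker is interrupted,
                // but the loop above ignores the interrupt and keeps the thread.
                query.cancel(true);
            }

            // The "quick query": it sits in the queue because no worker is free.
            Future<String> quick = pool.submit(() -> "quick answer");
            try {
                quick.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            stop.set(true);
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("quick task completed: " + quickTaskCompletes(500));
    }
}
```

Running `main` prints `quick task completed: false`. Whether TypeDB's executors behave the same way is exactly what still needs to be confirmed with a debugger.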
These logs:
18:02:45.536 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
18:05:48.471 [typedb-service::0] ERROR com.vaticle.typedb.core.server.TypeDBService -- [RPL01] Replica Error: The replica is not the primary replica.
com.vaticle.typedb.cloud.common.exception.TypeDBCloudException: [RPL01] Replica Error: The replica is not the primary replica.
can signal that we sent 2 long queries at this time, but it's hard to confirm without additional Studio logs. However, the server stopped responding to connection attempts from any clients. Its non-leader replica peers still thought that the Raft leader was alive, so it wasn't completely dead; however, it was at 170.3% CPU usage from java (two cores) and was not responding to the connection commands (that were redirected to it from peers).
After waiting for more time, we decided to stop the servers.
No clear results for now, will try to reproduce it with different clients and debug the server further.
Actually, a more complicated query can make the thread think so hard that it just ignores the transaction timeout...
Initial findings:
The x button in Studio does not close the transaction: a breakpoint in CoreTransaction::close is not triggered.
The n+1th query from Studio won't hit a breakpoint in QueryService::execute.
(typedb-service::<i> and typedb-async-1::<I>)
[1] Why does this time out in the server instead of just waiting forever for a thread to be allocated (as in point 3)?
The query being run from Studio has a large number of answers. Since we batch transaction streaming requests, this will yield and free up a thread to service other requests, such as a transaction open from the TypeDB Diagnostics server. By the time it returns and the Diagnostics server can issue the next request, Studio may well have asked for the next batch and will occupy that thread in the pool past the timeout of the Diagnostics server transaction.
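The interleaving described above can be illustrated with a toy model, assuming a single free worker thread and a FIFO request queue. The request names and the `serviceOrder` helper are hypothetical and do not correspond to TypeDB classes; the sketch only shows why the second Diagnostics request ends up waiting behind the next Studio batch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchInterleavingSketch {

    /** Records the request name, then simulates the time spent servicing it. */
    static void work(List<String> order, String name, long millis) {
        order.add(name);
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns the order in which the single worker thread serviced the requests. */
    static List<String> serviceOrder(long batchMillis) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor(); // the one free thread
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        try {
            // Studio streams a large answer set in batches; each batch is one request.
            Future<?> batch1 = worker.submit(() -> work(order, "studio batch 1", batchMillis));
            // The Diagnostics server opens a transaction while batch 1 is being computed.
            Future<?> open = worker.submit(() -> work(order, "diagnostics transaction open", 0));

            batch1.get();
            // Studio asks for the next batch as soon as the previous one arrives.
            worker.submit(() -> work(order, "studio batch 2", batchMillis));

            open.get();
            // The Diagnostics server can only now send its next request,
            // which queues behind batch 2 and may exceed its own timeout.
            Future<?> next = worker.submit(() -> work(order, "diagnostics next request", 0));
            next.get();
            return new ArrayList<>(order);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serviceOrder(100));
    }
}
```

The printed order is `[studio batch 1, diagnostics transaction open, studio batch 2, diagnostics next request]`: the open slips in between batches, but the follow-up request has to wait for a full batch.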
Description
While executing long queries (sometimes unexpectedly long ones) from different clients (e.g. Studio or Console), a canceled query (X / lightning bolt in Studio or Ctrl + C in Console) is expected to be canceled from the user's point of view. However, if the query takes a long time to execute, it seems not to register the cancel command and continues using system resources.
We initially faced this issue while sending several long data read queries to a server with 2 CPUs through Studio (pressing the X button for each query and thinking that these queries would be canceled), resulting in the server's unavailability due to all its threads being busy with queries and ignoring the connection attempts. After waiting for 20 minutes, we needed to restart the server.
The same can be reproduced in Console if we run multiple queries in parallel transactions or even Ctrl + C them without closing the transactions.
Interesting points of the investigation for now:
schema read in Studio. The resources are immediately freed.
transaction timeout seems to help in these situations after some tests, but there are still suspicions that it sometimes may not.
MAX_THREAD number in the code, so it's a CPU-bound issue.
Right now it looks like a combination of multiple issues from the clients and the server, but the server should be more reliable in this question for sure, so it will be the main issue of the investigation for now.
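For reference, one common way to make a CPU-bound task respond to both a cancel request and a timeout is cooperative cancellation: the worker polls a shared flag between units of work, and a watchdog sets that flag once the maximum duration is exceeded. The sketch below is a generic illustration under that assumption, not TypeDB's implementation; `CancelToken`, `runQuery`, and `runWithTimeout` are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CooperativeCancelSketch {

    /** Hypothetical token shared between the request handler and the worker. */
    static final class CancelToken {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        void cancel() {
            cancelled.set(true);
        }

        boolean isCancelled() {
            return cancelled.get();
        }
    }

    /**
     * Stand-in for a long query: performs up to maxSteps units of work and polls
     * the token between units. Returns the number of units completed.
     */
    static long runQuery(CancelToken token, long maxSteps) {
        long done = 0;
        while (done < maxSteps && !token.isCancelled()) {
            done++; // one unit of work, e.g. one step of a storage iterator
        }
        return done;
    }

    /** Runs the query with a watchdog that cancels it after timeoutMillis. */
    static long runWithTimeout(long maxSteps, long timeoutMillis) {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        CancelToken token = new CancelToken();
        try {
            // The same token could also be cancelled by a client request (X or Ctrl + C).
            watchdog.schedule(token::cancel, timeoutMillis, TimeUnit.MILLISECONDS);
            return runQuery(token, maxSteps);
        } finally {
            watchdog.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long steps = runWithTimeout(Long.MAX_VALUE, 200);
        System.out.println("stopped early: " + (steps < Long.MAX_VALUE));
    }
}
```

The key property is that the flag is checked inside the hot loop. If a query spends a long time in a code path that never polls the flag, neither the cancel nor the timeout takes effect until that path returns, which would match the behavior observed here.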
I'm going to update this issue while investigating the initial case and trying to reproduce it with different setups and clients.
Environment
Reproducible Steps
Set up: Create a database and fill it with a sufficient amount of data. Write a data read query that can run for over 10 minutes. To reproduce it more easily, you can build the server locally and limit the MAX_THREADS number in the code to emulate a machine with a lower number of CPUs.
Execute: Run queries multiple times (try to exceed the number of CPUs you have on the server) and try to cancel them:
a) Studio: run -> X -> run -> X -> ...
b) Console: open multiple tabs, create multiple transactions, run queries in parallel, and press Ctrl + C in each tab.
Unexpected result: Even if the client (Studio or Console) says that the query is canceled, it is not. The resources are still taken and the query is still being executed (there is still RocksDB work if we enter the debug mode).
Expected result: All the queries are canceled, and the work is stopped.
Additional logs
Here are logs from Studio and from Server when we initially hung up the cloud server by running multiple long queries and needed to restart it (the detailed analysis is in the comments).
Studio:
Server (all three nodes combined):