Chaos Mesh compute-meta-network-partition batch query fails occasionally

Describe the bug

https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/376#018ca450-a95e-4c3b-953d-68d154850aef

This experiment was implemented by @xuefengze.

SCR-20231226-oj6

In this experiment, we applied a network partition from the compute node to the meta node. (We say "from ... to" rather than "between" because we can specify the direction, although the direction seems to have little to do with the following problem.)

The duration is 10 minutes.

SCR-20231226-qg9

We triggered the fault around 12:17:02, so the partition lasts until 12:27:02.

SCR-20231226-qo8

We can see that while the partition exists, select queries execute without a problem.

But when we tried to create table t1, the query was stuck for a long time and only returned an error message at 12:27:03, which is exactly when the partition experiment finished.
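The timing can be checked with a few lines of arithmetic. This is only a sketch: the calendar date is an assumption taken from the screenshot names (SCR-20231226-...), and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Date is assumed from the screenshot names; only the times come from the report.
fault_start = datetime(2023, 12, 26, 12, 17, 2)
partition_duration = timedelta(minutes=10)

# Expected end of the partition experiment.
partition_end = fault_start + partition_duration

# Time at which create table t1 returned its error.
error_returned = datetime(2023, 12, 26, 12, 27, 3)

# Seconds between the partition being lifted and the error being returned.
gap_seconds = (error_returned - partition_end).total_seconds()

print(partition_end.strftime("%H:%M:%S"), gap_seconds)
```

The error comes back about one second after the partition is lifted, which suggests the statement was blocked for as long as the partition lasted rather than failing on its own.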

Although it is expected that the SQL fails, is it reasonable for create table t1 to be stuck for such a long time? For such a query, I think we always expect it to finish with single-digit latency. Shall we add a timeout here?
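To make the timeout idea concrete, here is a minimal client-side sketch in Python. It is not how RisingWave handles DDL internally; `run_with_timeout` and `StatementTimeoutError` are hypothetical names, and the wrapped callable stands in for whatever actually issues the statement.

```python
import concurrent.futures


class StatementTimeoutError(Exception):
    """Raised when a statement does not finish within its deadline (hypothetical)."""


def run_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread and stop waiting after timeout_s seconds.

    The worker is not cancelled; the caller simply gets an error early
    instead of blocking until the underlying call returns.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise StatementTimeoutError(
            f"statement did not finish within {timeout_s}s"
        )
    finally:
        # Do not block on a stuck worker; let it finish in the background.
        pool.shutdown(wait=False)
```

With a wrapper like this, a create table call that hangs during the partition would surface an error after the chosen deadline instead of after the full 10 minutes. A server-side timeout would be preferable, since this sketch leaves the stuck call running in the background.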

SCR-20231226-qw2

After the partition experiment finished, we retried creating table t1 and it succeeded.

However, after that, even though the network was back to normal, executing a select query failed.

Is it due to an issue similar to the one found in #14030?

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response
