Open BugenZhao opened 3 months ago
The cause is that we override the system's default behavior of killing the process immediately on SIGINT by handling the signal with tokio::signal::ctrl_c in order to implement graceful shutdown, but that implementation is not actually comprehensive or correct.
I'm wondering whether some sort of graceful shutdown is really necessary in our system. What about simply letting the components be killed by the system? As...
cc @zwang28 @yezizp2012
What about simply letting the components be killed by the system?
I think it's acceptable since this has always been done in the madsim recovery test.
I guess a madsim kill is still a soft kill that triggers the Ctrl-C signal and runs destructors? @wangrunji0408 👀
IIUC, madsim only issues a soft kill to shut down the cluster after all tests are completed. During recovery, it's a hard kill.
I think graceful shutdown still has some benefits. For example, a compute node can actively tell the meta node to remove it when being killed, which is more responsive than the current approach, where the meta node passively detects the departure only after heartbeat messages time out. (A manual operation is required now: https://docs.risingwave.com/docs/current/k8s-cluster-scaling/#scale-in)
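To make the difference concrete, here is a minimal sketch of the two removal paths, using only the Rust standard library. Everything in it is hypothetical and for illustration only: the `Registry` type, its method names, and the 30-second timeout are assumptions, not RisingWave's actual meta service code or configuration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical worker registry kept by a meta node (illustrative only).
struct Registry {
    heartbeat_timeout: Duration,
    last_seen: HashMap<u32, Instant>,
}

impl Registry {
    fn new(heartbeat_timeout: Duration) -> Self {
        Self { heartbeat_timeout, last_seen: HashMap::new() }
    }

    /// Record a heartbeat; the first heartbeat also registers the worker.
    fn heartbeat(&mut self, worker: u32, now: Instant) {
        self.last_seen.insert(worker, now);
    }

    /// Active path: a worker announces its exit during graceful shutdown.
    /// Returns true if the worker was registered.
    fn unregister(&mut self, worker: u32) -> bool {
        self.last_seen.remove(&worker).is_some()
    }

    /// Passive path: remove workers whose last heartbeat is older than the
    /// timeout and return their ids in ascending order.
    fn expire(&mut self, now: Instant) -> Vec<u32> {
        let timeout = self.heartbeat_timeout;
        let mut dead: Vec<u32> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > timeout)
            .map(|(id, _)| *id)
            .collect();
        dead.sort_unstable();
        for id in &dead {
            self.last_seen.remove(id);
        }
        dead
    }

    fn is_alive(&self, worker: u32) -> bool {
        self.last_seen.contains_key(&worker)
    }
}

fn main() {
    let t0 = Instant::now();
    // Assumed timeout value, chosen only for the example.
    let mut reg = Registry::new(Duration::from_secs(30));
    reg.heartbeat(1, t0);
    reg.heartbeat(2, t0);

    // Worker 1 shuts down gracefully and is removed at once.
    reg.unregister(1);

    // Worker 2 is hard-killed: it stays listed until the timeout elapses.
    let expired = reg.expire(t0 + Duration::from_secs(10));
    println!("after 10s: expired={:?}, worker 2 alive={}", expired, reg.is_alive(2));
    let expired = reg.expire(t0 + Duration::from_secs(31));
    println!("after 31s: expired={:?}, worker 2 alive={}", expired, reg.is_alive(2));
}
```

In this sketch the gracefully stopped worker disappears from the registry immediately, while the hard-killed one is only noticed once a full heartbeat timeout has passed, which is the responsiveness gap described above.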
However, what we're doing now for graceful shutdown doesn't seem to help at all: we shut down each manager/worker one by one without doing any extra clean-up. This could be replaced by exiting the process directly, losing nothing while making the code cleaner.
Also linking to https://github.com/risingwavelabs/risingwave/pull/7675, which added the tests for graceful shutdown under simulation. cc @wangrunji0408 Could you share some updates on this (if any)? For example, is it still used or enforced currently?
Only clean-up & refactor tasks remaining.
Two panics occur when pressing Ctrl-C, resulting in an ungraceful exit. We should consider improving this user experience.