accepting -1 sgtm.
xref: istio/istio#44813 (https://github.com/istio/istio/pull/44813#issuecomment-1547953730)
Inside of K8S, ztunnel's TERMINATION_GRACE_PERIOD setting overlaps with/conflicts with the K8S terminationGracePeriodSeconds property.
- ztunnel's TERMINATION_GRACE_PERIOD configures how long ztunnel blocks after getting a SIGTERM to finish draining its connections, before giving up and exiting.
- K8S's terminationGracePeriodSeconds is a standard K8S pod property that configures how long K8S waits after sending a SIGTERM to ztunnel before it gives up waiting and sends e.g. a SIGKILL.
Problem:
- Objections were raised to auto-setting ztunnel's TERMINATION_GRACE_PERIOD to a value smaller than the K8S-configured terminationGracePeriodSeconds in our Helm chart.
- Objections were raised to exposing both properties individually in our Helm chart
- If the K8S setting exists, the ztunnel setting must be smaller than the K8S setting to avoid unexpected behavior.
- Simply removing TERMINATION_GRACE_PERIOD as a config option entirely from ztunnel works in a K8S context, but creates problems when running ztunnel outside of K8S (VMs, etc), as it creates the requirement for an external reaper/watchdog.
Suggestion:
- If we really don't want to autoset the ztunnel property based on the K8S property value in our helm chart, and we can't remove one of the options, then suggest we overload TERMINATION_GRACE_PERIOD to accept e.g. -1, which tells ztunnel to block indefinitely draining connections, until/unless an external watchdog (K8S) terminates it.
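For illustration only (this is a sketch, not ztunnel's actual code), the overloaded setting could be interpreted roughly as follows, with -1 mapping to "no internal deadline, drain until the external watchdog kills us":

use std::env;
use std::time::Duration;

// Hypothetical sketch: read TERMINATION_GRACE_PERIOD (in seconds).
// An unset, "-1", or unparsable value means "no internal deadline": keep
// draining until finished or until the external watchdog kills us.
fn termination_grace_period() -> Option<Duration> {
    match env::var("TERMINATION_GRACE_PERIOD").ok()?.parse::<i64>() {
        Ok(n) if n >= 0 => Some(Duration::from_secs(n as u64)),
        _ => None,
    }
}

A caller would then treat None as "wait for drain with no timeout" and Some(d) as "give up draining after d".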
I don't really love the idea of having ztunnel block indefinitely. Ideally ztunnel would always FIN & RST all its connections. I think the drop semantics of Rust should handle this if we're bringing down ztunnel in any sort of semi-graceful way. If k8s (or another external watchdog) just ends the process, I think the likelihood of having connections which are no longer valid but were never closed is going to go up. If we'd prefer ztunnel manages to terminate itself vs being ended by an external watcher, then having some TERMINATION_GRACE_PERIOD defined which is strictly less than the watchdog's grace period is something we should want.
I don't really love the idea of having ztunnel block indefinitely.
~block indefinitely~ block until it receives a stronger TERM signal, or it successfully drains all its connections.
If k8s (or another external watchdog) just ends the process I think the likelihood of having connections which are no longer valid but were never closed is going to go up.
We can't control what the process scheduler does, but almost all of them support "ask, wait, then ask less nicely". ztunnel has no real control over how long that wait is, regardless. Could be zero seconds, could be a week. Current connection termination behavior is that we do what we can before our scheduler reaps us.
I'm not sure I'm understanding this then. If you have -1, would ztunnel just close connections at its own maximum possible aggression and block only until the connections are terminated or it gets a stronger signal? I mean this as opposed to something like allowing existing connections to naturally run their course and be closed by the clients until it gets a stronger signal.
If you have -1, would ztunnel just close connections at its own maximum possible aggression and block only until the connections are terminated or it gets a stronger signal?
Yes, that's the proposal.
But when it gets SIGKILL it cannot gracefully clean up, it's force-closed, right? I think that is Ian's concern.
But when it gets SIGKILL it cannot gracefully clean up, it's force-closed, right? I think that is Ian's concern.
Yep - if it hasn't finished by the time SIGKILL (or whatever) comes around, it will be forcibly terminated.
That's a problem with or without TERMINATION_GRACE_PERIOD though; we have no actual control over when whatever process scheduler running ztunnel decides to give up waiting for ztunnel to exit cleanly.
TERMINATION_GRACE_PERIOD is nothing but a way to handle this without a process scheduler (K8S, systemd/init/whatever) present. If one is present, it's a moot setting.
yeah but if ztunnel decides to exit it can shut down the connections vs the process just exiting? TBH I don't know for sure what difference this ends up making in practice
yeah but if ztunnel decides to exit it can shut down the connections vs the process just exiting? TBH I don't know for sure what difference this ends up making in practice
Will ztunnel ever decide to self-exit, or will it respond to TERM signals? I think only the latter.
If only the latter, then however much time the process scheduler gives it to exit after sending it the first TERM signal is how much time it has to close connections.
- SIGTERM: start draining
- past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
- SIGKILL: process killed abruptly.
-1 means we lose (2) and go straight to (3)?
-1 means we lose (2) and go straight to (3)?
-1 means:
- SIGTERM: start draining
- Exit when fully drained, or get SIGKILL, whichever comes first.
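To make the two behaviours being discussed concrete, here is a rough tokio-based sketch (assumptions: wait_until_drained is a stand-in for whatever tracks open connections, and grace is the parsed TERMINATION_GRACE_PERIOD, with None standing for -1 - none of this is ztunnel's actual code):

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::time::timeout;

// Stand-in for "wait until the proxy has zero open connections".
async fn wait_until_drained() { /* ... */ }

async fn shutdown(grace: Option<Duration>) {
    // SIGTERM arrives: stop accepting new connections and start draining.
    let mut term = signal(SignalKind::terminate()).expect("install SIGTERM handler");
    term.recv().await;

    match grace {
        // TERMINATION_GRACE_PERIOD = N: drain for at most N seconds, then exit.
        Some(deadline) => {
            let _ = timeout(deadline, wait_until_drained()).await;
        }
        // TERMINATION_GRACE_PERIOD = -1: drain until done; SIGKILL (which
        // cannot be caught) is the only thing that cuts this short.
        None => wait_until_drained().await,
    }
    // Exit here, either fully drained or out of internal grace time.
}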
I don't think you can catch SIGKILL though. You just die.
- SIGTERM: start draining
- past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
- SIGKILL: process killed abruptly.
This is what I thought the "ideal" shutdown should look like. If we have TERMINATION_GRACE_PERIOD set to -1, should we essentially skip over the "start draining" phase and move directly into "we're exiting now, so here are at least some RSTs"?
- SIGTERM: start draining
- past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
- SIGKILL: process killed abruptly.
This is what I thought the "ideal" shutdown should look like. If we have TERMINATION_GRACE_PERIOD set to -1, should we essentially skip over the "start draining" phase and move directly into "we're exiting now, so here are at least some RSTs"?
Ideally ztunnel would always FIN & RST all its connections.
We can't FIN if ztunnel has no idea how long the process scheduler will wait before SIGKILL, correct.
We have to start RST on SIGTERM. Or, we go back to the initial proposal I made and set TERMINATION_GRACE_PERIOD to a strictly smaller value than the process scheduler's grace period, giving us a reasonable guarantee that we won't get a SIGKILL until our internal best-effort duration expires.
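That strictly-less-than relationship is simple to state in code - a sketch, assuming the chart (or whatever deploys ztunnel) passes the platform's grace period in:

use std::time::Duration;

// Sketch: derive ztunnel's internal drain deadline from the platform's
// grace period (e.g. terminationGracePeriodSeconds), leaving a margin so
// the best-effort FIN/RST work finishes before SIGKILL can arrive.
fn ztunnel_grace(platform_grace: Duration) -> Duration {
    let margin = Duration::from_secs(5); // arbitrary safety margin
    platform_grace
        .checked_sub(margin)
        .unwrap_or(Duration::from_secs(1)) // very short platform grace: leave ~1s to RST
}

With a 30s terminationGracePeriodSeconds this would yield a 25s internal deadline; the exact margin is a policy choice.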
My main overarching concern is that we're moving from a paradigm of sidecars where essentially a connection to your local istio proxy cannot fail and your client is free to be as poorly behaved as it wants. That's not 100% accurate but it's trying to approach that limit so to speak. With ztunnel we are introducing a client case of "my local istio proxy actually went away and I need to do something about it" so IMO ztunnel should be as well behaved as we can possibly make it. Give things a fighting chance.
I know from personal experience there are network libs that are happy to ignore FIN and RST from a server and expect to be in full control of the connection lifecycle at all times. Not much we can do about that, really. I guess use a sidecar. For the rest, though, not ghosting a connection will probably be the better option, even if connections end more abruptly than we would ideally hope for.
We can't RST on SIGTERM or it breaks the TERMINATION_GRACE_PERIOD. The goal of TERMINATION_GRACE_PERIOD is not to terminate all connections, it's to let them drain naturally. If we terminate them, it impacts the user. Ideally the user doesn't have overly long-lived connections, and we naturally get to zero connections.
In practice, not all connections will meet this criterion, so we have to cut them off at some point. But not immediately.
Or, we go back to the initial proposal I made and set TERMINATION_GRACE_PERIOD to a strictly smaller value than the process scheduler's grace period, giving us a guarantee that we won't get a SIGTERM until we best-effort FIN everything.
SIGTERM is always the first thing that happens
In practice, not all connections will meet this criterion, so we have to cut them off at some point. But not immediately.
ztunnel proper has absolutely no control over the cutoff, unless ztunnel's cutoff is explicitly configured to be LESS-THAN the platform/process scheduler cutoff out of band. That's the problem. TERMINATION_GRACE_PERIOD is a suggestion no one else is paying attention to.
SIGTERM is always the first thing that happens
Yep, meant SIGKILL, fixed.
We can't FIN if ztunnel has no idea how long the process scheduler will wait before SIGKILL, correct.
Yeah. I think that's the most correct option. If we don't have any known shutdown time window, just start with RST. It's not perfect, but it would be sort of a worst-case fallback mode.
The recommended setting would be to configure this ztunnel value to something strictly less than the k8s grace period, which would then allow for much nicer behavior.
I think the correct fix is to use TERMINATION_GRACE_PERIOD as the primary mechanism, and just make the pod setting +1min or something, just as a backstop in case ztunnel is broken and doesn't exit when it should. I agree that brings the Helm discussion back into the picture, but @ilrudie's point that we would not gracefully shut down with SIGKILL is more problematic IMO.
I think the correct fix is to use TERMINATION_GRACE_PERIOD as the primary mechanism, and just make the pod setting +1min or something, just as a backstop in case ztunnel is broken and doesn't exit when it should. I agree that brings the Helm discussion back into the picture, but @ilrudie's point that we would not gracefully shut down with SIGKILL is more problematic IMO.
Agreed.
If we want to actually do best-effort FIN, we need $ztunnelGrace and $platformGrace to be directly related/linked in a strictly LESS-THAN relationship, which is what https://github.com/istio/istio/pull/44813 does.
If we don't want them to be directly related/linked, then we can't do best-effort FIN and have to settle for best-effort RST.
We either need to document that, or put it in Helm (or both, probably)
@costinm thoughts/suggestions on above?
Yet another NIH way to stop processes?
The question I usually have is: what is the common/standard/prior art? And for whatever use case you are trying to solve - did anyone else really have to solve it, and what are the exact requirements?
My understanding is that most proxies (and servers of any kind) are notified that they need to gracefully stop - via SIGTERM usually.
Normally the LB and CNI will direct clients to a different instance and drive the draining. For pod to pod - ztunnel should stop accepting new connections (and the go-away is exactly for this purpose) but leave existing connections alone.
What would it do differently with the extra variable? The existing connections are terminated by SIGKILL anyway.
If we find prior art - I think we should learn from it, but also decide if it is a priority and worth the complexity (the env var is easy for us - but people operating ztunnel will need to understand it, we need to support it long term if we believe it's a really important API, etc.)
Envoy exposes drain time as a command line argument. Seems similar in concept to having the variable to define how long ztunnel should attempt to gently drain connections before becoming more aggressive.
Yet another NIH way to stop processes?
No, just that ztunnel, like all proxies, needs to be informed of what time window it has to gracefully FIN connections.
The time window ztunnel has to gracefully FIN connections is the time window between SIGTERM and SIGKILL.
Once ztunnel gets SIGKILL, it's out of time and cannot do anything in response to that but be killed immediately.
So, if the platform (k8s, systemd) has a grace period after SIGTERM before it sends SIGKILL, ztunnel needs to know that it has LESS-THAN that amount of time to wait for clients to gracefully accept FIN-style connection closes, before forcefully sending RST connection closes.
As mentioned above, we cannot simply skip FIN and jump right to RST on SIGTERM, and even if we did we would potentially have the same problem where we have no idea how long the platform will give us to handle that.
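As an illustration of that escalation (a sketch, not ztunnel's implementation): shutting down the write half sends the FIN, and setting SO_LINGER to zero before the final close turns it into an RST:

use std::time::Duration;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;
use tokio::time::timeout;

// Sketch: close one proxied connection politely (FIN, then wait for the
// peer to close its side) within `fin_window`; force an RST if it doesn't.
async fn close_connection(mut stream: TcpStream, fin_window: Duration) {
    let polite = async {
        stream.shutdown().await?; // half-close: sends our FIN
        let mut buf = [0u8; 1024];
        while stream.read(&mut buf).await? > 0 {} // EOF means the peer sent FIN too
        std::io::Result::Ok(())
    };
    let finished_in_time = timeout(fin_window, polite).await.is_ok();
    if !finished_in_time {
        // Out of time: SO_LINGER(0) makes the close below send an RST.
        let _ = stream.set_linger(Some(Duration::ZERO));
    }
    drop(stream); // closes the socket either way
}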
Ztunnel can't FIN the streams forwarded to the app - app can gracefully FIN.
Ztunnel can send GO-AWAY or stop accepting - and for that we have SIGTERM.
Envoy and ztunnel as a sidecar are a different story from ztunnel as per-node server.
I still don't understand what ztunnel would do between the end of the configured drain duration and SIGKILL. After SIGTERM - stop accepting, send GO-AWAY. After DRAIN_DURATION - ???? SIGKILL automatically sends RST on all remaining connections.
The claim above was that SIGKILL would not result in RSTs. If it actually does, I agree. If it does not, then I think we need that?
I don't think ztunnel can react to a SIGKILL to send RST. Anything it wants to send must be sent earlier, in response to the SIGTERM.
re: John's point. If drops still occur in ztunnel and RST would be sent, that's a different story.
Ztunnel can't FIN the streams forwarded to the app - app can gracefully FIN. Ztunnel can send GO-AWAY or stop accepting - and for that we have SIGTERM. Envoy and ztunnel as a sidecar are a different story from ztunnel as per-node server.
Do we need ztunnel to FIN outbound connections it's proxying to upstream servers? If not, and it can just forcibly close outbound connections with no grace period, then we don't need it - if it does, then we do.
ztunnel needs to send GOAWAY/RST to clients, and close the connection, and it can start doing this on SIGTERM, yes, I got that backwards.
Real basic unscientific local test. If I let this run its course I see the drop printed. If I kill -9 the pid it never drops.
use std::{thread, time};

struct Droppable {
    name: &'static str,
}

impl Drop for Droppable {
    fn drop(&mut self) {
        // Runs on a normal exit; never runs if the process is SIGKILLed.
        println!("> Dropping {}", self.name);
    }
}

fn main() {
    let _a = Droppable { name: "testing" };
    let delay = time::Duration::from_secs(30);
    thread::sleep(delay);
}
Real basic unscientific local test. If I let this run its course I see the drop printed. If I kill -9 the pid it never drops.
It doesn't matter tho - there's no implicit wait in that scenario. When you get SIGTERM you start dropping connections as fast as possible. Whether they're all dropped by the time you get the follow-up SIGKILL is moot; you (ztunnel) have no control over that anyway and don't need a wait-window to deal with it.
The test was just whether the drop mechanism in Rust would somehow allow it to send a RST after getting SIGKILL, which seems to be a no.
We don't drop connections on SIGTERM, otherwise it's not a graceful shutdown. If we did, then it's trivial. The SIGTERM signals "stop accepting new connections, and wait for the clients to politely close connections". There is no point closing the client connections early unless we are about to shut down immediately after.
We don't drop connections on SIGTERM, otherwise it's not a graceful shutdown. If we did, then it's trivial. The SIGTERM signals "stop accepting new connections, and wait for the clients to politely close connections". There is no point closing the client connections early unless we are about to shut down immediately after.
That depends so much on specific client config - is it very useful as a ztunnel setting? We could just as easily stop redirecting new connections to ztunnel in e.g. the CNI, in which case a refusal window in ztunnel proper doesn't matter; draining existing connections is all that's needed.
Even without that, though, I'm not sure it's very useful.
If ztunnel's TERMINATION_GRACE_PERIOD == 30 secs (and the platform's is, say, 1 sec greater, 31 sec)
Yes, long-lived connections are broken. That is why we should avoid killing ztunnel as much as possible (for example, time it with a node upgrade), and use very long grace periods when we do have to (far longer than 30s - more like 30h).
In waypoint/sidecar this is better btw, since we can send GOAWAY and Connection: close. But we cannot signal to apps for raw TCP.
Do we care at all about ztunnel needing to gracefully (i.e. with a window) terminate its upstream connections?
That's the only scenario where I think TERMINATION_GRACE_PERIOD is really needed as a ztunnel setting, to @costinm's point.
I'm not sure what gracefully terminating upstream proxied connections gets us, unless we want to hold off killing the downstream connection until after we've gracefully terminated the corresponding upstream one.
I am 100% sure killing a process sends RST.
Upstream TCP connections can't be gracefully closed either - we don't control the ends; it would be a bug if ztunnel sent a FIN when the local pod didn't initiate it and may have in-flight data.
I do agree ztunnel stop should be very long - but that may be independent. You can cordon a node for 24h - and if we implement the swapping, whatever controls the process can coordinate it independent of the k8s process (which is not designed for node-wide critical services).
Sorry - the kernel actually sends FIN when the process dies, RST in some cases.
Sorry - the kernel actually sends FIN when the process dies, RST in some cases.
FIN when client, RST when server, on socket events, regardless of whether the process is still technically there or not - ok, that makes sense.
Upstream TCP connections can't be gracefully closed either - we don't control the ends; it would be a bug if ztunnel sent a FIN when the local pod didn't initiate it and may have in-flight data.
Yep, I was thinking more we might need to signal the server in a timely fashion that the connection has ended.
But it sounds like a courtesy FIN will be sent either way, for that purpose.
Ok, then to summarize - I don't see a good argument for a discrete ztunnel config setting for termination grace.
How long it stays up is not up to ztunnel to manage, and whether new connections are redirected to it isn't up to ztunnel to manage either - in fact to extend @ilrudie's point above, ztunnel itself actively refusing connections still being routed to it would likely cause issues with clients not expecting that, versus just proactively redirecting new connections and externally starving ztunnel of new ones.
Termination is a very tricky subject. Sorry for the misinformation and confusion between RST and FIN, it's been ~10 years since I debugged FIN/RST in depth. There is much more - some kernels throttle and drop FIN/RST after a certain rate, TLS has its own termination handshake - and usually applications making TCP connections don't have good ways to differentiate or detect if the connection is still alive. And a machine crash or network problems may also result in lost FIN/RST.
Even in normal operation (not shutdown) - I don't know if ztunnel can detect if either side sends a RST instead of FIN and propagate it.
But I think the main issue here is that while the upgrade for ztunnel requires a (very) long drain, this env variable doesn't seem like the right place. CNI + the code that will drive the shift/upgrade - or whatever drives the node cordon process - will have the power to coordinate this.
One thing to research is whether, when a node is upgraded, after cordon, the pods are terminated in any particular order, and what the timeout on the node shutdown is, i.e. what happens if a pod takes 1 week to terminate? If ztunnel is the last one standing - we should be safe.
Agree with that, and unless there are objections I'll remove TERMINATION_GRACE_PERIOD from ztunnel - as pointed out, if such a grace period is required, it must be defined/coordinated outside of ztunnel.
ztunnel itself actively refusing connections still being routed to it would likely cause issues with clients not expecting that, versus just proactively redirecting new connections and externally starving ztunnel of new ones.
Not sure I agree with this necessarily. It's better to reject a new connection than to accept a connection and drop it later.
However, we also cannot start rejecting immediately, since there is some propagation delay before clients stop sending us new connections. Probably fine to just accept them, but I don't think it's ideal.
Not sure I agree with this necessarily. It's better to reject a new connection than to accept a connection and drop it later.
Agree, but I'm saying that ideally (and probably necessarily) cni is in charge of whether a ztunnel pod sees a new connection to begin with, and can simply choose to not send any new connections to a (cordoned, or obsoleted, or whatever) ztunnel pod, starving it naturally. ztunnel doesn't need to reject new connections because it would never see them, once cni decided to begin starving it.
tl;dr starve-and-terminate versus reject-drain-and-terminate - with the goal being that by the time you actually need to kill ztunnel under that model, it should have a conn count of 0, or close to it.
Ultimately that's the model we actually have to get to, I think - to @costinm's point.
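Under that model the shutdown path needs very little from ztunnel itself - roughly just "wait until the tracked connection count hits zero". A sketch with a tokio watch channel (the channel and its name are hypothetical, not ztunnel internals):

use tokio::sync::watch;

// Sketch of "starve-and-terminate": once the CNI stops routing new
// connections here, shutdown is just waiting for the tracked connection
// count to reach zero.
async fn wait_for_starvation(mut conn_count: watch::Receiver<usize>) {
    while *conn_count.borrow() > 0 {
        // Wakes whenever the proxy publishes a new count on connection close.
        if conn_count.changed().await.is_err() {
            break; // sender dropped: the proxy is tearing down anyway
        }
    }
    // Zero (or no more updates coming): safe to exit.
}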
Maybe I just missed it but I don't think the proposal has been summarized in one post. As I understand it, what's being put forth is something like:
- Decision to terminate a ztunnel is made and externally implemented. Ztunnel is not aware of this and continues to operate normally.
- "Grace period" for connection draining occurs.
- ztunnel receives SIGTERM. Upon handling the signal it immediately begins closing any existing connections (which would ideally be few/zero after the drain).
- ztunnel has no more work to do and terminates gracefully or gets SIGKILLed.
- (if necessary) Kernel closes any remaining connections ztunnel wasn't able to close itself.
I am not sure I agree. Seems like the first signal should be a SIGTERM, and SIGTERM does not mean closing connections.
John: there are 2 upgrade models mixed in this thread. For the 'cordon' model, ztunnel continues accepting connections and working normally. The node will not schedule new pods and existing pods will be stopped. This is commonly used to upgrade the kernel or kubelet on nodes.
The second mode is a hot swap driven by CNI - a new ztunnel on the same node. I don't know how we want to do this - if it is a brand new pod, we also need to keep listening and process everything until the new pod is ready and propagated, followed by a goaway so new streams go to the new pod. If it is a new ztunnel binary in the same pod - without CNI intervention - the second ztunnel could listen on the same port and the old one stop accepting. I guess this is different enough to be treated as a 3rd mode.
Decision to terminate a ztunnel is made and externally implemented. Ztunnel is not aware of this and continues to operate normally
"Grace period" for connection draining occurs.
ztunnel receives SIGTERM. Upon handling the signal it immediately begins closing any existing connections (which would ideally be few/zero after the drain).
Send goaway. Maybe stop accepting new connections (mode 3). But do not close existing connections.
If the connections naturally close and no new ones get created - which would happen since xds advertises the new pod (in mode 2) or pods are terminated (mode 1) - then once the connection count is zero it may exit. Maybe... We need to put each mode in a doc and consider all cases...
- ztunnel has no more work to do and terminates gracefully or gets SIGKILLed.
That's the question - how to determine "no more work" is true. Mode 1 - no more pods on the node. Mode 2 - zero connections, but what if the XDS update didn't propagate and some client keeps sending to the old pod? Can we just leave it alive and unused until the next upgrade (blue/green)?
- (if necessary) Kernel closes any remaining connections ztunnel wasn't able to close itself.
I don't think ztunnel should ever close streams. Idle h2 connections - yes, goaway - in modes 2 and 3 (not 1).
When kubelet runs out of patience - it'll be killed by the kernel and there's nothing we can do; close will be called.
Cordon still tries to terminate ztunnel, no? And thus gets a SIGTERM?
In all cases the signal to start "draining" is a SIGTERM in my mind?
Where "Drain" means "exit when we have no work left (zero connections)"
GOAWAY
I don't think this is useful since we are an L4 proxy. Closing the h2 connection vs GOAWAY is effectively the same -- the client ztunnel will close the app connection. We have no mechanism to signal to the application it should close the connection.
+1 for documenting the models more clearly so we can understand what exactly we're agreeing on.
I like @howardjohn's approach of beginning with a SIGTERM to the ztunnel, I think. At least in a k8s context it seems like a good signal to be able to handle as gracefully as possible. It wouldn't necessarily preclude some other way of initiating a drain event, but it would likely provide a better outcome in the event that a cluster administrator didn't follow Istio-specific drain steps.
In all cases the signal to start "draining" is a SIGTERM in my mind?
That's fair if we want that model - even in that case, we still don't need TERMINATION_GRACE_PERIOD, so for now it makes sense to proceed with removing that if you're good with it.
- kubectl cordon <node>: prevent scheduling of most new pods
- kubectl drain <node>: evict everything that's not managed by a daemonset, while respecting PDBs
- kubectl drain --ignore-daemonsets <node>: evict everything, including pods managed by a daemonset
Drain also cordons the node.