istio / ztunnel

The `ztunnel` component of ambient mesh
Apache License 2.0

`TERMINATION_GRACE_PERIOD` ztunnel setting #519

Closed · bleggett closed this issue 1 year ago

bleggett commented 1 year ago

xref: https://github.com/istio/istio/pull/44813#issuecomment-1547953730

Inside of K8S, ztunnel's TERMINATION_GRACE_PERIOD setting overlaps with/conflicts with the K8S terminationGracePeriodSeconds property.

  • ztunnel's TERMINATION_GRACE_PERIOD configures how long ztunnel blocks after getting a SIGTERM to finish draining its connections, before giving up and exiting.
  • K8S's terminationGracePeriodSeconds is a standard K8S pod property that configures how long K8S waits after sending a SIGTERM to ztunnel before it gives up waiting and sends e.g. a SIGKILL.

Problem:

  • Objections were raised to auto-setting ztunnel's TERMINATION_GRACE_PERIOD to a value smaller than the K8S-configured terminationGracePeriodSeconds in our Helm chart.
  • Objections were raised to exposing both properties individually in our Helm chart.
  • If the K8S setting exists, the ztunnel setting must be smaller than the K8S setting to avoid unexpected behavior.
  • Simply removing TERMINATION_GRACE_PERIOD as a config option entirely from ztunnel works in a K8S context, but creates problems when running ztunnel outside of K8S (VMs, etc.), as it creates the requirement for an external reaper/watchdog.

Suggestion:

  • If we really don't want to auto-set the ztunnel property based on the K8S property value in our Helm chart, and we can't remove one of the options, then I suggest we overload TERMINATION_GRACE_PERIOD to accept e.g. -1, which tells ztunnel to block indefinitely while draining connections, until/unless an external watchdog (K8S) terminates it. A rough sketch of that interpretation follows.
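For illustration only, a minimal sketch of how a -1 sentinel could be interpreted (the function name and the 5-second fallback are hypothetical, not ztunnel's actual config handling): None would mean "keep draining until the platform kills us", Some(d) a bounded drain window.

use std::env;
use std::time::Duration;

// Hypothetical parse of TERMINATION_GRACE_PERIOD:
// None => the proposed -1 behavior: no internal deadline, drain until externally terminated.
// Some => bounded drain window after SIGTERM.
fn termination_grace_period() -> Option<Duration> {
    // Unset: fall back to a default window (the 5s here is made up for the sketch).
    let raw = match env::var("TERMINATION_GRACE_PERIOD") {
        Ok(v) => v,
        Err(_) => return Some(Duration::from_secs(5)),
    };
    match raw.parse::<i64>() {
        Ok(secs) if secs < 0 => None,                       // -1: block indefinitely while draining
        Ok(secs) => Some(Duration::from_secs(secs as u64)), // normal bounded grace period
        Err(_) => Some(Duration::from_secs(5)),             // unparseable: default
    }
}

fn main() {
    match termination_grace_period() {
        None => println!("drain until fully drained or the watchdog (e.g. K8S) sends SIGKILL"),
        Some(d) => println!("drain for at most {:?}, then exit", d),
    }
}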

howardjohn commented 1 year ago

accepting -1 sgtm.

ilrudie commented 1 year ago

I don't really love the idea of having ztunnel block indefinitely. Ideally ztunnel would always FIN & RST all its connections. I think the drop semantics of Rust should handle this if we're bringing down ztunnel in any sort of semi-graceful way. If k8s (or another external watchdog) just ends the process, I think the likelihood of having connections which are no longer valid but were never closed is going to go up. If we'd prefer that ztunnel manages to terminate itself rather than being ended by an external watcher, then having some TERMINATION_GRACE_PERIOD defined which is strictly less than the watchdog's grace period is something we should want.

bleggett commented 1 year ago

I don't really love the idea of having ztunnel block indefinitely.

~block indefinitely~ block until it receives a stronger TERM signal, or it successfully drains all its connections.

If k8s (or another external watchdog) just ends the process I think the likelihood of having connections which are no longer valid but were never closed is going to go up.

We can't control what the process scheduler does, but almost all of them support "ask, wait, then ask less nicely". ztunnel has no real control over how long that wait is, regardless. Could be zero seconds, could be a week. The current connection-termination behavior is that we do what we can before our scheduler reaps us.

ilrudie commented 1 year ago

I'm not sure I'm understanding this then. If you have -1, would ztunnel just close connections with its own maximum possible aggression and block only until the connections are terminated or it gets a stronger signal? I mean this as opposed to something like allowing existing connections to naturally run their course and be closed by the clients until it gets a stronger signal.

bleggett commented 1 year ago

If you have -1, would ztunnel just close connections with its own maximum possible aggression and block only until the connections are terminated or it gets a stronger signal?

Yes, that's the proposal.

howardjohn commented 1 year ago

But when it gets SIGKILL it cannot gracefully clean up; it's force-closed, right? I think that is Ian's concern.

bleggett commented 1 year ago

But when it gets SIGKILL it cannot gracefully clean up; it's force-closed, right? I think that is Ian's concern

Yep - if it hasn't finished by the time SIGKILL (or whatever) comes around, it will be forcibly terminated.

That's a problem with or without TERMINATION_GRACE_PERIOD, though; we have no actual control over when whatever process scheduler is running ztunnel decides to give up waiting for ztunnel to exit cleanly.

TERMINATION_GRACE_PERIOD is nothing but a way to handle this without a process scheduler (K8S, systemd/init/whatever) present. If one is present, it's a moot setting.

howardjohn commented 1 year ago

Yeah, but if ztunnel decides to exit it can shut down the connections, vs the process just exiting? TBH I don't know for sure what difference this ends up making in practice

bleggett commented 1 year ago

Yeah, but if ztunnel decides to exit it can shut down the connections, vs the process just exiting? TBH I don't know for sure what difference this ends up making in practice

Will ztunnel ever decide to self-exit, or will it respond to TERM signals? I think only the latter.

If only the latter, then however much time the process scheduler gives it to exit after sending it the first TERM signal is how much time it has to close connections.

howardjohn commented 1 year ago
  1. SIGTERM: start draining
  2. past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
  3. SIGKILL: process killed abruptly.

-1 means we lose (2) and go straight to (3)?

bleggett commented 1 year ago
  1. SIGTERM: start draining
  2. past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
  3. SIGKILL: process killed abruptly.

-1 means we lose (2) and go straight to (3)?

-1 means

  1. SIGTERM: start draining
  2. Exit when fully drained, or get SIGKILL, whichever comes first. ztunnel doesn't control the grace period, if there is one. (Rough sketch below.)
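A rough sketch of that flow (hypothetical, assuming the tokio crate; wait_until_drained stands in for ztunnel's real connection tracking): with a configured grace period we race the drain against a deadline, with -1 (None) we only stop when drained or when the platform kills us.

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::time::{sleep, timeout};

// Placeholder for "wait until the active connection count reaches zero".
async fn wait_until_drained() {
    sleep(Duration::from_secs(3)).await;
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut sigterm = signal(SignalKind::terminate())?;
    sigterm.recv().await; // 1. SIGTERM: stop accepting new connections, start draining.

    // None models the proposed -1; Some(d) models a configured TERMINATION_GRACE_PERIOD.
    let grace: Option<Duration> = None;
    match grace {
        Some(deadline) => {
            // Give up draining once the deadline expires.
            let _ = timeout(deadline, wait_until_drained()).await;
        }
        // 2. -1: exit only when fully drained; SIGKILL (which we can't catch) is the only other way out.
        None => wait_until_drained().await,
    }
    Ok(())
}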
howardjohn commented 1 year ago

I don't think you can catch SIGKILL though. You just die.

ilrudie commented 1 year ago
  1. SIGTERM: start draining
  2. past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
  3. SIGKILL: process killed abruptly.

This is what I thought the "ideal" shutdown should look like. If we have TERMINATION_GRACE_PERIOD set to -1, should we essentially skip over the "start draining" phase and move directly into "we're exiting now, so here are at least some RSTs"?

bleggett commented 1 year ago
  1. SIGTERM: start draining
  2. past TERMINATION_GRACE_PERIOD: ztunnel decides to exit
  3. SIGKILL: process killed abruptly.

This is what I thought the "ideal" shutdown should look like. If we have TERMINATION_GRACE_PERIOD set to -1, should we essentially skip over the "start draining" phase and move directly into "we're exiting now, so here are at least some RSTs"?

Ideally ztunnel would always FIN & RST all its connections.

We can't FIN if ztunnel has no idea how long the process scheduler will wait before SIGKILL, correct.

We have to start RST on SIGTERM. Or, we go back to the initial proposal I made and set TERMINATION_GRACE_PERIOD to a strictly smaller value than the process scheduler's grace period, giving us a reasonable guarantee that we won't get a SIGKILL until our internal best-effort duration expires.

ilrudie commented 1 year ago

My main overarching concern is that we're moving from a paradigm of sidecars, where essentially a connection to your local istio proxy cannot fail and your client is free to be as poorly behaved as it wants. That's not 100% accurate, but it's trying to approach that limit, so to speak. With ztunnel we are introducing a client case of "my local istio proxy actually went away and I need to do something about it", so IMO ztunnel should be as well behaved as we can possibly make it. Give things a fighting chance.

I know from personal experience there are network libs that are happy to ignore FIN and RST from a server and expect to be in full control of the connection lifecycle at all times. Not much we can do about that, really; I guess use a sidecar. For the rest, though, not ghosting a connection will probably be the better option, even if connections are ending more abruptly than we would ideally hope for.

howardjohn commented 1 year ago

We can't RST on SIGTERM or it breaks the TERMINATION_GRACE_PERIOD. The goal of TERMINATION_GRACE_PERIOD is not to terminate all connections; it's to let them drain naturally. If we terminate them, it impacts the user. Ideally the user doesn't have overly long-lived connections, and we naturally get to zero connections.

In practice, not all connections will meet this criterion, so we have to cut them off at some point. But not immediately.

Or, we go back to the initial proposal I made and set TERMINATION_GRACE_PERIOD to a strictly smaller value than the process scheduler's grace period, giving us a guarantee that we won't get a SIGTERM until we best-effort FIN everything.

SIGTERM is always the first thing that happens

bleggett commented 1 year ago

In practice, not all connections will meet this criterion, so we have to cut them off at some point. But not immediately.

ztunnel proper has absolutely no control over the cutoff, unless ztunnel's cutoff is explicitly configured to be LESS-THAN the platform/process scheduler cutoff out of band. That's the problem. TERMINATION_GRACE_PERIOD is a suggestion no one else is paying attention to.

SIGTERM is always the first thing that happens

Yep, meant SIGKILL, fixed.

ilrudie commented 1 year ago

We can't FIN if ztunnel has no idea how long the process scheduler will wait before SIGKILL, correct.

Yeah. I think that's the most correct option. If we don't have any known shutdown time window, just start with RST. It's not perfect, but it would be sort of a worst-case fallback mode.

The recommended setting would be to configure this ztunnel value to something strictly less than the k8s grace period, which would then allow for much nicer behavior.

howardjohn commented 1 year ago

I think the correct fix is to use TERMINATION_GRACE_PERIOD as the primary mechanism, and just make the pod setting that value +1min or something, just as a backstop in case ztunnel is broken and doesn't exit when it should. I agree that brings the Helm discussion back into the picture, but @ilrudie's point that we would not gracefully shut down with SIGKILL is more problematic IMO.

bleggett commented 1 year ago

I think the correct fix is to use TERMINATION_GRACE_PERIOD as the primary mechanism, and just make the pod setting that value +1min or something, just as a backstop in case ztunnel is broken and doesn't exit when it should. I agree that brings the Helm discussion back into the picture, but @ilrudie's point that we would not gracefully shut down with SIGKILL is more problematic IMO.

Agreed.

We either need to document that, or put it in Helm (or both, probably)

bleggett commented 1 year ago

@costinm thoughts/suggestions on above?

costinm commented 1 year ago

Yet another NIH way to stop processes?

The question I usually have is: what is the common/standard/prior art? And whatever use case you are trying to solve - did anyone else really have to solve it, and what are the exact requirements?

My understanding is that most proxies (and servers of any kind) are notified that they need to gracefully stop - via SIGTERM usually.

Normally the LB and CNI will direct clients to a different instance and drive the draining. For pod-to-pod - ztunnel should stop accepting new connections (and the go-away is exactly for this purpose) but leave existing connections alone.

What would it do differently with the extra variable? The existing connections are terminated by SIGKILL anyway.

If we find prior art - I think we should learn from it, but also decide if it is a priority and worth the complexity (the env var is easy for us - but people operating ztunnel will need to understand it, and we need to support it long term if we believe it's a really important API, etc.)

ilrudie commented 1 year ago

Envoy exposes drain time as a command line argument. Seems similar in concept to having the variable to define how long ztunnel should attempt to gently drain connections before becoming more aggressive.

bleggett commented 1 year ago

Yet another NIH way to stop processes ?

No, just that ztunnel, like all proxies, needs to be informed of what time window it has to gracefully FIN connections.

The time window ztunnel has to gracefully FIN connections is the time window between SIGTERM and SIGKILL.

Once ztunnel gets SIGKILL, it's out of time and cannot do anything in response to that but be killed immediately.

So, if the platform (k8s, systemd) has a grace period after SIGTERM before it sends SIGKILL, ztunnel needs to know that it has LESS-THAN that amount of time to wait for clients to gracefully accept FIN-style connection closes, before forcefully sending RST connection closes.

As mentioned above, we cannot simply skip FIN and jump right to RST on SIGTERM, and even if we did, we would potentially have the same problem: we have no idea how long the platform will give us to handle that.
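To make the FIN-vs-RST escalation concrete, a minimal sketch (assuming the socket2 crate; the helper names are hypothetical, this is not ztunnel's actual close path): a normal close sends a FIN, while setting SO_LINGER to zero before closing makes the kernel send a RST instead.

use std::net::{Shutdown, TcpListener, TcpStream};
use std::time::Duration;
use socket2::SockRef;

// Graceful close: shutting down / dropping the stream sends a FIN and lets the
// peer finish the close handshake on its own time.
fn close_gracefully(stream: TcpStream) {
    let _ = stream.shutdown(Shutdown::Both);
    // FIN goes out; the socket is fully closed when `stream` is dropped.
}

// Forced close: SO_LINGER with a zero timeout makes close() send a RST,
// i.e. what a "grace period expired, give up" escalation would look like.
fn close_forcefully(stream: TcpStream) -> std::io::Result<()> {
    SockRef::from(&stream).set_linger(Some(Duration::from_secs(0)))?;
    drop(stream); // RST goes out here
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Loopback pair just to have two real sockets to close.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let (server_side, _) = listener.accept()?;

    close_gracefully(server_side); // peer would read an orderly EOF
    close_forcefully(client)?;     // peer would observe a connection reset
    Ok(())
}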

costinm commented 1 year ago

Yet another NIH way to stop processes ?

No, just that ztunnel, like all proxies, needs to be informed of what time window it has to gracefully FIN connections.

Ztunnel can't FIN the streams forwarded to the app - app can gracefully FIN.

Ztunnel can send GO-AWAY or stop accepting - and for that we have SIGTERM.

Envoy and ztunnel as a sidecar are a different story from ztunnel as per-node server.

The time window ztunnel has to gracefully FIN connections is the time window between SIGTERM and SIGKILL.

Once ztunnel gets SIGKILL, it's out of time and cannot do anything in response to that but be killed immediately.

So, if the platform (k8s, systemd) has a grace period after SIGTERM before it sends SIGKILL, ztunnel needs to know that it has LESS-THAN that amount of time to wait for clients to gracefully accept FIN-style connection closes, before forcefully sending RST connection closes.

I still don't understand what ztunnel would do between the end of the configured drain duration and SIGKILL. After SIGTERM - stop accepting, send GO-AWAY. After DRAIN_DURATION - ???? SIGKILL automatically sends RST on all remaining connections.

howardjohn commented 1 year ago

The claim above was that SIGKILL would not result in RSTs. If it actually does, I agree. If it does not, then I think we need that?

ilrudie commented 1 year ago

I don't think ztunnel can react to a SIGKILL to send RST. Anything it wants to send must be sent beforehand, in response to the SIGTERM.

re: John's point: if drops still occur in ztunnel and RST would be sent, that's a different story.

bleggett commented 1 year ago

Ztunnel can't FIN the streams forwarded to the app - app can gracefully FIN. Ztunnel can send GO-AWAY or stop accepting - and for that we have SIGTERM. Envoy and ztunnel as a sidecar are a different story from ztunnel as per-node server.

Do we need ztunnel to FIN outbound connections it's proxying to upstream servers? If not, and it can just forcibly close outbound connections with no grace period, then we don't need it - if it does, then we do.

ztunnel needs to send GOAWAY/RST to clients, and close the connection, and it can start doing this on SIGTERM, yes, I got that backwards.

ilrudie commented 1 year ago

Real basic unscientific local test. If I let this run its course I see the drop printed. If I kill -9 the pid it never drops.

use std::{thread, time};

// A struct with a Drop impl, standing in for a connection that needs cleanup.
struct Droppable {
    name: &'static str,
}

impl Drop for Droppable {
    fn drop(&mut self) {
        // If the process is SIGKILLed, this destructor never runs.
        println!("> Dropping {}", self.name);
    }
}

fn main() {
    let _a = Droppable { name: "testing" };

    // Keep the process alive long enough to send it a signal.
    let delay = time::Duration::from_secs(30);

    thread::sleep(delay);
}
bleggett commented 1 year ago

Real basic unscientific local test. If I let this run its course I see the drop printed. If I kill -9 the pid it never drops.

It doesn't matter though - there's no implicit wait in that scenario. When you get SIGTERM you start dropping connections as fast as possible. Whether they're all dropped by the time you get the follow-up SIGKILL is moot; you (ztunnel) have no control over that anyway and don't need a wait window to deal with it.

ilrudie commented 1 year ago

The test was just whether the drop mechanism in Rust would somehow allow it to send a RST after getting SIGKILL, which seems to be no.

howardjohn commented 1 year ago

We don't drop connections on SIGTERM, otherwise it's not a graceful shutdown. If we did, then it's trivial. The SIGTERM signals "stop accepting new connections, and wait for the clients to politely close connections". There is no point closing the client connections early unless we are about to shut down immediately after.

bleggett commented 1 year ago

We don't drop connections on SIGTERM, otherwise it's not a graceful shutdown. If we did, then it's trivial. The SIGTERM signals "stop accepting new connections, and wait for the clients to politely close connections". There is no point closing the client connections early unless we are about to shut down immediately after.

Given how much that depends on specific client config, is it very useful as a ztunnel setting? We could just as easily stop redirecting new connections to ztunnel in e.g. CNI, in which case a refusal window in zt proper doesn't matter; draining existing connections is all that's needed.

Even without that, though, I'm not sure it's very useful.

If ztunnel's TERMINATION_GRACE_PERIOD == 30 secs (and the platform's is, say, 1 sec greater, 31 sec)

howardjohn commented 1 year ago

Yes, long-lived connections are broken. That is why we should avoid killing ztunnel as much as possible (for example, time it with a node upgrade), and use very long grace periods when we do have to (far longer than 30s - more like 30h)

howardjohn commented 1 year ago

In waypoint/sidecar this is better, btw, since we can send GOAWAY and Connection: close. But we cannot signal to apps for raw TCP.

bleggett commented 1 year ago

Do we care at all about ztunnel needing to gracefully (i.e. with a window) terminate its upstream connections?

That's the only scenario where I think TERMINATION_GRACE_PERIOD is really needed as a ztunnel setting, to @costinm 's point

I'm not sure what gracefully terminating upstream proxied connections gets us, unless we want to wait to kill the downstream connection until after we've gracefully terminated the corresponding upstream one.

costinm commented 1 year ago

I am 100% sure killing a process sends RST.

Upstream tcp connections can't be gracefully closed either - we don't control the ends; it would be a bug if ztunnel sent a FIN when the local pod didn't initiate it and may have in-flight data.

I do agree the ztunnel stop should be very long - but that may be independent. You can cordon a node for 24h - and if we implement the swapping, whatever controls the process can coordinate it independently of the k8s process (which is not designed for node-wide critical services).

costinm commented 1 year ago

Sorry - kernel sends FIN actually when the process dies, RST in some cases.

bleggett commented 1 year ago

Sorry - kernel sends FIN actually when the process dies, RST in some cases.

FIN when client, RST when server, on socket events, regardless of whether the process is still technically there or not - ok, that makes sense.

Upstream tcp connections can't be gracefully closed either - we don't control the ends; it would be a bug if ztunnel sent a FIN when the local pod didn't initiate it and may have in-flight data.

Yep, I was thinking more we might need to signal the server in a timely fashion that the connection has ended.

But it sounds like a courtesy FIN will be sent either way, for that purpose.

Ok, then to summarize - I don't see a good argument for a discrete ztunnel config setting for termination grace.

How long it stays up is not up to ztunnel to manage, and whether new connections are redirected to it isn't up to ztunnel to manage either - in fact to extend @ilrudie's point above, ztunnel itself actively refusing connections still being routed to it would likely cause issues with clients not expecting that, versus just proactively redirecting new connections and externally starving ztunnel of new ones.

costinm commented 1 year ago

Termination is a very tricky subject. Sorry for the misinformation and the confusion between RST and FIN; it's been ~10 years since I debugged FIN/RST in depth. There is much more - some kernels throttle and drop FIN/RST after a certain rate, TLS has its own termination handshake - and usually applications making TCP connections don't have good ways to differentiate or detect whether the connection is still alive. And a machine crash or network problems may also result in lost FIN/RST.

Even in normal operation (not shutdown) - I don't know if ztunnel can detect whether either side sends a RST instead of FIN and propagate it.

But I think the main issue here is that while the upgrade for ztunnel requires a (very) long drain, this env variable doesn't seem like the right place. CNI + the code that will drive the shift/upgrade - or whatever drives the node cordon process - will have the power to coordinate this.

One thing to research is, when a node is upgraded after cordon, whether the pods are terminated in any order and what the timeout on the node shutdown is, i.e. what happens if a pod takes 1 week to terminate? If ztunnel is the last one standing - we should be safe.

bleggett commented 1 year ago

Agree with that, and unless there are objections I'll

howardjohn commented 1 year ago

ztunnel itself actively refusing connections still being routed to it would likely cause issues with clients not expecting that, versus just proactively redirecting new connections and externally starving ztunnel of new ones.

Not sure I agree with this necessarily. It's better to reject a new connection than to accept a connection and drop it later.

However, we also cannot start rejecting immediately, since there is some propagation delay before clients stop sending us new connections. Probably fine to just accept them, but I don't think it's ideal.

bleggett commented 1 year ago

Not sure I agree with this necessarily. It's better to reject a new connection than to accept a connection and drop it later.

Agree, but I'm saying that ideally (and probably necessarily) CNI is in charge of whether a ztunnel pod sees a new connection to begin with, and can simply choose not to send any new connections to a (cordoned, or obsoleted, or whatever) ztunnel pod, starving it naturally. ztunnel doesn't need to reject new connections because it would never see them, once CNI decided to begin starving it.

tl;dr starve-and-terminate versus reject-drain-and-terminate - with the goal being that by the time you actually need to kill ztunnel under that model, it should have a connection count of 0, or close to it.

Ultimately that's the model we actually have to get to, I think - to @costinm's point.

ilrudie commented 1 year ago

Maybe I just missed it but I don't think the proposal has been summarized in one post. As I understand it what's being put forth is something like:

  1. Decision to terminate a ztunnel is made and externally implemented. Ztunnel is not aware of this and continues to operate normally.
  2. "Grace period" for connection draining occurs.
  3. ztunnel receives SIGTERM. Upon handling the signal it immediately begins closing any existing connections (which would ideally be few/zero after the drain).
  4. ztunnel has no more work to do and terminates gracefully or gets SIGKILLed.
  5. (if necessary) Kernel closes any remaining connections ztunnel wasn't able to close itself.
howardjohn commented 1 year ago

I am not sure I agree. Seems like the first signal should be a SIGTERM, and SIGTERM does not mean closing connections.

  1. Decision to terminate a ztunnel is made and externally implemented. Ztunnel is not aware of this and continues to operate normally. This is a SIGTERM from Kubernetes.
  2. During this, we wait for connections to exit due to natural causes. External system should ensure no new connections are established.
  3. At any point ztunnel can exit due to having 0 open connections. We might want to wait a few seconds, since the signal to start sending traffic to another ztunnel has eventual consistency, so we will still get new connections at first. However, I suspect it's better to drop those than to accept them (and keep us alive or have to force-close the connection later)
  4. ...wait...
  5. ztunnel receives SIGKILL. It dies
  6. (if necessary) Kernel closes any remaining connections ztunnel wasn't able to close itself. (A rough sketch of this flow follows below.)
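A rough sketch of that sequence (hypothetical, assuming the tokio crate; the atomic counter stands in for ztunnel's real connection accounting):

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::time::sleep;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Stand-in for the number of proxied connections currently open.
    let active = Arc::new(AtomicUsize::new(3));

    // Simulate connections exiting due to natural causes over time.
    let sim = active.clone();
    tokio::spawn(async move {
        while sim.load(Ordering::SeqCst) > 0 {
            sleep(Duration::from_secs(10)).await;
            sim.fetch_sub(1, Ordering::SeqCst);
        }
    });

    // 1. SIGTERM from Kubernetes: stop accepting new connections here.
    let mut sigterm = signal(SignalKind::terminate())?;
    sigterm.recv().await;

    // 3. Short settle window: redirection to another ztunnel is eventually
    //    consistent, so a few stragglers may still arrive right after the SIGTERM.
    sleep(Duration::from_secs(5)).await;

    // 2./4. Wait for the connection count to reach zero, then exit cleanly.
    while active.load(Ordering::SeqCst) > 0 {
        sleep(Duration::from_millis(500)).await;
    }

    // 5./6. If we never reach zero, SIGKILL ends the process and the kernel
    //       closes whatever sockets remain.
    Ok(())
}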
costinm commented 1 year ago

John: there are 2 upgrade models mixed in this thread. For the 'cordon' model, ztunnel continues accepting connections and working normally. The node will not schedule new pods and existing pods will be stopped. This is commonly used to upgrade the kernel or kubelet on nodes.

The second mode is a hot swap driven by CNI - a new ztunnel on the same node. I don't know how we want to do this - if it is a brand new pod, we also need to keep listening and process everything until the new pod is ready and propagated, followed by a goaway so new streams go to the new pod. If it is using a new ztunnel binary in the same pod - without CNI intervention - the second ztunnel could listen on the same port and the old one could stop accepting. I guess this is different enough to be treated as a 3rd mode.

Maybe I just missed it but I don't think the proposal has been summarized in one post. As I understand it what's being put forth is something like:

  1. Decision to terminate a ztunnel is made and externally implemented. Ztunnel is not aware of this and continues to operate normally

  2. "Grace period" for connection draining occurs.

  3. ztunnel receives SIGTERM. Upon handling the signal it immediately begins closing any existing connections (which would ideally be few/zero after the drain).

Send goaway. Maybe stop accepting new connections (mode 3). But not close existing connections.

If the connections naturally close and no new ones get created - which would happen since xds advertises the new pod (in mode 2) or pods are terminated (mode 1) - then if the connection count is zero it may exit. Maybe... We need to put each mode in a doc and consider all cases...

  1. ztunnel has no more work to do and terminates gracefully or get's SIGKILLed.

That's the question - how to determine that "no more work" is true. Mode 1 - no more pods on the node. Mode 2 - zero connections, but what if the XDS update didn't propagate and some client keeps sending to the old pod? Can we just leave it alive and unused until the next upgrade (blue/green)?

  1. (if necessary) Kernel closes any remaining connections ztunnel wasn't able to close itself.

I don't think ztunnel should ever close streams. Idle h2 connections - yes, goaway - in modes 2 and 3 (not 1)

When kubelet runs out of patience, it'll be killed by the kernel and there's nothing we can do; close will be called.

howardjohn commented 1 year ago

Cordon still tries to terminate ztunnel, no? And thus gets a SIGTERM?

In all cases the signal to start "draining" is a SIGTERM in my mind?

Where "Drain" means "exit when we have no work left (zero connections)"

GOAWAY

I don't think this is useful since we are an L4 proxy. Closing the h2 connection vs GOAWAY is effectively the same -- the client ztunnel will close the app connection. We have no mechanism to signal to the application that it should close the connection.

ilrudie commented 1 year ago

+1 for document the models more clearly so we can understand what exactly we're agreeing on.

I like @howardjohn's approach of beginning with SIGTERM to the ztunnel, I think. At least in a k8s context it seems like a good signal to be able to handle as gracefully as possible. It wouldn't necessarily preclude some other way of initiating a drain event, but it would likely provide a better outcome in the event that a cluster administrator didn't follow istio-specific drain steps.

bleggett commented 1 year ago

In all cases the signal to start "draining" is a SIGTERM in my mind?

That's fair if we want that model - even in that case, we still don't need TERMINATION_GRACE_PERIOD, so for now it makes sense to proceed with removing that if you're good with it.

ilrudie commented 1 year ago

  • kubectl cordon <node> - prevents scheduling of most new pods
  • kubectl drain <node> - evicts everything that's not managed by a DaemonSet, while respecting PDBs
  • kubectl drain --ignore-daemonsets <node> - evicts everything, including pods managed by a DaemonSet

Drain also cordons the node.