Closed cchatfield closed 2 years ago
With GKE node spot instances, the graceful termination period is 30s.
The lowest I can configure lame_duck_duration is 30s.
It seems like sometimes the server is exiting ok and other times, it is corrupt. Not sure if the attached disk is closed properly or if the server still has open handles to the files.
How can I see in the logs that all server processes have completed and are fully shut down?
Is there a way to shorten lame_duck_duration + lame_duck_grace_period + overhead to be at or under 30s?
Hi @cchatfield, the -DV output appears to have been cut off prematurely due to a port allocation issue. Can you reproduce the issue, ensuring all the output is captured? If the server shuts down properly, you should see log messages indicating this.
Is there a way to shorten lame_duck_duration + lame_duck_grace_period + overhead to be at or under 30s?
I believe the minimum would be 30s if the grace period is set to zero. Not sure if the hard minimum of 30s for the duration can be lowered, @derekcollison? I am reading here that GKE provides 25s for graceful shutdown of non-system pods.
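For reference, both settings live in the server configuration file; a minimal sketch (the values shown are illustrative, and the grace period must be shorter than the duration):

```conf
# nats-server configuration sketch -- durations are illustrative
lame_duck_duration: "30s"     # total window for gradually closing client connections
lame_duck_grace_period: "0s"  # initial wait before the server starts closing connections
```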
However, stepping back, I am curious what the use case is for using spot instances. Is the desire to minimize cost and have them seamlessly swap out over time? They seem to be optimized for stateless workloads, and the best-practices section seems to indicate that both internal and external IPs can change and that local SSDs attached to spot VMs are not persisted.
Although NATS is definitely tolerant to node failures, the behavior is somewhat unclear with spot instances.
The instances experiencing the problem have been removed and recreated, so unfortunately I am unable to grab further output.
Use case: I use NATS for short-term messaging. I don't really need more than a few days of message history, and realistically about an hour. In dev/test I have 3 pods on 2 spot VMs. In prod, I have a dedicated pool of 5 spot nodes to run 5 pods (cluster).
I am using spot purely for cost. For prod, spot is $22.95/month versus $182.15 without spot.
I have been trying to see if I could use the cost savings for the larger deployment in prod, and it was fairly stable with 2.8.2. I just upgraded to 2.8.4 and it seems to be more problematic. All of that said, if there isn't a viable path for using spot instances, I can change my prod deployment to a smaller cluster with long-term commitments.
Thanks. Just to confirm I understand this bit:
I don't really need more than a few days of message history and realistically about an hour.
Message history (the stream data) can have different retention policies for short-term use cases. Is it the case that messages are published intermittently over a few days and then, once those messages are processed, "the work is done"? Is there a fixed window of time the nodes need to be up? Have you run into situations where the majority of nodes are unavailable and publishing fails?
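For the short-retention case described, a JetStream stream can bound its history by age. A sketch of a stream configuration (the stream name and subjects are hypothetical) that keeps roughly an hour of messages under limits-based retention:

```json
{
  "name": "EVENTS",
  "subjects": ["events.>"],
  "retention": "limits",
  "max_age": 3600000000000,
  "storage": "file",
  "num_replicas": 3
}
```

Note that max_age is expressed in nanoseconds (3600000000000 ns = 1 hour); messages older than this are discarded regardless of whether they have been consumed.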
Ultimately it sounds like you want a pay-per-usage model rather than paying for uptime (what some would call serverless).
It is published intermittently (depending on levels of activity). I expect to scale to consistent message volume, but was looking for cost reduction in the meantime. The cost of troubleshooting a node or stream failure really isn't worth the cost differential between spot and reserved. I will just use reserved stable instances in prod.
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
- [ ] Included nats-server -DV output
- [ ] Included versions of nats-server and affected client libraries used: nats-alpine:2.8.4, github.com/nats-io/nats.go v1.14.0
OS/Container environment:
GKE - 1.22.11-gke.400
Steps or code to reproduce the issue:
Start cluster with 3 replicas and drop one of the pods. No messages should have been written.
Expected result:
cluster would reallocate consumers/streams once the pod has been restored
Actual result:
consumer reallocation fails with