nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

Cluster does not handle node failure when active consumers, but no messages stored yet #3396

Closed · cchatfield closed this issue 2 years ago

cchatfield commented 2 years ago

Defect


Versions of nats-server and affected client libraries used:

nats-alpine:2.8.4, github.com/nats-io/nats.go v1.14.0

OS/Container environment:

GKE - 1.22.11-gke.400

Steps or code to reproduce the issue:

Start a cluster with 3 replicas, with active consumers but no messages written yet, and drop one of the pods.
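
For context, a minimal sketch of this kind of setup with the nats.go client referenced above: a replicated stream plus a durable consumer, with no messages ever published. The connection URL, stream subjects, and consumer name are illustrative, not taken from the actual deployment.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the clustered deployment (address is hypothetical).
	nc, err := nats.Connect("nats://nats.nats.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Replicated stream; no messages are published in this scenario.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "orders",
		Subjects: []string{"order.>"},
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Durable consumer bound to the stream (the logs below show 36 of these).
	if _, err := js.AddConsumer("orders", &nats.ConsumerConfig{
		Durable:   "order_worker", // hypothetical name
		AckPolicy: nats.AckExplicitPolicy,
	}); err != nil {
		log.Fatal(err)
	}

	// At this point, deleting one of the three pods reproduces the
	// condition in the title: active consumers, no messages stored yet.
}
```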

Expected result:

The cluster reallocates the consumers and streams once the pod has been restored.

Actual result:

Consumer reallocation fails with:

...
2022/08/24 00:57:52.068833 [INF] Recovering 36 consumers for stream - '$G > orders'
2022/08/24 00:57:52.095499 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_CreateActiveCampaignOrderCommand_order_CreateActiveCampaignOrder/meta.inf": unexpected end of JSON input 
2022/08/24 00:57:52.100074 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_CreateStripeInvoiceCommand_invoice_CreateStripeInvoice/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.102469 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_CustomerForStripeInvoiceRetrieved_invoice_CustomerForStripeInvoiceRetrieved/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.108723 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_OrderFinalizedForActiveCampaignEvent_order_OrderFinalized/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.111255 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_QuickbooksInvoiceCreatedEvent_invoice_QuickbooksInvoiceCreated/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.111863 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_QuickbooksInvoiceCreatedForQuickbooksEvent_invoice_QuickbooksInvoiceCreated/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.112476 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_QuickbooksItemForSKURetrievedEvent_invoice_QuickbooksItemForSKURetrieved/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.119891 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_RetrieveQuickbooksItemForSKUCommand_invoice_RetrieveQuickbooksItemForSKU/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.120464 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_RetrieveQuickbooksItemsForSKUCommand_invoice_RetrieveQuickbooksItemsForSKU/meta.inf": unexpected end of JSON input
2022/08/24 00:57:52.125903 [WRN] Error unmarshalling consumer metafile "/data/jetstream/$G/streams/orders/obs/order_StripeInvoiceItemAddedEvent_invoice_StripeInvoiceItemAdded/meta.inf": unexpected end of JSON input
...
2022/08/24 00:57:52.190412 [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
...
2022/08/24 00:57:53.009951 [WRN] RAFT [cnrtt3eg - C-R3F-2Ku9JZKq] Error storing entry to WAL: raft: could not store entry to WAL
2022/08/24 00:57:53.009978 [WRN] RAFT [cnrtt3eg - C-R3F-2Ku9JZKq] Expected first catchup entry to be a snapshot and peerstate, will retry
... WAL ERRORS repeat continuously
/ # nats-server -DV
[76] 2022/08/24 01:21:23.160500 [INF] Starting nats-server
[76] 2022/08/24 01:21:23.160590 [INF]   Version:  2.8.4
[76] 2022/08/24 01:21:23.160605 [INF]   Git:      [66524ed]
[76] 2022/08/24 01:21:23.160616 [DBG]   Go build: go1.17.10
[76] 2022/08/24 01:21:23.160640 [INF]   Name:     NCYMENHT3CJCYIRW3GWPGGZOVAXXXRYJO7E6YFZ5IT5OUBIFUWUD6M6P
[76] 2022/08/24 01:21:23.160659 [INF]   ID:       NCYMENHT3CJCYIRW3GWPGGZOVAXXXRYJO7E6YFZ5IT5OUBIFUWUD6M6P
[76] 2022/08/24 01:21:23.160742 [DBG] Created system account: "$SYS"
[76] 2022/08/24 01:21:23.161243 [FTL] Error listening on port: 0.0.0.0:4222, "listen tcp 0.0.0.0:4222: bind: address already in use"
cchatfield commented 2 years ago

With GKE node spot instances, the graceful termination period is 30s.

The lowest I can configure lame_duck_duration is 30s.

It seems like sometimes the server exits cleanly and other times the data is left corrupt. I'm not sure whether the attached disk is closed properly or whether the server still has open handles to the files.

How can I tell from the logs that all server processes have completed and are fully shut down?

Is there a way to shorten the lame duck duration + lame_duck_grace_period + overhead to be at or under 30s?
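
For reference, both settings mentioned above are nats-server configuration options; a sketch of how they might look (values are illustrative, and the server enforces its own minimum for the duration; the pod's terminationGracePeriodSeconds would also need to cover the sum plus shutdown overhead):

```
# Sketch of the relevant nats-server settings (values illustrative).
lame_duck_duration: "30s"      # window over which clients are gradually disconnected
lame_duck_grace_period: "10s"  # wait after entering lame duck mode before closing client connections
```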

bruth commented 2 years ago

Hi @cchatfield, the -DV run appears to have exited prematurely due to a port allocation issue (4222 was already in use). Can you reproduce the issue, ensuring all of the output is captured? If the server shuts down properly, you should see log messages indicating this.

Is there a way to shorten the lame duck duration + lame_duck_grace_period + overhead to be at or under 30s?

I believe the minimum would be 30s if the grace period is set to zero. I'm not sure whether the hard minimum of 30s for the duration can be lowered, @derekcollison? I am reading here that GKE provides 25s for graceful shutdown of non-system pods.

However, stepping back, I am curious what the use case is for using spot instances. Is the desire to minimize cost and have them seamlessly swap out over time? They seem to be optimized for stateless workloads, and the best-practices section seems to indicate that both internal and external IPs can change and that local SSDs attached to spot VMs are not persisted.

Although NATS is definitely tolerant of node failures, the behavior with spot instances is somewhat unclear.

cchatfield commented 2 years ago

The instances experiencing the problem have been removed and recreated, so unfortunately I am unable to grab further output.

Use case: I use NATS for short-term messaging. I don't really need more than a few days of message history, and realistically only about an hour. In dev/test I have 3 pods on 2 spot VMs. In prod, I have a dedicated pool of 5 spot nodes to run 5 pods (a cluster).

I am using spot purely for cost. Prod with spot is 22.95/month; without spot, 182.15.

I have been trying to see if I could put the cost savings toward the larger deployment in prod, and it was fairly stable with 2.8.2. I just upgraded to 2.8.4 and it seems to be more problematic. All of that said, if there isn't a viable path to using spot instances, I can change my prod deployment to a smaller cluster on long-term commitments.

bruth commented 2 years ago

Thanks. Just to confirm I understand this bit:

I don't really need more than a few days of message history and realistically about an hour.

Message history (the stream data) can have different retention policies for short-term use cases. Is it the case that messages are published intermittently and then, once those messages are processed, "the work is done"? Is there a fixed window of time the nodes need to be up? Have you run into situations where the majority of nodes are unavailable and publishing fails?
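
As a concrete illustration of the retention point, a minimal sketch with the nats.go client; the one-hour MaxAge echoes the earlier comment, and the policy choice and names are illustrative rather than a recommendation:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Keep messages for at most an hour; LimitsPolicy ages them out,
	// while InterestPolicy or WorkQueuePolicy would instead drop them
	// once all interested consumers have acknowledged them.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:      "orders",
		Subjects:  []string{"order.>"},
		Replicas:  3,
		Retention: nats.LimitsPolicy,
		MaxAge:    time.Hour,
	}); err != nil {
		log.Fatal(err)
	}
}
```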

Ultimately it sounds like you want a pay-per-usage model rather than paying for uptime (what some would call serverless).

cchatfield commented 2 years ago

Messages are published intermittently (depending on levels of activity). I expect to scale to a consistent message flow, but I was looking for cost reduction in the meantime. The cost of troubleshooting a node or stream failure really isn't worth the cost differential between spot and reserved. I will just use reserved, stable instances in prod.