nats-io / nats.go

Golang client for NATS, the cloud native messaging system.
https://nats.io
Apache License 2.0
5.55k stars 696 forks source link

JetStream OrderedConsumer fails to Consume messages upon reconnecting #1370

Closed r-delgadillo closed 10 months ago

r-delgadillo commented 1 year ago

Defect

JetStream OrderedConsumer fails to reconnect upon hard reset on NATs server. Looking at the GoLang package here, it feels like we should add a conditional check if the err received from the server is "Consumer not found" err so we can perform a "reset" on the client which then recreates the Consumer for processing.

Have a Slack channel going here.

Make sure that these boxes are checked before submitting your issue -- thank you!

Versions of nats.go and the nats-server if one was involved:

nats.go: 1.28.0 nats-server: 2.9.21

OS/Container environment:

Kubernetes Minikube Cluster. GoLang Linux

Steps or code to reproduce the issue:

  1. Create Jetstream
  2. Send messages to Stream
  3. Create Ordered Consumer
  4. Hard restart NATs server (Kill/restart pod)
  5. Observe the client reconnect handler successfully reconnects
  6. Consumer attempts to process messages but fails with "Consumer not found" error

Expected result:

Expect OrderedConsumer to recreate the Consumer and continue processing messages where it had left off before the NATs server restarted.

Actual result:

Consumer fails to "reset" and fails to continue processing messages.

Jarema commented 1 year ago

Thanks for filling the issue!

Ordered Pull Consumer should send new fetch requests to the server when it sees reconnect event, and Consumer is not deleted because server restarted. It probably is deleted because until server is up and running again, the InactiveThreshold is reached.

Increasing InactiveThreshold to ~30 seconds would be a good test to check if that's the culprit.

I'm not sure about recreating ordered consumer if it's not there - forcing recreation would make InactiveThreshold config obsolete and would also not allow operators of the cluster to permanently remove specific consuer, as Client would immediately recreate it.

We will take a look at the code.

r-delgadillo commented 1 year ago

@Jarema I attempted to increase the InactiveThreshold to 1 Hour and still no luck. It seems as Consumers are being deleted immediately and not honoring the InactiveThreshold on a hard restart. Below is a screenshot of my Consumers from the NATs box:

image

The reset method in ./jetstream/ordered.go already does recreation of Consumers for certain errors. So it seems like the config is already obsolete to some extent?

Thank you for looking into this.

r-delgadillo commented 1 year ago

Forgot to mention, OrderedConsumers allow the caller to set the MaxResetAttempts -which recreates the Consumer. It seems like recreating the Consumer is by design. To your point, it feels like InactiveThreshold kind of contridicts this setting.

Jarema commented 1 year ago

@piotrpio rightfully pointed out, that the OrderedConsuemer cannot work after server restart, as it uses memory storage on the server. Because of that, server restart wipes Consumer's state.

We will address that by recreating Consumer on Missing Heartbeat event (as not every reconnection means server restart).

Can you elaborate why you think InactiveThreshold contradicts MaxResetAttempts?

r-delgadillo commented 1 year ago

Nice, good find.

Regarding InactiveThreshold & MaxResetAttempts, I was carrying on thinking from a comment you made, "forcing recreation would make InactiveThreshold config obsolete." Thinking about it a bit more, I think recreating the Consumer in certain scenarios - such as missing Heartbeat makes sense. Recreating the Consumer if InactiveThreshold is reached does not make sense - which I believe is what you were saying earlier.

I'm curious how release trains look for nats.go. Do they occur on a certain schedule or are they more on-demand? I'm wondering what the timeframe for the fix looks like.

r-delgadillo commented 1 year ago

@Jarema friendly ping to see if you have any insights on the fix release train. Thanks

piotrpio commented 1 year ago

@r-delgadillo sorry for the delayed response. Releases are done on-demand, but we usually try to do them monthly at least, so we are a bit overdue! I aim to publish the next release next week, I'll make sure to include resetting consumers on missing heartbeats and not resetting them after inactive threshold is reached.

mhill-holoplot commented 1 year ago

I was also having this problem with both the new and legacy (push and pull) jetsream API . In the OrderedConsumer case I sometimes didn't get any error arriving at my ConsumeErrHandlerFunc which corresponded with consumer not resetting and therefore silently failing to receive messages. Most of the time I did get an error and the Consumer would reset and pick up receiving messages.

bdejong commented 1 year ago

@piotrpio is the fix for this in the 2.9..22 release?

piotrpio commented 1 year ago

@bdejong this is a fix in the client, not server, so not yet available. Sorry for the delay, release will hopefully be out soon.

r-delgadillo commented 1 year ago

@piotrpio any updates on this bug, thanks

piotrpio commented 1 year ago

@r-delgadillo Again, sorry for the radio silence. I created a PR which should solve this issue: https://github.com/nats-io/nats.go/pull/1425. Sorry it did not make it to the previous release.

r-delgadillo commented 1 year ago

@piotrpio all good, thank you for the update. Looking forward to have this during the next client release.

r-delgadillo commented 1 year ago

@piotrpio have any updates been made on the PR? Thanks

piotrpio commented 10 months ago

Sorry for not updating this issue... Fix has been released in v1.31.0