Current suspicion is that executor might have been blocked on a thread that async-nats's task was sharing, will test this theory.
Hey!
First of all, keep in mind that it's totally normal for the client to not know that the connection is down, especially if there is not much traffic coming through.
That's why there is PING in the NATS protocol, which makes sure that the connection is functioning fine.
In most cases, the client will figure out that the TCP connection is down (or rather, the TcpStream will), but not always. In those cases, the client relies on the PING.
If you want to make the client more sensitive, you can specify a lower ping interval when creating the client.
When it comes to publishing: in Core NATS, the client has a publish buffer that aggregates messages to optimize throughput. If the connection is stale, the client will buffer those messages and try to send them, but until the buffer is full it will not error. You can tweak this using the client capacity.
For getting events on the connection, use event_callback.
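For reference, a minimal sketch of those knobs (my own example with a placeholder URL and arbitrary values; the builder methods ping_interval, client_capacity, and event_callback are as I recall them and may differ slightly between versions):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::ConnectOptions::new()
        // Lower ping interval -> the client notices a dead TCP connection sooner.
        .ping_interval(Duration::from_secs(10))
        // Smaller internal command buffer -> publishing errors out earlier when
        // the connection is stale instead of silently buffering.
        .client_capacity(128)
        // Connection lifecycle events (connected / disconnected / errors).
        .event_callback(|event| async move {
            println!("NATS event: {event}");
        })
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await?;

    client.publish("greet.joe", "hello".into()).await?;
    client.flush().await?;
    Ok(())
}
```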
I understand, but I don't believe there was a network issue that would cause the connection to drop in the first place: other clients running on the same exact machine are fine, which means the TCP connection would not go down for no reason.
I checked whether executor was blocked and I don't believe that was the case.
According to the latest investigation details, even the sending channel inside the NATS client was clogged, so somehow the NATS client was neither sending nor receiving messages for quite a few seconds. Here is a delay on a client.publish() call:
2024-07-03T11:24:16.106925Z subspace_farmer::cluster::nats_client: Publishing stream response messages response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 num_messages=1 index=250 is_done=false
2024-07-03T11:24:30.439486Z subspace_farmer::cluster::nats_client: Waiting for acknowledgement response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 expected_index=249
And the client did not receive messages from the server in the middle of that:
[1] 2024/07/03 11:24:26.072162 [INF] 10.20.120.11:52368 - cid:9 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 246 chunks of 15731404 total bytes.
I asked the user to run the app with async_nats=trace, but I'm wondering if someone can spot an edge case, since the code handling sending and receiving is non-trivial.
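For reference, enabling that filter inside the app would look roughly like this (a sketch, assuming the app routes logs through tracing-subscriber with the env-filter feature and is run with RUST_LOG=async_nats=trace):

```rust
// Requires the `tracing-subscriber` crate with the `env-filter` feature enabled.
// Run the binary with e.g. `RUST_LOG=async_nats=trace`.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
}
```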
async-nats logging is not super consistent (.request() results in a trace message and .publish() doesn't, for example); the only additional piece of information received at trace level is that the connection is closed this way:
2024-07-08T08:09:22.691898Z DEBUG async_nats: retry on initial connect failure is disabled
2024-07-08T08:09:22.691909Z DEBUG async_nats::connector: connecting
2024-07-08T08:09:22.696141Z DEBUG async_nats::connector: connected to 4222
2024-07-08T08:09:22.697733Z DEBUG async_nats: received PONG
2024-07-08T08:10:24.865933Z DEBUG async_nats: received PONG
2024-07-08T09:24:57.076777Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T09:24:57.076839Z DEBUG async_nats::connector: connecting
2024-07-08T09:24:59.463834Z DEBUG async_nats::connector: connected to 4222
2024-07-08T09:24:59.464263Z DEBUG async_nats: reconnected
2024-07-08T10:10:52.426351Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:10:52.426408Z DEBUG async_nats::connector: connecting
2024-07-08T10:10:52.433101Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:10:52.433268Z DEBUG async_nats: reconnected
2024-07-08T10:26:50.410512Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:26:50.410583Z DEBUG async_nats::connector: connecting
2024-07-08T10:26:52.195553Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:26:52.195798Z DEBUG async_nats: reconnected
2024-07-08T10:30:37.433346Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:30:37.433405Z DEBUG async_nats::connector: connecting
2024-07-08T10:30:38.498512Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:30:38.498682Z DEBUG async_nats: reconnected
2024-07-08T11:25:05.451461Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T11:25:05.451536Z DEBUG async_nats::connector: connecting
2024-07-08T11:25:05.641422Z DEBUG async_nats::connector: connected to 4222
2024-07-08T11:25:05.738965Z DEBUG async_nats: reconnected
2024-07-08T12:55:19.602283Z DEBUG async_nats: disconnected err=Some(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
2024-07-08T12:55:19.602340Z DEBUG async_nats::connector: connecting
2024-07-08T12:55:22.762921Z DEBUG async_nats::connector: connected to 4222
2024-07-08T12:55:22.768459Z DEBUG async_nats: reconnected
2024-07-08T12:57:23.732625Z DEBUG async_nats: received PONG
2024-07-08T14:53:43.206239Z DEBUG async_nats: disconnected err=Some(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
2024-07-08T14:53:43.210118Z DEBUG async_nats::connector: connecting
2024-07-08T14:53:43.220725Z DEBUG async_nats::connector: connected to 4222
2024-07-08T14:53:43.221056Z DEBUG async_nats: reconnected
2024-07-08T15:27:18.218409Z DEBUG async_nats: received PONG
Looks like it doesn't send pings unless there is a lack of other activity, which makes sense, but why would it disconnect so frequently with ConnectionReset?
After reading through various tokio issues I have found https://github.com/tokio-rs/tokio/issues/4941 and decided to try https://github.com/tokio-rs/tokio/pull/4936 to work around it, but client still stops receiving messages the same way from time to time.
What else can I do to find why connection breaks? I see no reason for this to happen, yet it does.
I wonder: isn't this a case of a Slow Consumer? If there are a lot of messages and the client subscriber cannot keep up with the rate, at some point the server will disconnect the slow client to protect itself.
Please check the server logs.
Also: can you provide reproduction, so I can run it on my machine?
The server logs a slow consumer (see the messages above), but the client doesn't print a slow consumer warning, so that side is fine. The Rust client simply stops sending and receiving messages occasionally and later wakes up all of a sudden. I'm quite confident the client is able to keep up with orders of magnitude higher load.
There is no reproduction, unfortunately, and it is not 100% deterministic; it happens anywhere from once or twice an hour to once every few hours. The app is quite large (though 100% open source) and extremely heavy on CPU and RAM, so I would not necessarily recommend trying to run it.
NATS server and client hardware for context are mentioned in the very first message.
The server will immediately shut down a client that is causing a Slow Consumer.
Yes, I can see that from server logs:
[1] 2024/07/11 17:53:35.327473 [INF] 10.20.120.11:43944 - cid:9 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 17 chunks of 1048747 total bytes.
[1] 2024/07/11 17:53:35.327598 [DBG] 10.20.120.11:43944 - cid:9 - Client connection closed: Slow Consumer (Write Deadline)
This is why I'm wondering why the client would stop sending and receiving messages all of a sudden when the app is running fine and the wired network is up all the time.
Ah, you think the slow consumer is the outcome of the client not processing anything, not the other way around.
I can run some workloads, but I need at least the rough scenario of what needs to happen to reproduce it.
Ah, you think the slow consumer is the outcome of the client not processing anything, not the other way around.
Yes, that is my only plausible conclusion so far.
I can run some workloads, but I need at least the rough scenario of what needs to happen to reproduce it.
There are two services (often on the same machine) communicating through NATS (the server is always on a different machine). The pattern is a stream response: one app sends a request and the other streams responses to the requester's subject, while the requester sends async acks back for backpressure. The responder is the app that disconnects periodically, never the requester.
A streaming response has ~1G of data chunked into ~1.8M chunks (the server is configured with a 2M message size limit). The responder sends two messages and then waits for one ack before sending each next message.
The responder is sometimes streaming back multiple responses to different clients, interleaved with requests to a third service (a tiny few-byte request and a single ~1M response for each request).
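For illustration, a heavily simplified sketch of the responder side of that pattern (my own, with made-up subjects and without the two-message window the real code uses):

```rust
use futures::StreamExt;

// Stream pre-chunked payloads to the requester's subject, waiting for an ack
// on a dedicated ack subject before sending the next chunk.
async fn stream_response(
    client: &async_nats::Client,
    response_subject: String,
    ack_subject: String,
    chunks: Vec<bytes::Bytes>,
) -> Result<(), async_nats::Error> {
    let mut acks = client.subscribe(ack_subject).await?;

    for (index, chunk) in chunks.into_iter().enumerate() {
        // Publish the next ~1.8M chunk to the requester's subject.
        client.publish(response_subject.clone(), chunk).await?;
        client.flush().await?;

        // Backpressure: wait for the requester's ack before continuing.
        let _ack = acks
            .next()
            .await
            .ok_or_else(|| format!("ack subscription closed at chunk {index}"))?;
    }
    Ok(())
}
```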
This is just Core NATS, without JetStream, and I assume you're using publish_with_reply + subscription?
Just Core NATS, regular client.publish with subscription.
Here is the exact code for the stream request, with the stream response right below it (I know it is a lot, just in case it is helpful): https://github.com/subspace/subspace/blob/fabdd292ec43c492d975eab1830bd391c8ad6aa6/crates/subspace-farmer/src/cluster/nats_client.rs#L658-L693
Are you by any chance running blocking tasks on the async runtime? tokio-console can help you find that out.
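For reference, a minimal sketch of wiring that up (assuming the console-subscriber crate is added as a dependency and the binary is built with RUSTFLAGS="--cfg tokio_unstable"):

```rust
#[tokio::main]
async fn main() {
    // Registers a tracing layer that the `tokio-console` TUI connects to,
    // showing per-task poll times and which tasks block worker threads.
    console_subscriber::init();

    // ... rest of the application
}
```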
Generally: yes, but it is done carefully in a few strategic places with block_in_place inside a separate task, to make sure the rest of the tasks, and the runtime in general, are not blocked.
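A minimal sketch of that pattern (my own illustration; heavy_cpu_work is a placeholder):

```rust
use tokio::task;

// Placeholder for CPU-heavy, blocking computation.
fn heavy_cpu_work() -> u64 {
    (0..10_000_000u64).sum()
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let handle = task::spawn(async {
        // block_in_place requires the multi-threaded runtime; it tells the
        // scheduler this worker thread is about to block, moving its other
        // queued tasks elsewhere, then runs the closure inline.
        task::block_in_place(heavy_cpu_work)
    });

    let result = handle.await.expect("blocking task panicked");
    println!("result: {result}");
}
```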
Since the NATS client runs in its own task created by the library, it should have been able to at least fill the channels with incoming messages and print a client-side slow consumer warning, but it didn't do anything at all.
I have triple-checked the code and have not found any violations that would cause such behavior.
Client-side slow consumers are something different from server-side slow consumers: a client-side slow consumer is triggered when the client pushes received messages into its internal buffer quicker than the user is consuming them, causing the buffer to reach its capacity.
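A sketch of reducing that risk on the client side (my own illustration; the subscription_capacity value, subject, and handoff channel size are arbitrary, and the builder method name is as I recall it from recent async-nats versions):

```rust
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::ConnectOptions::new()
        // Larger internal buffer per subscription before a client-side
        // slow consumer is reported.
        .subscription_capacity(65_536)
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await?;

    let mut subscription = client.subscribe("chunks.>").await?;

    // Hand each message off immediately instead of processing it inline,
    // so the subscription buffer drains as fast as possible.
    let (tx, mut rx) = tokio::sync::mpsc::channel(1024);
    tokio::spawn(async move {
        while let Some(message) = subscription.next().await {
            if tx.send(message).await.is_err() {
                break;
            }
        }
    });

    while let Some(message) = rx.recv().await {
        // Expensive processing happens here, off the hot receive path.
        let _ = message.payload.len();
    }
    Ok(())
}
```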
Yes, I found that in the code already. I see no reason for the application to be unable to read a message from the socket and push it into the subscription-specific channel for 10 seconds straight.
Hey @nazar-pc, I tried to reproduce it locally and so far no luck. It would be awesome if you had reproduction code.
Thanks for checking it! I think at this point it'd be great to add some targeted instrumentation into the library so I can ask users to run a custom build of the software and see how it behaves. I don't have a reliable and easy to use reproduction myself, let alone one I can share with you.
After tracking down a few very confusing bugs, I no longer believe this is a bug in nats.rs. Sorry for wasting your time!
No worries. Good to hear you were able to fix the problem!
Observed behavior
Under heavy CPU load the connection sometimes breaks in a way that is not observable, and the root cause is not clear.
There are many applications connecting to the NATS server from this machine, but only one of them broke (the one that is most heavily loaded). The application sends a notification and waits for another app to acknowledge it, which never arrived:
In fact, the message going out (with client.publish) silently didn't make it out at all; the other application that was supposed to receive this message worked fine all this time and only missed messages from this broken sender. Just a few seconds prior to this it was working fine and making successful requests; there was also a subscription started after some of the messages were published that may or may not be related here:
I can see from the NATS server logs that the server wasn't able to deliver messages to the client for some reason (even though other apps on the same machine were working just fine, so the network link is not an issue here):
Only much later did the client detect that it was apparently disconnected; the other async_nats messages at the default log level before this go all the way back to the initial connection:
And things started working again for some time after that.
While the connection was broken, requests were all failing with request timed out: deadline has elapsed. The only connection option customized is the request timeout, which was set to 5 minutes (due to some async responders that may take a while to respond).
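For reference, that single customization would look roughly like this (a sketch with a placeholder URL):

```rust
use std::time::Duration;

// Connect with the request timeout raised to 5 minutes, so slow async
// responders don't cause request() to time out prematurely.
async fn connect() -> Result<async_nats::Client, async_nats::ConnectError> {
    async_nats::ConnectOptions::new()
        .request_timeout(Some(Duration::from_secs(300)))
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await
}
```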
I'll try to collect logs with async_nats=debug or async_nats=trace for better insight into what was happening there.
Expected behavior
I expect to see errors, or for errors to be returned from the library, when messages are not making it out; I don't expect things to break silently.
Server and client version
nats-server: v2.10.17
async-nats: 0.35.1
Host environment
The NATS server is running on a dual-socket 2680v3 machine (24 cores / 48 threads) with a 20 Gbps network link. The client (alongside the other apps) is running on a dual-socket 7742 machine (128 cores / 256 threads) with a 100 Gbps link.
This is a user-reported setup; I do not have access to it myself and am not sure what kind of switch they're using to connect those machines, but I don't think it matters much in this case.
Steps to reproduce
No response