Current suspicion is that executor might have been blocked on a thread that async-nats's task was sharing, will test this theory.
Hey!
First of all, keep in mind that it's totally normal for the client to not know that the connection is down, especially if there is not much traffic coming through.
That's why there is PING in the NATS protocol, which makes sure that the connection is functioning fine.
In most cases, the client will figure out that the TCP connection is down (or rather, the TcpStream will), but not always. In those cases, the client relies on the PING.
If you want to make the client more sensitive, you can specify a lower ping interval when creating the client.
When it comes to publishing: in Core NATS, the client has a publish buffer that aggregates messages to optimize throughput. If the connection is stale, the client will buffer those messages and try to send them, but until the buffer is full it will not error. You can tweak this using the client capacity.
For getting events on the connection, use event_callback.
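For reference, a minimal sketch of those knobs (my own example with a placeholder URL and arbitrary values; the builder methods ping_interval, client_capacity, and event_callback are as I recall them and may differ slightly between versions):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::ConnectOptions::new()
        // Lower ping interval -> the client notices a dead TCP connection sooner.
        .ping_interval(Duration::from_secs(10))
        // Smaller internal command buffer -> publishing errors out earlier when
        // the connection is stale instead of silently buffering.
        .client_capacity(128)
        // Connection lifecycle events (connected / disconnected / errors).
        .event_callback(|event| async move {
            println!("NATS event: {event}");
        })
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await?;

    client.publish("greet.joe", "hello".into()).await?;
    client.flush().await?;
    Ok(())
}
```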
I understand, but I don't believe there was a network issue that would cause the connection to drop in the first place: other clients running on the same exact machine are fine, which means the TCP connection would not go down for no reason.
I checked whether executor was blocked and I don't believe that was the case.
According to the latest investigation details, even the sending channel inside the NATS client was clogged, so somehow the NATS client was neither sending nor receiving messages for quite a few seconds. Here is a delay on a client.publish() call:
2024-07-03T11:24:16.106925Z subspace_farmer::cluster::nats_client: Publishing stream response messages response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 num_messages=1 index=250 is_done=false
2024-07-03T11:24:30.439486Z subspace_farmer::cluster::nats_client: Waiting for acknowledgement response_subject=stream-response.01J1W455837Z5XPXG1P8Z2W8E5 expected_index=249
And the client did not receive messages from the server in the middle of that:
[1] 2024/07/03 11:24:26.072162 [INF] 10.20.120.11:52368 - cid:9 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 246 chunks of 15731404 total bytes.
I asked the user to run the app with async_nats=trace, but I'm wondering if someone can spot an edge case, since the code handling sending and receiving is non-trivial.
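For reference, enabling that filter inside the app would look roughly like this (a sketch, assuming the app routes logs through tracing-subscriber with the env-filter feature and is run with RUST_LOG=async_nats=trace):

```rust
// Requires the `tracing-subscriber` crate with the `env-filter` feature enabled.
// Run the binary with e.g. `RUST_LOG=async_nats=trace`.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
}
```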
async-nats logging is not super consistent (.request() results in a trace message and .publish() doesn't, for example); the only additional piece of information received at trace level is that the connection is closed this way:
2024-07-08T08:09:22.691898Z DEBUG async_nats: retry on initial connect failure is disabled
2024-07-08T08:09:22.691909Z DEBUG async_nats::connector: connecting
2024-07-08T08:09:22.696141Z DEBUG async_nats::connector: connected to 4222
2024-07-08T08:09:22.697733Z DEBUG async_nats: received PONG
2024-07-08T08:10:24.865933Z DEBUG async_nats: received PONG
2024-07-08T09:24:57.076777Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T09:24:57.076839Z DEBUG async_nats::connector: connecting
2024-07-08T09:24:59.463834Z DEBUG async_nats::connector: connected to 4222
2024-07-08T09:24:59.464263Z DEBUG async_nats: reconnected
2024-07-08T10:10:52.426351Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:10:52.426408Z DEBUG async_nats::connector: connecting
2024-07-08T10:10:52.433101Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:10:52.433268Z DEBUG async_nats: reconnected
2024-07-08T10:26:50.410512Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:26:50.410583Z DEBUG async_nats::connector: connecting
2024-07-08T10:26:52.195553Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:26:52.195798Z DEBUG async_nats: reconnected
2024-07-08T10:30:37.433346Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T10:30:37.433405Z DEBUG async_nats::connector: connecting
2024-07-08T10:30:38.498512Z DEBUG async_nats::connector: connected to 4222
2024-07-08T10:30:38.498682Z DEBUG async_nats: reconnected
2024-07-08T11:25:05.451461Z DEBUG async_nats: disconnected err=Some(Kind(ConnectionReset))
2024-07-08T11:25:05.451536Z DEBUG async_nats::connector: connecting
2024-07-08T11:25:05.641422Z DEBUG async_nats::connector: connected to 4222
2024-07-08T11:25:05.738965Z DEBUG async_nats: reconnected
2024-07-08T12:55:19.602283Z DEBUG async_nats: disconnected err=Some(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
2024-07-08T12:55:19.602340Z DEBUG async_nats::connector: connecting
2024-07-08T12:55:22.762921Z DEBUG async_nats::connector: connected to 4222
2024-07-08T12:55:22.768459Z DEBUG async_nats: reconnected
2024-07-08T12:57:23.732625Z DEBUG async_nats: received PONG
2024-07-08T14:53:43.206239Z DEBUG async_nats: disconnected err=Some(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
2024-07-08T14:53:43.210118Z DEBUG async_nats::connector: connecting
2024-07-08T14:53:43.220725Z DEBUG async_nats::connector: connected to 4222
2024-07-08T14:53:43.221056Z DEBUG async_nats: reconnected
2024-07-08T15:27:18.218409Z DEBUG async_nats: received PONG
Looks like it doesn't send pings unless there is a lack of other activity, which makes sense, but why would it disconnect so frequently with ConnectionReset?
After reading through various tokio issues I have found https://github.com/tokio-rs/tokio/issues/4941 and decided to try https://github.com/tokio-rs/tokio/pull/4936 to work around it, but client still stops receiving messages the same way from time to time.
What else can I do to find why connection breaks? I see no reason for this to happen, yet it does.
I wonder: isn't this a case of a Slow Consumer? If there are a lot of messages and the client subscriber cannot keep up with the rate, at some point the server will disconnect the slow client to protect itself.
Please check the server logs.
Also: can you provide reproduction, so I can run it on my machine?
The server logs a slow consumer (see the messages above), but the client doesn't print a slow consumer warning, so that side is fine. The Rust client simply stops sending and receiving messages occasionally and later wakes up all of a sudden. I'm quite confident the client is able to keep up with orders of magnitude higher load.
There is no reproduction, unfortunately, and it is not 100% deterministic; it happens anywhere from once or twice an hour to once every few hours. The app is quite large (though 100% open source) and extremely heavy on CPU and RAM, so I would not necessarily recommend trying to run it.
NATS server and client hardware for context are mentioned in the very first message.
The server will immediately shut down a client that is causing a Slow Consumer.
Yes, I can see that from server logs:
[1] 2024/07/11 17:53:35.327473 [INF] 10.20.120.11:43944 - cid:9 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 17 chunks of 1048747 total bytes.
[1] 2024/07/11 17:53:35.327598 [DBG] 10.20.120.11:43944 - cid:9 - Client connection closed: Slow Consumer (Write Deadline)
This is why I'm wondering why the client would stop sending and receiving messages all of a sudden when the app is running fine and the wired network is up all the time.
Ah, you think the slow consumer is the outcome of the client not processing anything, not the other way around.
I can run some workloads, but I need at least the rough scenario of what needs to happen to reproduce it.
Ah, you think the slow consumer is the outcome of the client not processing anything, not the other way around.
Yes, that is my only plausible conclusion so far.
I can run some workloads, but I need at least the rough scenario of what needs to happen to reproduce it.
There are two services (often on the same machine) communicating through NATS (the server is always on a different machine). The pattern is a stream response: one app sends a request and the other streams responses to the requester's subject, while the requester sends async acks back for backpressure. The responder is the app that disconnects periodically, never the requester.
A streaming response has ~1G of data chunked into ~1.8M chunks (the server is configured with a 2M message size limit). The responder sends two messages and then waits for one ack before sending each next message.
The responder is sometimes streaming back multiple responses to different clients, interleaved with requests to a third service (a tiny few-byte request and a single ~1M response for each request).
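For illustration, a heavily simplified sketch of the responder side of that pattern (my own, with made-up subjects and without the two-message window the real code uses):

```rust
use futures::StreamExt;

// Stream pre-chunked payloads to the requester's subject, waiting for an ack
// on a dedicated ack subject before sending the next chunk.
async fn stream_response(
    client: &async_nats::Client,
    response_subject: String,
    ack_subject: String,
    chunks: Vec<bytes::Bytes>,
) -> Result<(), async_nats::Error> {
    let mut acks = client.subscribe(ack_subject).await?;

    for (index, chunk) in chunks.into_iter().enumerate() {
        // Publish the next ~1.8M chunk to the requester's subject.
        client.publish(response_subject.clone(), chunk).await?;
        client.flush().await?;

        // Backpressure: wait for the requester's ack before continuing.
        let _ack = acks
            .next()
            .await
            .ok_or_else(|| format!("ack subscription closed at chunk {index}"))?;
    }
    Ok(())
}
```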
This is just Core NATS, without JetStream, and I assume you're using publish_with_reply + subscription?
Just Core NATS, regular client.publish with subscription.
Here is the exact code for the stream request, with the stream response right below it (I know it is a lot, just in case it is helpful): https://github.com/subspace/subspace/blob/fabdd292ec43c492d975eab1830bd391c8ad6aa6/crates/subspace-farmer/src/cluster/nats_client.rs#L658-L693
Are you by any chance running blocking tasks on the async runtime? tokio-console can help you find that out.
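For reference, a minimal sketch of wiring that up (assuming the console-subscriber crate is added as a dependency and the binary is built with RUSTFLAGS="--cfg tokio_unstable"):

```rust
#[tokio::main]
async fn main() {
    // Registers a tracing layer that the `tokio-console` TUI connects to,
    // showing per-task poll times and which tasks block worker threads.
    console_subscriber::init();

    // ... rest of the application
}
```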
Generally: yes, but it is done carefully in a few strategic places with block_in_place inside a separate task, to make sure the rest of the tasks, and the runtime in general, are not blocked.
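A minimal sketch of that pattern (my own illustration; heavy_cpu_work is a placeholder):

```rust
use tokio::task;

// Placeholder for CPU-heavy, blocking computation.
fn heavy_cpu_work() -> u64 {
    (0..10_000_000u64).sum()
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let handle = task::spawn(async {
        // block_in_place requires the multi-threaded runtime; it tells the
        // scheduler this worker thread is about to block, moving its other
        // queued tasks elsewhere, then runs the closure inline.
        task::block_in_place(heavy_cpu_work)
    });

    let result = handle.await.expect("blocking task panicked");
    println!("result: {result}");
}
```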
Since the NATS client runs in its own task created by the library, it should have been able to at least fill the channels with incoming messages and print a client-side slow consumer warning, but it didn't do anything at all.
I have triple-checked the code and have not found any violations that would cause such behavior.
Client-side slow consumers are something different from server-side slow consumers: a client-side slow consumer is triggered when the client pushes received messages into its internal buffer quicker than the user is consuming them, causing the buffer to reach its capacity.
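A sketch of reducing that risk on the client side (my own illustration; the subscription_capacity value, subject, and handoff channel size are arbitrary, and the builder method name is as I recall it from recent async-nats versions):

```rust
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::ConnectOptions::new()
        // Larger internal buffer per subscription before a client-side
        // slow consumer is reported.
        .subscription_capacity(65_536)
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await?;

    let mut subscription = client.subscribe("chunks.>").await?;

    // Hand each message off immediately instead of processing it inline,
    // so the subscription buffer drains as fast as possible.
    let (tx, mut rx) = tokio::sync::mpsc::channel(1024);
    tokio::spawn(async move {
        while let Some(message) = subscription.next().await {
            if tx.send(message).await.is_err() {
                break;
            }
        }
    });

    while let Some(message) = rx.recv().await {
        // Expensive processing happens here, off the hot receive path.
        let _ = message.payload.len();
    }
    Ok(())
}
```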
Yes, I found that in the code already. I see no reason for the application to be unable to read a message from the socket and push it into the subscription-specific channel for 10 seconds straight.
Hey @nazar-pc, I tried to reproduce it locally and so far no luck. It would be awesome if you had reproduction code.
Thanks for checking it! I think at this point it'd be great to add some targeted instrumentation into the library so I can ask users to run a custom build of the software and see how it behaves. I don't have a reliable and easy to use reproduction myself, let alone one I can share with you.
After tracking down a few very confusing bugs, I no longer believe this is a bug in nats.rs. Sorry for wasting your time!
No worries. Good to hear you were able to fix the problem!
Observed behavior
Under heavy CPU load the connection sometimes breaks in a way that is not observable, and the root cause is not clear.
There are many applications connecting to the NATS server from this machine, but only one of them broke (the one that is most heavily loaded). The application sends a notification and waits for another app to acknowledge it, which never arrived:
In fact, the message going out (with client.publish) silently didn't make it out at all; the other application that was supposed to receive this message worked fine all this time and only missed messages from this broken sender. Just a few seconds prior to this it was working fine and making successful requests; there was also a subscription started after some of the messages were published that may or may not be related here:
I can see from the NATS server logs that the server wasn't able to deliver messages to the client for some reason (even though other apps on the same machine were working just fine, so the network link is not an issue here):
Only much later did the client detect that it was apparently disconnected; the other async_nats messages at the default log level before this go all the way back to the initial connection:
And things started working again for some time after that.
While the connection was broken, requests were all failing with request timed out: deadline has elapsed. The only connection option customized is the request timeout, which was set to 5 minutes (due to some async responders that may take a while to respond).
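For reference, that single customization would look roughly like this (a sketch with a placeholder URL):

```rust
use std::time::Duration;

// Connect with the request timeout raised to 5 minutes, so slow async
// responders don't cause request() to time out prematurely.
async fn connect() -> Result<async_nats::Client, async_nats::ConnectError> {
    async_nats::ConnectOptions::new()
        .request_timeout(Some(Duration::from_secs(300)))
        .connect("nats://demo.nats.io:4222") // placeholder URL
        .await
}
```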
I'll try to collect logs with async_nats=debug or async_nats=trace for better insight into what was happening there.
Expected behavior
I expect to see errors, or for errors to be returned from the library, when messages are not making it out; I don't expect things to break silently.
Server and client version
nats-server: v2.10.17
async-nats: 0.35.1
Host environment
The NATS server is running on a dual-socket 2680v3 machine (24 cores / 48 threads) with a 20 Gbps network link. The client (alongside the other apps) is running on a dual-socket 7742 machine (128 cores / 256 threads) with a 100 Gbps link.
This is a user-reported setup; I do not have access to it myself and am not sure what kind of switch they're using to connect those machines, but I don't think it matters much in this case.
Steps to reproduce
No response