nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0
2.51k stars 284 forks source link

Add server configuration options for mandatory client connection options #1022

Open wutkemtt opened 4 years ago

wutkemtt commented 4 years ago

Feature Requests

Use Case:

Misconfigured clients are able to destabilize or crash a whole nats cluster with missing MaxPubInFlight option or too many re-connection attempts (thundering herd).

Proposed Change:

The NATS server should be able to limit the number of connections from a single client, preventing it to create a new connection every time a reconnect happens, while the old connection is still there. Furthermore it should be able to block client connections if specific connection features are not set, especially MaxPubInFlight. Unset this option leads to enormous caching amount on NATS-Streaming server side leading to unresponsive clusters and dropped client connections when requesting a huge number of messages.

Who Benefits From The Change(s)?

All NATS server / cluster operators and their customers benefit from a stable cluster, who can defend itself against "rogue" clients.

Alternative Approaches

kozlovic commented 4 years ago

I have transferred this issue since it is related to NATS Streaming, not NATS Server.

Our client libraries usually set a default for MaxPubAcksInflight when you use default options, but it is possible that some don't. However, this is a client setting, meaning that the server has no control on that. This option is to make publish async calls to block (in the client) if the client has sent that many messages without getting acks back from the server.

If we were to change the protocol to include this parameter in the connect request the server could simply reject the connection if the value is not set. However, this would be a compatibility issue since the server would not be able to tell the difference between old clients that do not pass this value to the connect request, and new client that incorrectly had user set it to 0. Furthermore, if the value is set but very high, the end result would be the same.

This report has several asks, so it may become messy, but regarding the number of reconnect, what are you suggesting? For instance, with the issue of MaxPubAcksInflight and if the server was to reject the connect, it would lead to the issue of the client trying to reconnect. What do you want the server to do then? The server already "prevents to create a new connection every time a reconnect happens, while the old connection is still there", this is what the ClientID is for. Only one connection with the same client ID is allowed at any time.

We understand and want to make sure that the NATS cluster stays healthy, but some of the proposed approaches won't work. We need to discuss this more.

wutkemtt commented 4 years ago

The clientID would solve the reconnection issue for us, if it is mandatory, but it is not. You can connect to a NATS server without mention a clientid. The NATS server will generate an own ID for the client and the current connection. So everytime the same client creates a connection it gets a new id and the new connection can not be blocked as duplicated. That is a real problem because such clients behave like denial of sevrice attacks against the NATS using all the available client connections until the overall limit is reached. Is there any chance to prevent such a situation?

kozlovic commented 4 years ago

@wutkemtt I have transferred this issue to NATS Streaming because you referenced the issue being a client that did not set MaxPubInflight option, and this is only relevant to NATS Streaming, not core NATS. And as I mentioned, this option is a client level option, not server. It is correct that NATS core does not have the concept of client ID when creating a connection, and there is no way from the NATS Server to prevent a denial of service if malicious clients bombard it creating TCP connections. Any auth (user/password) is obviously done after the connection is accepted.

As I hinted before, even if the server were to detect that a remote IP has been connecting/disconnecting very quickly in the past say few seconds, it (the server) could hold the connection after the accept but not do anything with it, but a rogue client can always close its connection or simply try to connect again: this would still use up all TCP ports. (furthermore, using the remote IP as a filter is not advised because of possibly NAT, etc..).

The bottom line is that if in this case the filtering has to be done upstream.

I think that you had an issue that after investigation it turned out to be an application flooding the NATS Streaming server because that application was not setting MaxPubInflight. We could make sure that if you are using one of the library that does not set this as a default, we fix those libraries. But ultimately, you will need to vet your applications and make sure they behave correctly. I don't see currently an approach that would have prevented the issue you had, but I am surely opened to suggestions.

wutkemtt commented 4 years ago

The MaxPubInflight issue was the root cause for the flooding of our nats streaming instance. The reconnect problem is also based on a fault client implementation. So it seems, the only way to prevent harm to our nats cluster and to prevent thousands of euro costs when the system break is to "certify" the client implementations of all participants wanting to send data over the nats cluster. Got it, thanks for the explanation.

kozlovic commented 4 years ago

@wutkemtt I am sorry if my answer sounded like "this is your problem". I had left the issue opened because I wanted us to discuss the issue and possible approaches to fix the problem. I did not have an answer on how to prevent the issue, but does not mean that we would not find one. I merely tried to explain that the current proposed approach would not help.

So I am re-opening this and let's brainstorm and try to find a solution.

wutkemtt commented 4 years ago

Sorry, sounded more like can't be solved in NATS server. I would like to follow the client approach on the other request as this could be a way, how to develop an initial solution pattern for the current problems.