nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Add a possibility to administratively drop a client from NATS #1023

Open wutkemtt opened 4 years ago

wutkemtt commented 4 years ago

[ ] Defect
[X] Feature Request or Change Proposal

Feature Requests

Use Case:

When a client threatens the health of a NATS server or cluster by opening too many connections, or drains too much performance by requesting messages without the MaxPubInFlight option set, it should be possible to drop this client from the server or cluster.

Proposed Change:

NATS needs a signal or a special published message to drop the client in question.

Who Benefits From The Change(s)?

All NATS server / cluster operators facing a similar situation

Alternative Approaches

kozlovic commented 4 years ago

@wutkemtt Again, in this case it would just add to the reconnecting problem discussed in the other issue. I understand that it is frustrating when a bad actor creates havoc in the NATS cluster, but a server cannot prevent a rogue client from misbehaving. It can't prevent a client from creating a TCP connection. It could try to hold that connection and do nothing with it, but the client is free to close the connection (connection timeout) and try again, etc.

You can limit the impact by vetting the applications allowed to connect to the cluster and giving them username/password credentials, so that only those applications can use the cluster (again, even that does not prevent a rogue client from trying to connect without proper credentials).

Keeping this open for discussion, along with the other issue, to try to find a viable solution.

wutkemtt commented 4 years ago

Just to explain it a little more: I propose giving our operations personnel a manual way to drop clients from the NATS infrastructure if there is clear evidence that a client misbehaves and threatens the whole cluster. Based on alerts and information from metrics and logs, an administrator would be able to identify these clients. Dropping them early would prevent further damage and hopefully a cluster outage as well.

ripienaar commented 4 years ago

Given that the client will probably just reconnect immediately if it’s administratively disconnected, maybe with or without an ID, simply disconnecting it won’t help. What other kind of disconnect / deny method do you propose? To be of use you would need to disconnect it and keep it disconnected, and keep it disconnected at a layer before the NATS server TCP port.

One option, at the NATS layer, is to set limits on the JWT. Today we have a per-account max connections limit, which might be a viable option? I don’t know how feasible it is, but I imagine a per-user limit could also be supported, so you would give each service a user and then limit the number of connections it’s allowed. Again, I don’t know the JWT code or how feasible such a feature would be, but I am mentioning it to see what you think and whether it would help.

Even with that, though, a badly behaving microservice can make tens of thousands of connections very rapidly; such limits would be evaluated and the client disconnected, only for it to reconnect again and consume more resources. So at the core it would not help that much either. Any form of limit we add would apply after a client is authenticated.

In my own client code I add an exponential back-off around initial reconnects to help with this; mitigations we add on the server can only do so much.
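For what it's worth, a minimal sketch of that kind of back-off around the initial connect with the Go streaming client could look like the following (the cluster ID, client ID, and the two-minute cap are arbitrary placeholders, not anything prescribed by the library):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/stan.go"
)

// connectWithBackoff retries the initial Streaming connect with a capped
// exponential back-off so a struggling cluster is not hammered with
// rapid connect attempts.
func connectWithBackoff(clusterID, clientID, url string) stan.Conn {
	wait := time.Second
	const maxWait = 2 * time.Minute
	for {
		sc, err := stan.Connect(clusterID, clientID, stan.NatsURL(url))
		if err == nil {
			return sc
		}
		log.Printf("connect failed (%v), retrying in %v", err, wait)
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	// "test-cluster" and "example-client" are placeholders for this sketch.
	sc := connectWithBackoff("test-cluster", "example-client", stan.DefaultNatsURL)
	defer sc.Close()
}
```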

kozlovic commented 4 years ago

@wutkemtt In reference to the other issue you opened, it looks to me like these two issues contradict each other. In one you want to protect against a possible denial of service from clients that (re)connect a lot, but here you ask for a way to forcibly close a misbehaving client, which, as @ripienaar mentioned, would lead to just that: the client trying to reconnect right away.

I have discussed at length in the other issue (https://github.com/nats-io/nats-streaming-server/issues/1022#issuecomment-587559728) why I believe the server unfortunately cannot take action to prevent this. I should also have pointed out that for some users, having an application that connects, sends a message, then disconnects is considered normal (although this is a behavior that we discourage, we would not want to break it).

wutkemtt commented 4 years ago

When a client floods the NATS cluster with messages or reconnects in a harmful manner, what are the options for preventing further damage and getting the cluster back online quickly? In our experience the cluster starts dropping all other clients, so that communication breaks down completely. That means our apps and services are offline and we have an outage. Wouldn't it be better to detect such clients and block or drop them? Perhaps that would keep the cluster from failing completely and give the rest of the apps and services a chance to keep running. I think it is not realistic to rely on correctly implemented clients. In a cloud environment, a cluster that cannot defend itself opens a huge attack vector for denial-of-service attacks. As operators our primary task is to defend the cluster at all costs, but I don't see many possibilities for doing that.

ripienaar commented 4 years ago

I think the question is still: block by what means? They will just reconnect. We can’t block the IP on our own to drop traffic at a TCP level.

They will just reconnect. This is why in the web server world you employ a separate WAF that protects your servers from these situations.

There are tools in the intrusion detection world that can profile incoming traffic and, by means of firewall rules, drop clients making many connections; those might help in your case. We don’t have experience using any of them with NATS, though I think it’s something we could probably be engaged around. Ultimately that solution would be very tailored to your environment.

ripienaar commented 4 years ago

You can also do things like iptables -A INPUT -p tcp --syn -m limit --limit 1/s --limit-burst 3 -j RETURN, which will rate-limit new connections from specific IPs, etc. But these are not things the NATS server gets involved with. As advice, I am sure we can be engaged to help explore this kind of approach.

kozlovic commented 4 years ago

@wutkemtt Instead of closing the connection, I am wondering if it would not be better in this case to put the client in a mode where the server stops reading from its socket. Since most of our client libraries use a blocking socket to send messages, this would effectively block the application at some point, and at the very least stop the flooding of the cluster. That being said, the client library will ultimately detect the lack of PONG to its PING, close the connection, and try again, but overall it may be a better approach than a disconnect (in that it may delay the reconnect)? Thoughts?

wutkemtt commented 4 years ago

I had not thought of the standard client libraries as a possible solution vector before. I like the idea of hardening the default client implementation a little bit. Would it be possible to default maxPubAcksInFlight and MaxInFlight to 1, applying these values to all connections until explicitly overridden? Furthermore, would it be possible to integrate a counter for each target into the connect method of the NATS client, slowing down additional connect calls to the same target more and more when the target server is not reachable? When a connect succeeds, the counter could be reset. This would discourage hand-rolled connect calls used to implement a custom reconnect instead of relying on the default reconnect feature already built into the lib.
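For reference, both knobs can already be set explicitly, per connection and per subscription, in the Go streaming client; a minimal sketch, with the cluster ID, client ID, and subject as placeholders:

```go
package main

import (
	"log"

	"github.com/nats-io/stan.go"
)

func main() {
	// "test-cluster", "example-client", and "orders" are placeholders.
	sc, err := stan.Connect("test-cluster", "example-client",
		// Limit the number of published messages awaiting an ack.
		stan.MaxPubAcksInflight(1),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Limit the number of outstanding messages delivered to this subscriber.
	if _, err := sc.Subscribe("orders", func(m *stan.Msg) {
		log.Printf("received: %s", m.Data)
	}, stan.MaxInflight(1)); err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive so the subscription can receive messages
}
```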

kozlovic commented 4 years ago

Would it be possible to default maxPubAcksInFlight and MaxInFlight to 1, applying these values to all connections until explicitly overridden?

I have to be honest, this is going to be tough because if we change the library to now default to these low values unless explicitly set, then it means that users that just upgrade the library will suddenly see the performance of their app drop tremendously.

Furthermore, would it be possible to integrate a counter for each target into the connect method of the NATS client, slowing down additional connect calls to the same target more and more when the target server is not reachable? When a connect succeeds, the counter could be reset.

Help me understand this: for each failed connect attempt, the library currently moves the attempted URL to the end of the list. But for each server it has already attempted to connect to, it pauses for ReconnectWait seconds. What you are suggesting is that this amount of time increases over time while it fails to connect? And to be clear, is the failure to connect here a TCP failure, or something else?

This would discourage hand-rolled connect calls used to implement a custom reconnect instead of relying on the default reconnect feature already built into the lib.

Sorry, I am not sure I understand the statement above. Could you please elaborate? Also, if it is easier to discuss, we can arrange a call or, if you are on Slack, exchange direct messages. Let me know.

wutkemtt commented 4 years ago

Our developer teams used their own reconnect implementation for the NATS connection due to a lack of knowledge and unread documentation. This caused the connection issue in our cluster. So my idea was to have a counter for each failed connection that, combined with a fixed time step, leads to an ever-increasing wait time for additional connections to the same failed target server. If a connection then succeeds, all counters could be reset.
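As a rough sketch of that proposal (all names and the one-second step are hypothetical, and this lives in application code rather than in the library), such a per-target counter could look like this in Go:

```go
package main

import (
	"log"
	"sync"
	"time"

	"github.com/nats-io/stan.go"
)

// targetBackoff tracks failed connect attempts per target URL, as proposed
// above: every failure against a URL increases the wait before the next
// attempt to that same URL, and a successful connect resets the counter.
type targetBackoff struct {
	mu       sync.Mutex
	failures map[string]int
}

func newTargetBackoff() *targetBackoff {
	return &targetBackoff{failures: make(map[string]int)}
}

// Connect waits according to the per-target counter, then attempts a
// Streaming connect and updates the counter based on the outcome.
func (b *targetBackoff) Connect(clusterID, clientID, url string) (stan.Conn, error) {
	b.mu.Lock()
	n := b.failures[url]
	b.mu.Unlock()

	if n > 0 {
		wait := time.Duration(n) * time.Second // fixed one-second step per failure
		if wait > time.Minute {
			wait = time.Minute // arbitrary cap for this sketch
		}
		log.Printf("%d previous failures for %s, waiting %v before trying again", n, url, wait)
		time.Sleep(wait)
	}

	sc, err := stan.Connect(clusterID, clientID, stan.NatsURL(url))

	b.mu.Lock()
	if err != nil {
		b.failures[url]++
	} else {
		b.failures[url] = 0 // reset on success, as proposed
	}
	b.mu.Unlock()
	return sc, err
}

func main() {
	bo := newTargetBackoff()
	sc, err := bo.Connect("test-cluster", "example-client", stan.DefaultNatsURL)
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()
}
```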

In the meantime, after some discussions between the DevOps and developer teams, we are planning to provide a standardized wrapper library for the developers, which requires some mandatory options and ensures that specific parameters are set within predefined boundaries. We want to make sure that all clients use the same connection mechanism and all set a clientid, maxinflight, and maxpubinflight when connecting to STAN.
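A minimal sketch of such a wrapper in Go, where the package name, the ceiling, and the fixed ConnectWait are all hypothetical choices, not anything mandated by the streaming client:

```go
// Package stanwrap is a sketch of the kind of standardized wrapper library
// described above. All names and limits here are hypothetical.
package stanwrap

import (
	"errors"
	"time"

	"github.com/nats-io/stan.go"
)

// Upper bound the (hypothetical) operations team allows for publisher acks
// in flight.
const maxPubAcksCeiling = 256

// Connect enforces a mandatory client ID and clamps MaxPubAcksInflight to
// the agreed boundaries before handing off to stan.Connect.
func Connect(clusterID, clientID, url string, maxPubAcksInflight int) (stan.Conn, error) {
	if clientID == "" {
		return nil, errors.New("clientID is mandatory")
	}
	if maxPubAcksInflight < 1 || maxPubAcksInflight > maxPubAcksCeiling {
		return nil, errors.New("maxPubAcksInflight outside the allowed boundaries")
	}
	return stan.Connect(clusterID, clientID,
		stan.NatsURL(url),
		stan.MaxPubAcksInflight(maxPubAcksInflight),
		stan.ConnectWait(10*time.Second), // example of a fixed, vetted setting
	)
}
```

Applications would then import such a wrapper instead of the streaming client directly, so the mandatory parameters and boundaries are enforced in one place; subscription-side limits such as MaxInflight could be wrapped the same way.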

Julienhardis commented 3 years ago

Hello, we use NATS in our system with agents that have a persistent clientid. Sometimes when we have an incident in production and try to restart an agent that has stopped unexpectedly, we can't, because the clientid already exists for the NATS Streaming service. Our workaround is to restart the service, but it's not optimal because other publisher clients can't send messages while the service is down, and we can probably lose some messages. Do you have a better solution? Is a means to disconnect a client with a new command a conceivable new feature? Thank you in advance for your answer.

kozlovic commented 3 years ago

@Julienhardis

try to restart an agent that has stopped unexpectedly, we can't, because the clientid already exists for the NATS Streaming service.

If a streaming client is stopped without having a chance to call connection.Close(), then indeed the server does not know and still considers this client connected. However, if another application tries to connect with the same client ID, the server detects it as a duplicate, and part of this process is to then contact the "old" client on an inbox to see if it responds. If it does, the server rejects the new client; if that check fails, it replaces the old client with the new one. So if you fail to start a new client with the same client ID, it means that the "old" one is actually still running. Are you sure the first client has really failed? I may not have understood what you were saying, so please correct me if I am wrong.

Julienhardis commented 3 years ago

Thank you for your answer (You perfectly understood my problem).

Yes, our client is down, but the NATS thread persisted in the JVM. We can see it with the Java console.

Maybe it is related to the version 2.2.3 we use, because I found this issue which fixes a similar problem in version 2.8.0:

https://github.com/nats-io/nats.java/issues/324

Do you think this version may fix my problem?

kozlovic commented 3 years ago

If your client is "down" but the library is still running, especially the thread that replies to heartbeats from the server, then yes, you won't be able to restart any client using that client ID.

I would recommend trying the latest client (and server) releases; it is always a good idea to try the latest release of the lib and/or server to see whether the problem has already been fixed.

Let me know if upgrading has addressed this issue. Thanks!