nats-io / nats.net.v1

The official C# Client for NATS
Apache License 2.0

NATSConnectionException: Authentication Timeout #883

Open Zetanova opened 5 months ago

Zetanova commented 5 months ago

Observed behavior

The nats.client <=1.1.4 sometimes fails to connect to a server with an "Authentication Timeout" exception. This happens intermittently in VS debug sessions on Windows or Docker Desktop, and in production under k8s with dotnet 6 and 8. The nats server has an auth timeout of 5s.

A new aspnet project in production triggered this issue not just sometimes, but on every startup in k8s.

The nats.client requires real concurrency (2+ cores) to connect. A CPU-limited environment such as docker run --cpus=1 triggers the lock after the UserSignatureEventHandler call.

Expected behavior

The connection should succeed with a CPU limit of 1.

Server and client version

nats.client 1.1.4

Host environment

windows and docker
dotnet 6 & 8

Steps to reproduce

Set the CPU limit to 1 in Docker, or set resources.limits.cpu: 500m in Kubernetes.
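
To confirm that the process really sees a single core under such a limit, a small check like the following could be run inside the container. This is only a sketch and not part of the NATS client: the class name CpuLimitCheck and its Describe method are hypothetical names chosen for illustration. It uses only Environment.ProcessorCount and ThreadPool.GetMinThreads from the .NET base library, and it assumes a runtime recent enough for ProcessorCount to reflect the container's CPU limit.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of NATS.Client): reports how many cores and
// how many on-demand thread-pool threads the runtime sees in this process.
public static class CpuLimitCheck
{
    public static string Describe()
    {
        // Number of processors available to this process as seen by the runtime.
        int cores = Environment.ProcessorCount;

        // Minimum number of threads the pool creates on demand before it
        // starts throttling the creation of additional threads.
        ThreadPool.GetMinThreads(out int minWorkers, out int minIo);

        return $"cores={cores} minWorkers={minWorkers} minIo={minIo}";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If the reported core count is 1, the environment matches the failing setup described above.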

scottf commented 2 months ago

@Zetanova Is this still a problem in 1.1.5? Are there any steps to reproduce?

Zetanova commented 2 months ago

Yes. I work around it for now by wrapping the connect in a task started with TaskCreationOptions.LongRunning, but this does not solve the issue completely.

Workaround:

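// Requires: using System.Threading; using System.Threading.Tasks; using NATS.Client;
// opt is the NATS.Client Options instance used for the connection.
var task = Task.Factory.StartNew(
    // LongRunning hints the scheduler to run this on its own thread, so the
    // blocking connect does not occupy a regular pool worker.
    state =>
    {
        var fc = new ConnectionFactory();
        return fc.CreateConnection((Options)state!);
    },
    opt,
    CancellationToken.None,
    TaskCreationOptions.LongRunning,
    TaskScheduler.Default);

// Block until the connection is established (or the connect throws).
return task.GetAwaiter().GetResult();

scottf commented 2 months ago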

Is the problem that the client wants more threads than the environment it's running in can provide? Is this a real production configuration? This doesn't seem like a defect to me; it just seems like there are minimum requirements to run the client.

Zetanova commented 2 months ago

Yes, in both debug and production, on startup.

The issue is the hidden deadlock.

task.Wait() and wait events are used while running on the thread pool, so at least one other thread-pool thread needs to be idle to execute the connect.

By the time the thread pool decides to increase its thread count during the startup connect, the "Authentication Timeout" exception has already been thrown.
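
One general mitigation for this kind of startup starvation is to raise the thread pool's minimum worker count before connecting, so that a second worker can be created on demand instead of waiting for the pool to grow. The sketch below illustrates that idea and is not a confirmed fix for this issue: the class ThreadPoolWarmup, the method EnsureMinWorkers and the value 4 are hypothetical choices. Only ThreadPool.GetMinThreads and ThreadPool.SetMinThreads are actual .NET APIs.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of NATS.Client).
public static class ThreadPoolWarmup
{
    // Raises the minimum worker count to at least `wanted`.
    // Leaves the setting untouched if it is already high enough.
    public static bool EnsureMinWorkers(int wanted)
    {
        ThreadPool.GetMinThreads(out int workers, out int io);
        if (workers >= wanted)
        {
            return true;
        }

        // SetMinThreads returns false when the runtime rejects the values,
        // for example when they exceed the configured maximum.
        return ThreadPool.SetMinThreads(wanted, io);
    }

    public static void Main()
    {
        // 4 is an arbitrary illustrative value, not a recommendation.
        bool ok = EnsureMinWorkers(4);
        ThreadPool.GetMinThreads(out int workers, out _);
        Console.WriteLine($"applied={ok} minWorkers={workers}");
    }
}
```

A call such as EnsureMinWorkers(4) would go at process startup, before the ConnectionFactory is used. Whether this avoids the timeout under a one-core limit would still need to be verified in the affected environment.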

Zetanova commented 2 months ago

Kubernetes resources.limits.cpu = 1000m will trigger it instantly. It's the same as docker --cpus=1 (cgroup cpu=1).