nats-io / stan.net

The official NATS .NET C# Streaming Client
Apache License 2.0
137 stars 41 forks source link

Cannot re-connect to server after nats-streaming-server is restarted #162

Closed robertmircea closed 4 years ago

robertmircea commented 4 years ago

I've encountered the following issue while trying to reconnect to nats-streaming-server. Here are the steps to reproduce:

  1. nats-streaming-server is running
  2. the app (see demo code below) connects to nats-streaming-server and starts publishing messages
  3. the nats-streaming-server is killed
  4. on each subsequent Publish, STAN.Client.StanConnectionException: The NATS connection is reconnecting is printed
  5. after a while (server still not up), on each Publish, the app starts to receive STAN.Client.StanConnectionClosedException: Connection closed.
  6. I restart nats-streaming-server
  7. The app's ReconnectedEventHandler is fired, detecting nats connection restore.
  8. But the app's StanConnection never recovers and the application can never resume successful publishing.

What can I do in order to allow the app to recover publishing?

Running on OSX Catalina and using latest nuget packages available at the current time:

    <ItemGroup>
        <PackageReference Include="NATS.Client" Version="0.10.0" />
        <PackageReference Include="STAN.Client" Version="0.2.0" />
    </ItemGroup>

Code used for testing:

using System;
using System.Diagnostics;
using System.Threading;
using NATS.Client;
using STAN.Client;

namespace StanTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var cf = new ConnectionFactory();
            var sf = new StanConnectionFactory();
            var natsConnection = cf.CreateConnection(GetOpts());

            var stanOpts = StanOptions.GetDefaultOptions();
            stanOpts.ConnectTimeout = 4000;
            stanOpts.NatsConn = natsConnection;
            stanOpts.PubAckWait = 40000;

            var stanConnection = sf.CreateConnection("test-cluster", "uniq123", stanOpts);
            var watch = Stopwatch.StartNew();

            while (true)
            {
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
        }

        private static Options GetOpts()
        {
            var opts = ConnectionFactory.GetDefaultOptions();
            opts.Url = "nats://localhost:4222";
            opts.AllowReconnect = true;
            opts.PingInterval = 5000;
            opts.MaxPingsOut = 2;
            opts.MaxReconnect = Options.ReconnectForever;
            opts.ReconnectWait = 1000;
            opts.Timeout = 4000;

            opts.ServerDiscoveredEventHandler += (sender, args) => Console.WriteLine("NATS server discovered");

            opts.ReconnectedEventHandler +=
                (sender, args) =>  Console.WriteLine( "NATS server reconnected.");
            opts.ClosedEventHandler +=
                (sender, args) => Console.WriteLine("NATS connection closed");
            opts.DisconnectedEventHandler += (sender, args) =>
                Console.WriteLine("NATS connection disconnected");
            opts.AsyncErrorEventHandler +=
                (sender, args) => Console.WriteLine("NATS async error: {0}, Message={1}, Subject={2}", args.Conn.ConnectedUrl,
                    args.Error, args.Subscription.Subject);

            return opts;
        }
    }   
}

And here are the messages printed to console while running:

/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll
00:00:00.0237181. Published message
00:00:01.0443621. Published message
NATS connection disconnected
00:00:02.0483280 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:03.0528216 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:04.0580304 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:05.0612458 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:06.0655462 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:07.0696808 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:08.0721798 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:09.0763581 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:10.0803322 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:11.0852381 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:12.0898777 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:13.0904445 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:14.0946184 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:15.0953681 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:16.0963306 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:17.1017706 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:18.1047256 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:19.1077148 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:20.1114949 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:21.1162681 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:22.1206395 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:23.1238276 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
NATS server reconnected.
00:00:24.1287867 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:25.1324792 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:26.1373635 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:27.1406061 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:28.1452916 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:29.1501845 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
^C
ColinSullivan1 commented 4 years ago

@robertmircea , we'll take a look and get back to you. Thanks for reporting this.

danielwertheim commented 4 years ago

@robertmircea @ColinSullivan1 I've tested with embedded NATS-Server in NATS-Streaming-Server (v0.16.2) running in CLI: nats-streaming-server -st file -dir .; on Windows and for me, it's continues, once the timeout has occurred.

00:00:33.0597577. Published message
00:00:34.0734042. Published message
00:00:35.0875684. Published message
NATS connection disconnected
00:00:37.8335607 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
NATS server reconnected.
00:01:18.8890016 - Type:STAN.Client.StanException. On Publish: exception message: Timeout occurred.
00:01:19.9063318. Published message
00:01:20.9223684. Published message
00:01:21.9371975. Published message

If I do a really fast shutdown-restart, the timeout doesn't even occur.

00:00:05.0761800. Published message
00:00:06.0860053. Published message
00:00:07.0942296. Published message
NATS connection disconnected
NATS server reconnected.
00:00:08.0995376. Published message
00:00:09.1115235. Published message
00:00:10.1255920. Published message
robertmircea commented 4 years ago

The test was done in macos environment using .net core 3.0 runtime. I will try to repeat it in on Windows and on Linux using the same app.

$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: macOS 10.15.1 (19B88)
      Kernel Version: Darwin 19.0.0
      Boot Volume: Macintosh HD
[...]
robertmircea commented 4 years ago

And the results of the test on Windows Server 2019 and NATS streaming server:

nats-streaming-server version 0.16.2, nats-server: v2.0.4

First run:

C:\Users\robert\Desktop>StanTest.Published\StanTest.exe
00:00:00.0187588. Published message
00:00:01.0473612. Published message
00:00:02.0659592. Published message
NATS connection disconnected
Unhandled exception. System.NotSupportedException: Stream does not support seeking.
   at System.IO.BufferedStream.EnsureCanSeek()
   at System.IO.BufferedStream.get_Position()
   at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer)
   at NATS.Client.Connection.Publish(String subject, String reply, Byte[] data)
   at STAN.Client.Connection.pingServer(Object state)
   at System.Threading.TimerQueueTimer.<>c.<.cctor>b__23_0(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)

Second run:

C:\Users\robert\Desktop>StanTest.Published\StanTest.exe
00:00:00.0261714. Published message
00:00:01.0495942. Published message
00:00:02.0672070. Published message
00:00:03.0847216. Published message
NATS connection disconnected
00:00:04.6050446 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:05.6844930 - Connection closed..
Stan connection lost handler. Exception message: STAN.Client.StanMaxPingsException: Connection lost due to PING failure.
NATS connection disconnected
NATS connection closed
00:00:14.4242885 - Type:STAN.Client.StanConnectRequestTimeoutException. On Publish: exception message: No response from a streaming server with a cluster ID of 'test-cluster'
00:00:18.4472980 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:22.4773151 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:26.4937447 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:30.5249734 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:34.5555004 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:38.5870016 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:42.6032744 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:46.6340461 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:50.6501095 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:54.6970249 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:00:58.7125727 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
00:01:02.7433533 - Type:NATS.Client.NATSConnectionClosedException. On Publish: exception message: Connection is closed.
^C
danielwertheim commented 4 years ago

@robertmircea with regards to the initial issue (not the Stream seek issue), what's happening is that the outgoing pings exceeds the maximum outgoing pings and then the StanConnection is closed, but as the NATSConnection was injected, this is not cleaned up as the STANConnection doesn't own it. So those handlers keep working.

You could perhaps use stanOpts.ConnectionLostEventHandler to do a manual reconnect?

@ColinSullivan1 What's your take on this? Is this expected behavior? Should something be changed?

ColinSullivan1 commented 4 years ago

I would suggest increasing the number of pings and ping interval first to what best fits the environment (how quickly do you need to detect a lost connection and how many times should you retry before giving up).

e.g. Here are 10 minutes of retry attempts with a 5 second timeout.

StanOptions so = StanOptions.GetDefaultOptions();
so.PingInterval = 5000;
so.PingMaxOutstanding = 120;

In addition, one can schedule a manual reconnect in the ConnectionLostEventHandler and it'd be good practice to log there for debugging.

robertmircea commented 4 years ago

@ColinSullivan1 It is only a test app to demonstrate the issue. Ping timeout was set so low only to show the exception with Connection closed which never recovers.

In the example, I wanted to surface the bug about accessing the stream's Position. It seems to occur when a ping request is about to be sent to the server and the server is killed. In my opinion, since ping is initiated by the library without developer's control, it would be desirable not to crash the app and throw unexpected exceptions since I would expect the library to reconnect by itself silently when a disconnection occurs. The event handlers are useful if my app requires to handle these events.

ColinSullivan1 commented 4 years ago

I think disabling the reconnect buffer would fix this, and would be in-line with the golang and C clients. Given this is NATS streaming, I don't think the client should make an effort to resend (and use the memory buffer) - the streaming server will handle missed messages.

Internally, we'll need to override the ReconnectBufSize option... In the meantime, if you add this to your NATS options, it'll avoid this issue. opts.ReconnectBufSize = Options.ReconnectBufferDisabled

danielwertheim commented 4 years ago

@ColinSullivan1 Will setting it to Options.ReconnectBufferDisabled work? It seems like the NATS Client will then throw an exception:

public static readonly int ReconnectBufferDisabled = -1;
if (status == ConnState.RECONNECTING)
{
    int rbsize = opts.ReconnectBufferSize;
    if (rbsize != 0)
    {
        if (rbsize == -1)
            throw new NATSReconnectBufferException("Reconnect buffering has been disabled.");

        if (flushBuffer)
            bw.Flush();
        else
            kickFlusher();

        if (pending != null && bw.Position + count + pubProtoLen > rbsize)
            throw new NATSReconnectBufferException("Reconnect buffer exceeded.");
    }
}
danielwertheim commented 4 years ago

@ColinSullivan1 Think I found it, it should be Unbounded, not Disabled, right?

opts.ReconnectBufferSize = Options.ReconnectBufferSizeUnbounded
ColinSullivan1 commented 4 years ago

This was resolved - the reconnect buffer was disabled.

robertmircea commented 4 years ago

How can I test the fix and see if the error was solved? Are there going to be published new nuget packages soon for NATS and STAN clients? Thanks!

ColinSullivan1 commented 4 years ago

I've pushed a core NATS prerelease: https://www.nuget.org/packages/NATS.Client/0.10.1-pre2

This should address the issue. If you disable the reconnect buffer through the NATS connection options:

opts.ReconnectBufSize = Options.ReconnectBufferDisabled

Then you should be good to go.

robertmircea commented 4 years ago

Upgraded to latest NATS.Client but now I encounter a different unhandled error (which crashes the test app) in the timer callback which handles automatic ping to NATS server:

NATS client config:

var opts = ConnectionFactory.GetDefaultOptions();
            opts.Url = "nats://localhost:4222";
            opts.ReconnectBufferSize = Options.ReconnectBufferDisabled;
            opts.AllowReconnect = true;
            opts.PingInterval = 5000;
            opts.MaxPingsOut = 2;
            opts.MaxReconnect = Options.ReconnectForever;
            opts.Timeout = 4000;
            opts.Verbose = true;
/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll
00:00:00.0278589. Published message
00:00:01.0492146. Published message
00:00:02.0767149. Published message
00:00:03.0911927. Published message
NATS connection disconnected
00:00:07.6953715 - Type:STAN.Client.StanConnectionException. On Publish: eUnhandled exception. Unhandled exception. Unhandled exception. xception message: The NATS connection is reconnecting
Stan connection lost handler. Exception message: STAN.Client.StanMaxPingsException: Connection lost due to PING failure.
NATS.Client.NATSReconnectBufferException: Reconnect buffering has been disabled.
   at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer)
   at NATS.Client.Connection.Publish(String subject, String reply, Byte[] data)
   at STAN.Client.Connection.pingServer(Object state)
   at System.Threading.TimerQueueTimer.<>c.<.cctor>b__23_0(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)
NATS.Client.NATSReconnectBufferException: Reconnect buffering has been disabled.
   at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer)
   at NATS.Client.Connection.Publish(String subject, String reply, Byte[] data)
   at STAN.Client.Connection.pingServer(Object state)
   at System.Threading.TimerQueueTimer.<>c.<.cctor>b__23_0(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)NATS.Client.NATS
Process finished with exit code 6.
NATS-CI commented 4 years ago

Thanks for testing this out Robert. We'll add a fix soon.

On Mon, Dec 9, 2019 at 2:29 AM Robert Mircea notifications@github.com wrote:

Upgraded to latest NATS.Client but now I encounter a different unhandled error (which crashes the test app) in the timer callback which handles automatic ping to NATS server:

NATS client config:

var opts = ConnectionFactory.GetDefaultOptions(); opts.Url = "nats://localhost:4222"; opts.ReconnectBufferSize = Options.ReconnectBufferDisabled; opts.AllowReconnect = true; opts.PingInterval = 5000; opts.MaxPingsOut = 2; opts.MaxReconnect = Options.ReconnectForever; opts.Timeout = 4000; opts.Verbose = true;

/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll 00:00:00.0278589. Published message 00:00:01.0492146. Published message 00:00:02.0767149. Published message 00:00:03.0911927. Published message NATS connection disconnected 00:00:07.6953715 - Type:STAN.Client.StanConnectionException. On Publish: eUnhandled exception. Unhandled exception. Unhandled exception. xception message: The NATS connection is reconnecting Stan connection lost handler. Exception message: STAN.Client.StanMaxPingsException: Connection lost due to PING failure. NATS.Client.NATSReconnectBufferException: Reconnect buffering has been disabled. at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer) at NATS.Client.Connection.Publish(String subject, String reply, Byte[] data) at STAN.Client.Connection.pingServer(Object state) at System.Threading.TimerQueueTimer.<>c.<.cctor>b__23_0(Object state) at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) --- End of stack trace from previous location where exception was thrown --- at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool) at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool) at System.Threading.TimerQueue.FireNextTimers() at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id) NATS.Client.NATSReconnectBufferException: Reconnect buffering has been disabled. at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer) at NATS.Client.Connection.Publish(String subject, String reply, Byte[] data) at STAN.Client.Connection.pingServer(Object state) at System.Threading.TimerQueueTimer.<>c.<.cctor>b__23_0(Object state) at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) --- End of stack trace from previous location where exception was thrown --- at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool) at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool) at System.Threading.TimerQueue.FireNextTimers() at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)NATS.Client.NATS Process finished with exit code 6.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nats-io/stan.net/issues/162?email_source=notifications&email_token=AE63KGJOVSCPU6MI57HFUFLQXYFWNA5CNFSM4JTHTZ4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGIOEFI#issuecomment-563143189, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE63KGKLR5VZOG6HV4STIQDQXYFWNANCNFSM4JTHTZ4A .

robertmircea commented 4 years ago

Is this fix completed and library pushed to nuget.org? I would like to give it another try.

ColinSullivan1 commented 4 years ago

I will get to this today.

ColinSullivan1 commented 4 years ago

I'll be looking into a few test issues with this fix over the next week - please be patient over the holidays.

ColinSullivan1 commented 4 years ago

@robertmircea, Thanks for your patience. I've published a prerelease with a fix - please let me know how it goes.

https://www.nuget.org/packages/STAN.Client/0.2.1-pre1

robertmircea commented 4 years ago

I've tested again. A little bit of improvement meaning that NATS reconnection seems to be performed correctly, but unfortunatelly the publisher cannot publish any new message after the initial connection was lost. It throws continuously on every publish attempt with "connection closed" and never recovers, as it should.

/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll
00:00:00.5892349. Published message
00:00:01.6119463. Published message
00:00:02.6329777. Published message
NATS connection disconnected
00:00:03.6464577 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:04.6505331 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
Stan connection lost handler. Exception message: STAN.Client.StanMaxPingsException: Connection lost due to PING failure.
00:00:05.6508873 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:06.6522138 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:07.6542787 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:08.6550990 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
NATS server reconnected.
00:00:09.6565906 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:10.6596245 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:11.6646643 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:12.6691837 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:13.6708061 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:14.6717212 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
^C
Process finished with exit code 2.
ColinSullivan1 commented 4 years ago

Thanks for testing! We must be missing a test for this, I'd have thought this would have been picked up by testing. I'll look into this right now.

ColinSullivan1 commented 4 years ago

On a second look, this is behaving as expected. Once the STAN connection is lost, you need to re-establish the connection. The underlying NATS connection has been re-established, but from the perspective of the STAN client the NATS streaming server is not available.

In your case, the NATS streaming server is running in embedded mode (there's a core NATS server embedded in the NATS streaming server), so it may seem odd that the NATS server connection recovered and the NATS streaming server connection did not. Based on the configuration of the NATS streaming pings the streaming connection is lost, while the core NATS configuration allowed the core NATS connection to reconnect. In your case this can be remedied by ensuring the STAN ping parameters (StanOptions.PingInterval and StanOptions.PingMaxOutstanding) are equivalent to or more tolerant than the core NATS connection configuration. If you never want the STAN client to determine a connection has been lost and instead rely on the underlying core NATS connection, just set StanOptions.PingInterval and StanOptions.PingMaxOutstanding to large values.

The different ping settings are in place because the NATS streaming server can run as part of a core NATS cluster or standalone mode, where the client does not have a direct connection to the streaming server.

Does this make sense?

P.S. In JetStream, you will not have this issue as JetStream functionality will be embedded into a core NATS server.

robertmircea commented 4 years ago

Many thanks for your time spent on this issue! Is kind of odd and I can never imagine a practical situation in which I would not like that a STAN client to recover when ping failed but the enclosed NATS connection is alive. I was led by the impression that STAN connection should be long lived (and act as a singleton) similar to NATS connection object. In my prod app, I was registering the STAN connection as a singleton with the DI container. So, to conclude: you are telling me that:

Either:

  1. disable (set very high timeout/no of pings) ping on STAN connection, but leave it on for NATS connection
  2. or, recreate STAN connection every time when the underlying connection to NATS is closed.

Could you reconsider the architectural decision of not recovering STAN connection in case that NATS streaming server is running in embedded mode and STAN client using a not self-managed NATS connection and make somehow transparent for the STAN client how the NATS server is ran?

I will try to reproduce this test using the java streaming server client and see if the behaviour is similar.

ColinSullivan1 commented 4 years ago

@robertmircea, absolutely, we're happy to help and thank you for working through this issue with is. I hear you and do understand that it doesn't make a lot of sense when you know the streaming server is the server your client is connected to.

For the time being, option 1 (disable) is what I recommend. I'm not sure how many new features we'll add into the STAN client at this time, but will take a look at the effort and implications.

Moving forward with JetStream (beta next quarter) you won't need to worry about this problem. JetStream is built into core NATS at the server and you won't need to manage two connection types in the client (everything there will be core NATS as well). We're aiming to make the transition easy.

robertmircea commented 4 years ago
  1. could you give more info / links on what will be possible with JetStream?
  2. please officially publish the updated NATS and STAN clients in Nuget.org (now they are pre-release)

Thank you!

robertmircea commented 4 years ago

One more thing which I observed right now while trying to close a STAN connection while NATS Streaming server was down. The app crashed with unhandled exception because opts.ReconnectBufferSize = Options.ReconnectBufferDisabled.

I don't know if is normal, but maybe it should be somehow documented.

The stacktrace:

Unhandled exception. NATS.Client.NATSReconnectBufferException: Reconnect buffering has been disabled.
   at NATS.Client.Connection.publish(String subject, String reply, Byte[] data, Int32 offset, Int32 count, Boolean flushBuffer)
   at NATS.Client.Connection.requestSync(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.request(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.Request(String subject, Byte[] data)
   at STAN.Client.Connection.Close()
   at SGw.Program.Program.cs:line 98
ColinSullivan1 commented 4 years ago

Nope, not normal - that's another place we need to check for that exception. I'll fix this today, and once reviewed and merged, publish a release.

ColinSullivan1 commented 4 years ago

@robertmircea , here you go... https://www.nuget.org/packages/STAN.Client/0.2.1

robertmircea commented 4 years ago

Thank you!

shokri-navid commented 3 years ago

I have encountered the same situation in version 0.3.0. when my Stan connection entered to Connection Closed Mode it never comes to connect state and all publish requests throw the Connection Close Exception.

this is my code to register stan Connection:

public static class Extensions
    {
        private static IStanConnection _stanConnection; 
        public static IServiceCollection AddStanClient(this IServiceCollection services, StanSettings settings)
        {

            var stanOptions = StanOptions.GetDefaultOptions();
            var natsOptions = ConnectionFactory.GetDefaultOptions();
            var stanClientId = $"{settings.ClientId}-{Guid.NewGuid()}";
            var stanClusterId = settings.ClusterId;
            natsOptions.Servers = settings.NatsServers.ToArray();
            natsOptions.AllowReconnect = true;
            natsOptions.MaxReconnect = Options.ReconnectForever;
            natsOptions.PingInterval = 2000;
            natsOptions.MaxPingsOut = 2;
            natsOptions.ReconnectWait = 1000;
            natsOptions.ReconnectBufferSize = Options.ReconnectBufferSizeUnbounded;
            natsOptions.Timeout = 4000;

            natsOptions.ClosedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                Console.WriteLine($"connection lost on connection {arg.Conn.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.Error.StackTrace}");
            };

            natsOptions.ReconnectedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                stanOptions.NatsConn = arg.Conn;
                Console.WriteLine($"connection Recovered by connection id {arg.Conn.ConnectedId}");
            }; 

            stanOptions.NatsConn = new ConnectionFactory().CreateConnection(natsOptions);
            stanOptions.ConnectionLostEventHandler += (object sender, StanConnLostHandlerArgs arg) => {
                Console.WriteLine($"connection lost on connection {arg.Connection.NATSConnection.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.ConnectionException.StackTrace}"); 
            };

            _stanConnection = new StanConnectionFactory().CreateConnection(stanClusterId, stanClientId, stanOptions); 
            services.AddSingleton<IStanConnection>( _stanConnection);

            return services;
        }
    }

and I use this connection like this :

static Task Main(string[] args)
        {
            var host = CreateHostBuilder(args).Build();

            var scope = host.Services.CreateScope();

            var watch = new Stopwatch();
            watch.Start();
            while (true)
            {
                var stanConnection = scope.ServiceProvider.GetRequiredService<IStanConnection>();
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
            Console.WriteLine("Hello World!");

            return host.StartAsync();
        }

        static IHostBuilder CreateHostBuilder(string[] args) =>
            Host.CreateDefaultBuilder(args)
                .ConfigureServices((_, services) =>
                    services.AddStanClient(new StanSettings
                    {
                        ClientId = "uniq123",
                        ClusterId = "test-cluster",
                        NatsServers = new List<string>{"nats://localhost:4222"}
                    })); 

    }
and this is the output of my code when I stop the Stan service and after a while, I start it again :

00:00:13.7476360. Published message 00:00:14.7542784. Published message 00:00:15.7602945. Published message 00:00:16.7658160. Published message 00:00:17.7862428. Published message 00:00:18.8041595. Published message 00:00:20.6005542 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:21.6487055 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:22.7012163 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:23.7444252 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:24.7905985 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:25.8348325 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:26.8816522 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:27.9295129 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:28.9737369 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:30.0154864 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:31.0607351 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:32.1057689 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:33.1495434 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:34.1919119 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting 00:00:35.2404412 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:36.3016650 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:37.3464719 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:38.3887135 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:39.4637382 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:40.5152720 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:41.5576260 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. connection Recovered by connection id NDIVMUUHKJN644WRMAFRWW7GB5TD3ZPYVE2CIJWFMCLJDXCWIBDURO6L 00:00:42.6050295 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:43.6454686 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:44.6864267 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed. 00:00:45.7310466 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.