nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0
2.51k stars 284 forks source link

Client Publish Timeout on Reconnect After Stream Server Restart #150

Closed injaelee closed 8 years ago

injaelee commented 8 years ago

When the stream server is restarted, the client calls the reconnect callback but times out in publishing the message.

I've briefly gone through the open/closed issues with the keyword "reconnect". But I haven't seen anything related to the issue at hand. I am wondering whether I am missing something. I would appreciate your help.

Here is how to reproduce the problem.

1. Start the stream server.
2. Run the client that connects to the stream server.
3. Take the stream server down. (Ctrl-C)
4. Observe that the client reconnected to the stream server.
5. Publish message to the stream server.
6. Publish times out.

Here is my client code. [ Link: https://gist.github.com/injaelee/694bef09413f4e8829b63f589a64548d ]

Here is how I run the server.

go run src/github.com/nats-io/nats-streaming-server/nats-streaming-server.go -SDV

Here are the log messages from the server including the event of CTRL-C and the re-start.

Ooo> go run src/github.com/nats-io/nats-streaming-server/nats-streaming-server.go  -SDV
[6326] 2016/09/12 12:56:31.385102 [INF] Starting nats-streaming-server[test-cluster] version 0.2.2
[6326] 2016/09/12 12:56:31.385316 [INF] Starting nats-server version 0.9.2
[6326] 2016/09/12 12:56:31.385325 [INF] Listening for client connections on 0.0.0.0:4222
[6326] 2016/09/12 12:56:31.385402 [INF] Server is ready
[6326] 2016/09/12 12:56:31.467176 [TRC] STAN:  NATS conn opts: { [nats://localhost:4222] false NATS-Streaming-Server-test-cluster false false false <nil> true 60 2s 2s 2m0s 2 0xb87c0 0xb8590 0xb86c0 0xb8830 8388608 8192   }
[6326] 2016/09/12 12:56:31.720723 [DBG] Did not detect another server instance.
[6326] 2016/09/12 12:56:31.721031 [DBG] STAN: Discover subject:    _STAN.discover.test-cluster
[6326] 2016/09/12 12:56:31.721041 [DBG] STAN: Publish subject:     _STAN.pub.3VN8Z8ampTvnpBbfvi4lZr.>
[6326] 2016/09/12 12:56:31.721045 [DBG] STAN: Subscribe subject:   _STAN.sub.3VN8Z8ampTvnpBbfvi4lbO
[6326] 2016/09/12 12:56:31.721047 [DBG] STAN: Unsubscribe subject: _STAN.unsub.3VN8Z8ampTvnpBbfvi4lcv
[6326] 2016/09/12 12:56:31.721050 [DBG] STAN: Close subject:       _STAN.close.3VN8Z8ampTvnpBbfvi4leS
[6326] 2016/09/12 12:56:31.721241 [INF] STAN: Message store is MEMORY
[6326] 2016/09/12 12:56:31.721251 [INF] STAN: Maximum of 1000000 will be stored
[6326] 2016/09/12 12:56:38.227092 [DBG] STAN: [Client:test-client-00] Connected (Inbox=_INBOX.iaCMJhwL4ZimZOGg8hN6jx)
[6326] 2016/09/12 12:56:38.227386 [TRC] STAN: [Client:test-client-00] Acking Publisher subj=test-1 guid=iaCMJhwL4ZimZOGg8hN6zp
[6326] 2016/09/12 12:56:46.229096 [TRC] STAN: [Client:test-client-00] Acking Publisher subj=test-1 guid=iaCMJhwL4ZimZOGg8hN757
[6326] 2016/09/12 12:56:46.229560 [TRC] STAN: [Client:test-client-00] Acking Publisher subj=test-1 guid=iaCMJhwL4ZimZOGg8hN7AP
^C[6326] 2016/09/12 12:56:50.751982 [DBG] STAN: Shutting down.
[6326] 2016/09/12 12:56:50.752043 [DBG] STAN: connection has been closed
[6326] 2016/09/12 12:56:50.752222 [INF] Server Exiting..

Ooo> go run src/github.com/nats-io/nats-streaming-server/nats-streaming-server.go  -SDV
[6376] 2016/09/12 12:56:54.146735 [INF] Starting nats-streaming-server[test-cluster] version 0.2.2
[6376] 2016/09/12 12:56:54.146896 [INF] Starting nats-server version 0.9.2
[6376] 2016/09/12 12:56:54.146938 [INF] Listening for client connections on 0.0.0.0:4222
[6376] 2016/09/12 12:56:54.147008 [INF] Server is ready
[6376] 2016/09/12 12:56:54.227650 [TRC] STAN:  NATS conn opts: { [nats://localhost:4222] false NATS-Streaming-Server-test-cluster false false false <nil> true 60 2s 2s 2m0s 2 0xb87c0 0xb8590 0xb86c0 0xb8830 8388608 8192   }
[6376] 2016/09/12 12:56:54.483774 [DBG] Did not detect another server instance.
[6376] 2016/09/12 12:56:54.484028 [DBG] STAN: Discover subject:    _STAN.discover.test-cluster
[6376] 2016/09/12 12:56:54.484038 [DBG] STAN: Publish subject:     _STAN.pub.Jy3lDuQvpDxLEf3eS6zBoO.>
[6376] 2016/09/12 12:56:54.484044 [DBG] STAN: Subscribe subject:   _STAN.sub.Jy3lDuQvpDxLEf3eS6zBtf
[6376] 2016/09/12 12:56:54.484048 [DBG] STAN: Unsubscribe subject: _STAN.unsub.Jy3lDuQvpDxLEf3eS6zByw
[6376] 2016/09/12 12:56:54.484052 [DBG] STAN: Close subject:       _STAN.close.Jy3lDuQvpDxLEf3eS6zC4D
[6376] 2016/09/12 12:56:54.484393 [INF] STAN: Message store is MEMORY
[6376] 2016/09/12 12:56:54.484408 [INF] STAN: Maximum of 1000000 will be stored

Here is the output from the client code.

Publishing [test-1]: Message is [0]
Publishing [test-1]: Message is [1]
Publishing [test-1]: Message is [2]
[WARN] Client[test-client-00] disconnected.
Publishing [test-1]: Message is [3]
[WARN] Client[test-client-00] reconnected to nats://localhost:4222.
[ERROR] Publishing message 'Message is [3]': stan: publish ack timeoutPublishing [test-1]: Message is [4]
[ERROR] Publishing message 'Message is [4]': stan: publish ack timeoutPublishing [test-1]: Message is [5]
kozlovic commented 8 years ago

@injaelee Thanks for your interest in NATS Streaming!

I believe the issue is that you are not using the file based store (default is memory only), so no state is persisted when you restart the server. When the server restarts, it gets new subject prefixes and this is why your publisher is able to reconnect to NATS Server, but sent messages go nowhere so to speak, and so the publisher times-out. To run with a file based store:

nats-streaming-server -store file -dir datastore

Here, datastore is any directory where the server is going to persist the state. Hope this helps!

elakito commented 8 years ago

@kozlovic I have a question regarding this behavior as I was thinking of a possible workaround without using the file mode.

The approach that I experimented with was to create a nats.Conn in advance with a custom ReconnectHandler and pass it to stan.Connect to create a stan.Conn. In this way, the reconnect handler can recreate a stan.Conn (i.e., closing the old one and creating a new one using stan.Connect) when it is invoked.

This approach works fine but I see a few issues.

  1. the client doesn't know if the server is running on the memory mode, so it can't automatically activate this behavior. (i.e., needs to be configured). It would be nice if we can detect this condition from the client side.
  2. it would be simpler if stan.Conn itself has this reinitializing the subject prefixes and reestatablishing the subscriptions etc using the old stan.Conn than creating a new stan.Conn.

I would appreciate if you could comment on this approach.

regards, aki

kozlovic commented 8 years ago

@elakito Unfortunately, several things make this approach not ideal:

  1. It may be difficult from an application perspective to recreate the underlying stan objects while the application is running. The application may be in a middle of a send, or a message callback, etc... so coding the "reconnect" behavior inside the ReconnectHandler callback may present a challenge.
  2. Although by default NATS Streaming server embeds a NATS Server, there is no requirement that this be the case. In other words, it is possible that server and client connect to a standalone NATS Server (or even each be connected to a different NATS Server that would be routed, etc..). So if the Streaming server is stopped and restarted, the client would not be disconnected from the NATS Server and therefore there would be no ReconnectHandler being invoked.

More importantly, if you run the streaming server in memory-only mode, as I said, no state is preserved. From an application perspective, this may not mean much to stay running anyway. The memory-only mode value is essentially for client or network failures.

elakito commented 8 years ago

@kozlovic Thanks. I think point 1 is probably not complicated to handle but point 2 seems to be a real problem.

In that case, one last question, what is the reason for prefixing those subjects? The reason for asking is whether dropping the prefixes might be an option when using the memory mode.

regards, aki

kozlovic commented 8 years ago

Make them unique so that if you had multiple Streaming servers on the same network, communication between a client and its server would be "private".

elakito commented 8 years ago

@kozlovic I experimented with several approaches. For the pure in-memory approach using a static prefix, I needed to add a heuristic workaround to automatically re-register the old publisher client when it publishes first time after a broker restart. With that change, the publishing part worked fine but the subscription part didn't work because I can't automatically reactivate the old subscriptions.

So, to solve this problem, I tried another approach that persists the server, client, and subscription information. With this approach, the publishing part worked fine but the subscription part worked partially because the old subscriber may be waiting for a sequence n, but the newly restarted broker may be at sequence 1. In other words, the old subscribers will miss n messages after a restart. In order to avoid this situation, I needed to persist the last sequence number so this information can be recovered after a restart.

I will investigate a little more about this approach and write to the mailing list.

Thank you for your help.

elakito commented 8 years ago

I posted a message to https://groups.google.com/forum/#!topic/natsio/V4tvf8FVJ54