nats-io / nats-streaming-server


Let's Talk About Clustering #316

Closed: expelledboy closed this issue 6 years ago

expelledboy commented 7 years ago

If there currently is no public discussion occurring around this topic, I would like to open the forum! :)

Following on from nats-io/node-nats-streaming#46

Just to reiterate what has already been discussed:

Currently nats-streaming-server supports clustering mechanisms such as Fault Tolerance and Partitioning, where data redundancy needs to be handled manually via an NFS mount or similar.

Apcera is looking to incorporate a Raft-like protocol, with special focus on scalability, high performance (aggregate), and high availability in the clustering solution.

@ColinSullivan1 My naive thought was: if there is a design in mind, mock out the implementation without focusing on performance, leveraging datastores that already support clustering.

Having read more into redundancy of durable event streams, I see why it's not that simple. Going to have to do more reading.

Still I am interested to hear what research has already been done, and what the main considerations are from your perspective.

ColinSullivan1 commented 7 years ago

@expelledboy , thank you for creating this issue and opening the forum!

Some research has been done around using CockroachDB, but that was in a different context and discarded due to performance considerations. That being said, it may work for a store where performance is not a requirement, but scalability is.

Stepping back to requirements, I can see something like this addressing the IoT use case described in #168: supporting many millions of subscriptions (channels, internal to the NATS streaming server), where each device might have its own subscription. For a store backed by a DB or big-data technology, this is feasible. The file-based store is extremely fast, but doesn't scale horizontally as well as this would, due to resource constraints.

I described one use case, but there are many others this could solve... another benefit: if the backing store technology provides a way to introspect it, one could analyze data stored in the NATS streaming server logs. Or, rather than scaling wide, provide depth in housing vast amounts of historical data to replay and analyze.

Going back to the technology-related concerns, we do have an issue today in that a client can only be serviced by exactly one streaming server at a time. This can be addressed through the existing partitioning features. High availability would be offered by the existing fault tolerance feature. With those features, and a consistent, replicated back-end store, this would be a solid addition to NATS streaming.

The low-level design is already (mostly) ironed out; the first place to start mocking something out would be to implement a store using the store interface, and build the streaming server supporting a new store type. You'd need a golang API into the backing datastore.
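
For illustration, a first cut at that step might look like the sketch below. The interface shown is a simplified stand-in for the server's actual store interfaces (the stores package has the authoritative definitions), and `DBStore`, its constructor, and the SQL schema are all hypothetical:

```go
package stores

import (
	"database/sql"

	_ "github.com/lib/pq" // hypothetical choice: a Postgres-backed store
)

// MsgStore here is a simplified stand-in for the server's real store
// interfaces; see the stores package for the authoritative definitions.
type MsgStore interface {
	Store(subject string, data []byte) (seq uint64, err error)
	Lookup(seq uint64) ([]byte, error)
	Close() error
}

// DBStore is a hypothetical store backed by a SQL datastore.
type DBStore struct {
	db *sql.DB
}

func NewDBStore(dsn string) (*DBStore, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	return &DBStore{db: db}, nil
}

// Store assumes a hypothetical messages(seq, subject, data) table where
// seq is a database-generated sequence.
func (s *DBStore) Store(subject string, data []byte) (uint64, error) {
	var seq uint64
	err := s.db.QueryRow(
		`INSERT INTO messages (subject, data) VALUES ($1, $2) RETURNING seq`,
		subject, data).Scan(&seq)
	return seq, err
}

func (s *DBStore) Lookup(seq uint64) ([]byte, error) {
	var data []byte
	err := s.db.QueryRow(`SELECT data FROM messages WHERE seq = $1`, seq).Scan(&data)
	return data, err
}

func (s *DBStore) Close() error { return s.db.Close() }
```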

IMO the key to success here would be to identify the best backing store type that can be replicated, is fault tolerant, etc., with performance that is sufficient for most use cases. I'd love to hear thoughts from you and the community on what may be best.

expelledboy commented 7 years ago

I'd never heard of CockroachDB; it looks like it's going to be a solid product in a couple of years. Imagining you picked CockroachDB, could you provide a brief overview of how you would integrate it?

Have you considered ScyllaDB?

codmajik commented 7 years ago

@ColinSullivan1 @expelledboy Have you considered BoltDB as a more advanced/powerful version of the file store? It is also introspectable, which meets some of the items you raised above.

ColinSullivan1 commented 7 years ago

@codmajik, @expelledboy, thank you for the suggestions. IMO both have their merits. BoltDB could provide the scalability requirement and would likely perform well, but users would still have to craft their own backup solution. This sounds like a great option for scaling with the fault tolerance / availability features we have today: an alternative to the file-based option. From a user/operator perspective, it'd be very straightforward to set up, adhering to the NATS tenet of simplicity.
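
To make that concrete, here is a minimal sketch of what message persistence on BoltDB could look like: transactional writes into a per-channel bucket keyed by sequence number. The bucket layout is hypothetical, and the import path below is the maintained bbolt fork:

```go
package main

import (
	"encoding/binary"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Single-file store, like the existing file store, but transactional
	// and introspectable with off-the-shelf bbolt tooling.
	db, err := bolt.Open("streaming.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		// Hypothetical layout: one bucket per channel, keyed by sequence.
		b, err := tx.CreateBucketIfNotExists([]byte("channel.foo"))
		if err != nil {
			return err
		}
		seq, _ := b.NextSequence()
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, seq)
		return b.Put(key, []byte("message payload"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```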

ScyllaDB looks good as well. I understand it performs well and already supports clustering, and it has been well received by other NATS users. One issue to consider is preventing multiple instances of streaming servers from concurrently writing to the store. Enhancing the existing fault tolerance mechanism to use the database, rather than a file, may be an option to safeguard against that scenario. Setup, table creation, etc., would certainly need to be streamlined.
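
One common way to get that safeguard from the database itself is a single lock row that the active server claims and periodically renews. A minimal sketch, assuming a hypothetical ft_lock(id, owner, expires_at) table:

```go
package store

import (
	"database/sql"
	"time"
)

// tryAcquireLock claims the "active server" role by updating a single
// lock row. The ft_lock(id, owner, expires_at) table is hypothetical.
// The active server must re-run this on an interval shorter than ttl;
// a standby acquires the lock only once the previous owner's entry expires.
func tryAcquireLock(db *sql.DB, owner string, ttl time.Duration) (bool, error) {
	res, err := db.Exec(`
		UPDATE ft_lock
		   SET owner = $1, expires_at = $2
		 WHERE id = 1 AND (owner = $1 OR expires_at < $3)`,
		owner, time.Now().Add(ttl), time.Now())
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```

The expiring lease plays the same role the FT lock file plays today: only one writer at a time, with automatic takeover if the active server dies.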

In either case, implementing a store is fairly straightforward. Naturally, the corner cases are in locking, error handling, timeouts, etc.

The new store would need to implement the store interfaces defined in the stores package.

I'd suggest looking at the default memory-based store, memstore, as a starting point.

The store would need to be created, command-line parameters added, and the store type and options parsed (a rough sketch follows).
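
Concretely, that wiring might look something like this; the "db" store type and the -store_dsn flag are hypothetical additions alongside the existing options:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// Hypothetical flags for selecting and configuring a new store type,
	// alongside the existing memory and file store options.
	storeType := flag.String("store", "memory", "store type: memory|file|db")
	storeDSN := flag.String("store_dsn", "", "datastore connection string (db store only)")
	flag.Parse()

	switch *storeType {
	case "db":
		if *storeDSN == "" {
			log.Fatal("-store_dsn is required for the db store")
		}
		// store, err := stores.NewDBStore(*storeDSN) ... (hypothetical constructor)
		log.Printf("would open db store at %s", *storeDSN)
	case "memory", "file":
		log.Printf("existing %s store", *storeType)
	default:
		log.Fatalf("unknown store type %q", *storeType)
	}
}
```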

@kozlovic , any thoughts here?

Great feedback so far!

kozlovic commented 7 years ago

@ColinSullivan1 very well summarized.

To make sure we are on the same page: if we talk about adding a DB store with the current Streaming design, either standalone or with channels partitioning, the store is still accessed by a single Streaming server at any given time. The FT case requires that the storage be accessible by all servers in the same FT group, but only one server in the group is active. The storage needs to implement an "exclusive lock" to ensure that guarantee.

Implementing a DB store shared by a cluster of Streaming servers would indeed require changes to the Streaming server code itself, and it is not that simple. Some of the considerations:

Are clients connecting to the cluster or to a single server? Honoring the unique client ID means that the server would not be allowed to maintain this state solely in memory but would have to rely on the DB to know that a client already exists with the given client ID.

The internal subjects used by the server would have to be made private so that when a given client sends a message, it goes to only one server and messages are not written multiple times.

A subscription created on server A would have to be serviced even if a publisher connects and sends on server B. That would require some notification between servers in the cluster.

Failure of a server in the cluster would not be detected by a client, since there is no direct connection between streaming clients and servers. The state would have to be transferred to another server in the cluster (connection, subscriptions, etc.).
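
On the unique client ID point above: a shared database can enforce that check atomically, for example with a uniqueness constraint. A small sketch, assuming a hypothetical clients table with a UNIQUE constraint on client_id:

```go
package store

import "database/sql"

// registerClient relies on the database to enforce cluster-wide client ID
// uniqueness rather than per-server in-memory state. Assumes a hypothetical
// clients(client_id UNIQUE, hb_inbox) table; a unique-violation error from
// the INSERT means a client with this ID already exists somewhere.
func registerClient(db *sql.DB, clientID, hbInbox string) error {
	_, err := db.Exec(
		`INSERT INTO clients (client_id, hb_inbox) VALUES ($1, $2)`,
		clientID, hbInbox)
	return err
}
```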

expelledboy commented 7 years ago

@kozlovic Thank you. This is more what I wanted to know. Sounds like you have a good idea of what needs to be done, and more specifically which DB features are necessary and which are "nice to have". With a spec, we could evaluate the options?

tylertreat commented 7 years ago

FYI I am in the very early stages of beginning clustering work. The plan is to not push replication responsibility onto the persistence layer and instead implement a Raft-based replication protocol in NATS Streaming.

dahankzter commented 7 years ago

Are we there yet?

Jokes aside, I find this very interesting to follow, and if you have some musings or thoughts to share as you work on it, I would very much like to hear them.

tylertreat commented 7 years ago

Progress is definitely being made. To give you an idea of where we're at: I currently have a working POC that is replicating messages via Raft. Currently, it's using HashiCorp's implementation of Raft with a NATS transport implementation I wrote. My (optimistic) goal is to have an early MVP by the end of September.
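
For anyone curious what that shape looks like, below is a minimal hashicorp/raft setup with a toy FSM. It uses the library's in-memory transport where the POC plugs in a NATS-based one, and the FSM is illustrative, not the streaming server's actual state machine:

```go
package main

import (
	"errors"
	"io"
	"log"
	"sync"

	"github.com/hashicorp/raft"
)

// msgLog is a toy FSM: every committed Raft entry is appended as a
// replicated message. The real state machine is far richer; this only
// shows the moving parts.
type msgLog struct {
	mu   sync.Mutex
	msgs [][]byte
}

func (m *msgLog) Apply(l *raft.Log) interface{} {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.msgs = append(m.msgs, l.Data)
	return nil
}

func (m *msgLog) Snapshot() (raft.FSMSnapshot, error) {
	return nil, errors.New("snapshots not implemented in this sketch")
}

func (m *msgLog) Restore(rc io.ReadCloser) error { return rc.Close() }

func main() {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID("node-a")

	// In-memory log, stable store, snapshots, and transport keep the
	// sketch self-contained; the POC swaps the transport for one that
	// carries Raft RPCs over NATS.
	logs := raft.NewInmemStore()
	stable := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()
	_, transport := raft.NewInmemTransport("")

	r, err := raft.NewRaft(cfg, &msgLog{}, logs, stable, snaps, transport)
	if err != nil {
		log.Fatal(err)
	}
	// Single-node bootstrap so this node can elect itself leader and then
	// replicate entries via r.Apply(...).
	r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: cfg.LocalID, Address: transport.LocalAddr()}},
	})
}
```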

ColinSullivan1 commented 7 years ago

@dahankzter , @tylertreat is making incredible progress! We're still ironing out exactly what features will be in it, but as Tyler said, something should be available for you to test drive around that time. Based on feedback from his early work, we'll then figure out a realistic beta date.

dahankzter commented 7 years ago

Excellent news! I can test whenever it starts to make sense.

ikaros commented 7 years ago

Hi 🙂 – first of all – thanks for the great work 👍

I just wanted to ask how's the progress on this :)

tylertreat commented 7 years ago

@ikaros You can follow progress in this integration branch: https://github.com/nats-io/nats-streaming-server/tree/clustering_mvp_integration

Summary status update: I didn't quite hit my end-of-September goal, but a lot of progress has been made. :) It's definitely getting close, though of course extensive testing will be needed before we consider releasing anything official; I'm also hoping folks will help "beta" test. The snapshotting/catchup should be less work for the remaining items than it was for messages.

nick-zh commented 6 years ago

@tylertreat is there a rough ETA for clustering?

tylertreat commented 6 years ago

@nick-zh End of Q4 for a pre-production MVP.

rhzs commented 6 years ago

Hi @tylertreat it's new year now 🎉 🎉 🎉 any news?

ColinSullivan1 commented 6 years ago

@incubus8 , Thank you for following up - we're taking a look at where we are now, and will provide an update in the next few weeks or so.

expelledboy commented 6 years ago

Thank you @tylertreat !!

I had a quick skim through the code, and was wondering whether any performance tests have been done?

kozlovic commented 6 years ago

@expelledboy Locally, yes. We are planning on running some tests in the cloud. As always, in terms of performance, you would be better off setting up your own test that closely matches what your real deployment and usage would look like.