Should functions be synchronized among cluster nodes?

CharlesChen888 commented 5 months ago

This issue was first raised in Redis community redis/redis#11780, it is a major miss from Redis 7.0, but haven't got a solution. I believe we still need to discuss this issue in this new project, perhaps as a part of new cluster architecture.

The problem:

Let's say:

a slot, which includes key1, is migrated from node A to node B, operated by client1
in node A a function myfunc is already loaded, but in node B it is not

Now client2, that used to handle key1 in node A, sends FCALL myfunc 1 key1 to node A, which will be redirected to node B, will then get an unexpected error. Or if node B also has a function myfunc but it does different things, client2 may trigger unexpected results.

Some thoughts:

1.Should functions be migrated together with slots? Maybe not. This may overwrite the libraries already loaded in the nodes importing slots. But we may set a version to each library so that we can always choose to keep the latest one.

2.Can functions be broadcasted to all nodes? May cause confict between nodes. And when adding new nodes we need to have a full synchronization.

hpatro commented 5 months ago

1.Should functions be migrated together with slots?

I don't think slots should be associated with functions. functions should be treated as a resource at node level. On each node addition, all functions existing in the cluster/standalone primary should be applied.

Overall, I think we should treat functions, ACLs and config(s) as resource(s) which should be applied to all the nodes and build it in the cluster revamp which we have been talking for a while.

zuiderkwast commented 5 months ago

IIRC it was decided that functions are similar to configuration and that it's the admin's responsibility to define functions on all nodes. I don't know if it's a good decision.

The official admin tool is valkey-cli, so when managing a cluster (creating a cluster, adding nodes, etc.) perhaps valkey-cli should try to synchronous functions across nodes? That would at least match the current design decisions.

It's the same problem with ACL and other config in a cluster: They are expected to be the same throughout the cluster but the cluster doesn't guarantee this.

I'm not sure I believe in a complete rewrite of the cluster bus. Usually such projects are never finished. Can we try gossipping this information instead?

madolson commented 5 months ago

IIRC it was decided that functions are similar to configuration and that it's the admin's responsibility to define functions on all nodes. I don't know if it's a good decision.

Zhao can keep me honest, but this is the only time there was a real 3-2 decision amongst the old maintainers. (It was a more broad discussion on functions, but basically they said this is what Redis wanted).

I generally want to implement a new system that is distributed configuration, that supports functions, ACLS, configs and allows modules to also store distributed configuration. This can be built ontop of the existing clusterbus extensions, but will be in its best form when built with the cluster v2. (I see Viktor is dubious, but Ping and I are highly motivated to make this happen now)

zuiderkwast commented 5 months ago

I'm motivated, but I also realize we have a lot of things going on, and I want us to make releases more often. Can we make it a focus area for 9.0? (That means start prototyping now.)

madolson commented 5 months ago

I'm motivated, but I also realize we have a lot of things going on, and I want us to make releases more often. Can we make it a focus area for 9.0? (That means start prototyping now.)

My vision, is that we finish the needed refactoring work to make cluster a module for valkey 8. Then it becomes easy to iterate and prototype in a separate repo the new clustering work. Then someone can "opt-in" to the beta of it while Valkey 8 is out and about, and we can finalize it for 9.

zuiderkwast commented 5 months ago

I think we shall start using some of the reserved bits in the clusterMsg to indicate support for new cluster bus messages. From there, we can add new message types and turn the cluster bus more into a proper raft cluster, step by step.

Easier to test, easier to get things right, if we do things in smaller steps.

PingXie commented 5 months ago

Timeline aside, this is actually a problem that can ONLY be properly solved by cluster V2 because it gives you a strongly consistent (meta)data service that manages cluster-wide metadata like cluster topology, node configs, ACLs, and Functions, etc

Cluster V1 being eventually consistent (which is not 100% true by the way) and the coupling of data serving logic and topology management logic on the same node are the source of these oddities, IMO. I also think implementing a true raft cluster inside of the Valkey server is no less work than cluster V2 itself. The coupling shouldn't be there in the first place.

I am in favor of moving the entire cluster out of the engine and into its own module and this needs to be the precursor to the real cluster V2 work. That said, I don't have a clear idea on the complexity involved so I don't know if it can happen in Valkey 8, yet. Also agreed that building cluster V2 into the engine is a non-starter.

zuiderkwast commented 5 months ago

It's very good that we're having this discussion. We need to have all the information and aspects on the table to be able to make informed decisions.

Timeline aside

Beautiful, but I don't think we shall ignore time.

Just adding module APIs normally involves a lot of work just in terms of API design, testing and documentation. This makes adding APIs much more work than adding internal functions.

Which module API functions do we need to add and what will these APIs look like? Will the cluster module keep all the cluster state in its own memory, intercept commands and sometimes return redirects for them?

cluster V2 because it gives you a strongly consistent (meta)data service that manages cluster-wide metadata like cluster topology, node configs, ACLs, and Functions, etc

So we need to have module APIs for all of those, including some event mechanism so if a user or any other module changes any of these, so the cluster module can get these changes. (Not a very small and easy job.)

The coupling shouldn't be there in the first place. Also agreed that building cluster V2 into the engine is a non-starter.

Who agreed on this? If we're using a raft library, we avoid that coupling. A node can still be a cluster node without serving any data. Agreed that moving things to a module is a guarantee when it comes to separation of concerns, but it is also much more work.

daniel-house commented 5 months ago

Can anyone suggest a use case where we would actually prefer to have different implementations of the same function on different nodes? If not, then we have the answer to the question raised by this issue.

zuiderkwast commented 5 months ago

@daniel-house You're right, it's better that they are synchronized. The open question is how.

Can you suggest how to implement it within the current cluster solution that makes sure they are synchronized on all nodes?

daniel-house commented 5 months ago

OK, we agree that they should be synchronized.

I can't say I understand the current cluster solution, but I can see that we are having problems synchronizing hostnames (https://github.com/valkey-io/valkey/issues/304), so I have serious doubts that we can synchronize functions at this time. It seems to me that a good solution would also solve the hostnames issue and the problems raised in https://github.com/valkey-io/valkey/issues/114 . If anyone can point me to useful reading about the current cluster solution, please do.

I would appreciate any pointers to more information about cluster V2. I get the general impression that it will use Raft leader election, and would be perfect for this, hostnames, and the ultimate goals of https://github.com/valkey-io/valkey/issues/114 . If so, I would suggest punting all of these issues until after cluster V2 is implemented and assigning a very high priority to cluster V2.

zuiderkwast commented 5 months ago

If anyone can point me to useful reading about the current cluster solution, please do.

The cluster spec
The cluster manual
The source code

zuiderkwast commented 5 months ago

If we want to do something like the Redis cluster v2, then I believe we can have cluster V2 in a few years, say Valkey 10 or 11, maybe not even that. Rewriting something from scratch that can't be used until everything is in place is very likely to never be launched IMO. That's why I think Redis will fail. I hope we will choose a different, incremental approach. It's not that easy but I believe it's necessary. (We haven't really discussed it enough in the Valkey core team.)

daniel-house commented 5 months ago

If anyone can point me to useful reading about the current cluster solution, please do.

The cluster spec

The cluster manual

Thanks. I've looked at those in the past, and again today. They are indeed useful. (1) Are the docs up to date? (2) The docs seem too superficial for this purpose (proposing an extension to synchronize functions). Searching those docs for host doesn't lead to information about a hostname extension as described in https://github.com/valkey-io/valkey/issues/304 . I was hoping for the design or at least the discussion/issue and PR that lead to https://github.com/valkey-io/valkey/issues/304 .

The information about a gossip protocol is helpful, but which one?

The source code

Ha, ha. Very funny. And, sadly, all I really expected to find.

According to the cluster-spec:

For every node added in the gossip section the following fields are reported:

Node ID.
IP and port of the node.
Node flags.

Why didn't we add the hostname here instead of ping/pong?

I'll read some more from those docs and the code.

daniel-house commented 5 months ago

If we want to do something like the Redis cluster v2, then I believe we can have cluster V2 in a few years, say Valkey 10 or 11, maybe not even that. Rewriting something from scratch that can't be used until everything is in place is very likely to never be launched IMO. That's why I think Redis will fail. I hope we will choose a different, incremental approach. It's not that easy but I believe it's necessary. (We haven't really discussed it enough in the Valkey core team.)

This link is something I really needed. Thankyou.

So, you think cluster V2 will never happen. Is that the consensus? I will read it in detail. Even if it is far in the future it will help me understand today's pain points. Are there any places I should subscribe to so that I can follow any discussions of cluster V2?

How about this incremental approach? Link in a high quality implementation of the Raft protocol. Define a state-machine to manage with the Raft protocol. Only put one thing in it, a list of node-IDs with hostnames. After that works, add function definitions to the state-machine. We should then be able to incrementally extend the state-machine to cover everything needed for https://github.com/valkey-io/valkey/issues/114

Alternatively, use CRDT registers (simple strings) to follow the same incremental path.

zuiderkwast commented 5 months ago

So, you think cluster V2 will never happen.

That's not what I said. (I said that if we want to do it in a certain non-incremental way, then it will be difficult.)

Is that the consensus?

No.

PingXie commented 5 months ago

I don't think we should reference back to the old cluster V2 design here in the Valkey forum. That work was started by Redis. We should take a fresh look at the problems again and start from scratch. The only connection is just the name "cluster V2". Please disregard the "Redis Cluster V2" design completely.

daniel-house commented 5 months ago

@PingXie Works for me. Is there a list of problems/requirements?

hpatro commented 5 months ago

@PingXie Works for me. Is there a list of problems/requirements?

Few of the problems/requirements:

Support higher Scaling (Larger cluster size) - Currently Valkey cluster can't scale beyond 500 nodes beyond which the gossip protocol doesn't scale well. The new architecture should strive to handle much larger cluster size.
Support smaller cluster size - With the current design, only primary nodes can vote hence the ideal cluster would require atleast 3 primaries to form a quorum. The new design should take into consideration where each node (primary/replica) should be part of the quorum.
Centralized metadata store - Currently config, functions, acl needs to be applied on each node. This is painful and cumbersome during cluster setup as well as during scale out. Building a mechanism to be able to send those information once and being applied to all of the nodes would be ideal.
Decouple control and data transfer via cluster bus - Currently, the clusterbus carries both control data (health info, gossip info, etc) as well as pubsub data. With high pubsub data, the clusterbus can get overwhelmed and cause issue with the cluster health status and can delay cluster topology information consistency in the system.
Lower failover time - Due to the primary node only consensus mechanism in the current approach, failover in a shard can be delayed if there is a network partition and other primaries aren't reachable.

PingXie commented 5 months ago

Thanks @hpatro.

Here are the problems that I think we need to solve in the current cluster:

strong consistency (for cluster topology)

cluster topology is concerned with which nodes own which slots and primaryship. The current cluster implementation is not even eventually consistent by design because there are places where node epochs are bumped without consensus (trade-offs). This leads to increased complexity on the client side.

better manageability (of global config/data)

This particular issue provides the exact context on this pain point

more resilience (to stressful client workload)

Today, both the cluster bus and the client workload run on the same main thread. So a demanding client workload has the potential to starve the cluster bus and leads to unnecessary failover.

higher scale

The V1 cluster is a mesh so the cluster gossip traffic is proportional to N^2, where N is the (data) nodes in the cluster. The practical limit of a V1 cluster is ~500 nodes.

daniel-house commented 4 months ago

I don't see how you would perform a rolling upgrade to a cluster that meets these requirements. Is that acceptable?

PingXie commented 4 months ago

I don't see how you would perform a rolling upgrade to a cluster that meets these requirements. Is that acceptable?

We haven't finalized the v1 to v2 ~~upgrade~~ migration path, including whether such a need exists or not. A seamless migration path is always great but I can also see the two cluster implementations continue to evolve on their own for a long time. So in this sense, v2 is not an accurate moniker, strictly speaking.

zuiderkwast commented 4 months ago

We haven't finalized almost anything about v2. :)

I think we need many different threads to discuss all these aspects. Shall we use a wiki for that? Or just many issues and sub-issues to discuss?

I think we should really explore the possibilities for a running cluster to auto-upgrade to a new consensus mode. For example, the node with the highest epoch can decide to switch to a raft algorithm, when all known nodes are known to support it, and start sending out new messages on a new format. When such message reach the other nodes, they switch too. We can use the existing TCP connections. (We could use UDP but it's harder to get DTLS working (OpenSSL doesn't even have DTLS 1.3 yet) than to just use TCP and TLS. But that's a different discussion.)

This sketch ↑ is just to reason that it's theoretically possible. Then we can weigh how important it is, compared to safety, simplicity, separation and other aspects.

PingXie commented 4 months ago

We haven't finalized almost anything about v2. :)

True but my statement is correct too :-)

I think we need many different threads to discuss all these aspects. Shall we use a wiki for that? Or just many issues and sub-issues to discuss?

Yeah, I will create a dedicated issue for cluster v2. Let's start from the pain points in today's cluster design/implementation. It is important to have a consensus on the problem definition first.

valkey-io / valkey

Should functions be synchronized among cluster nodes? #58

The problem:

Some thoughts: