lni / dragonboat

A feature complete and high performance multi-group Raft library in Go.

english https://github.com/lni/dragonboat/blob/master/docs/overview.CHS.md #283

Closed kolinfluence closed 1 year ago

kolinfluence commented 1 year ago

This document provides an overview of the Dragonboat library. It was originally written in Chinese; an English translation is being worked on and will be available soon.

After reading this document, you will be able to use the Dragonboat library to implement various functions and components that require strong consistency in your project. It is assumed that you are familiar with Go programming, but you do not need to have a detailed understanding of Raft or any other distributed consensus algorithm.

As an introductory engineering document, this document intentionally avoids theoretical discussion and descriptions of underlying principles. Users interested in that area are encouraged to refer to the various papers on distributed consensus algorithms.

Distributed consensus algorithm

If you already have enough knowledge about distributed consensus algorithms such as Raft, you can skip this section.

In order to avoid a single point of failure, we want data to have multiple replicas stored on servers distributed across multiple regions. This way, when one or more servers crash, there is a good chance that the data remains online and the service continues to be available, which is commonly known as high availability. Storing multiple replicas on multiple servers also provides higher overall read bandwidth, which suits internet applications with high read/write ratios well.

Taking data with three replicas as an example: if we do not care whether the three replicas are consistent, then as long as any one replica is online we can consider the data available. However, this approach makes the replicas inconsistent with each other and pushes the resulting huge design and implementation burden onto application developers; for many back-end applications it has largely been abandoned. If instead we require that at least two of the three replicas are available, a consensus algorithm can make the three replicas appear to the outside world as a single consistent, highly available copy of the data. Almost all major internet companies have, to varying degrees, started using consensus algorithms to achieve high availability together with replica data consistency.

In order to provide such data consistency and high availability while keeping application development convenient, consensus algorithms use a system model called the replicated state machine. The user's application is abstracted as a state machine, and updates to the state machine's state are driven by requests submitted by users, called proposals. The consensus algorithm ensures that when multiple users submit proposals concurrently, a proposal is adopted (committed) once it has been received and persistently saved by a majority of the replicas. On every replica, each adopted proposal is assigned the same index number and recorded in an ordered data structure called the Raft Log. The adopted proposals are then applied one by one, in that established order, to update the state of the user's application state machine. When multiple state machine replicas start from the same initial state and are updated strictly according to the contents of their local logs, then after executing the same adopted proposal each replica will again be in the same state.

As shown in the figure below, this example Raft group has three distributed replicas. The state of the state machine is an integer variable X, and each replica maintains a copy of the Raft Log through the Raft protocol. The values and order of all adopted proposals in each Raft Log replica are strictly the same. They will be applied to update the state machine state in ascending order of index value. After each replica applies the proposal record corresponding to the same index value in the Raft Log, the state of each replica will be the same.

[Figure: a replicated state machine (RSM) with three replicas sharing an identical Raft Log]

The consistency of read operations needs to satisfy the following two points:

1. Once a write request has been adopted in the form of a proposal, a read operation must be able to return the result of that write, or a newer result.
2. Once a read request has returned a result, any subsequently started read request must return that result, or a newer one.

The main job of a consensus library is to implement consensus algorithms such as Raft and to provide programming interfaces that let users easily implement their application state machines, so that the library's state machine read/write functions can be used to operate on the application state machine and its data, giving the user's own application logic high availability and data consistency.

The following sections will discuss how to use the various conveniences provided by Dragonboat to easily build applications based on the Raft distributed consensus algorithm.

Terms:

- Raft Group: An independent entity controlled by the Raft protocol, consisting of multiple replicas that together provide the consistency guarantees described in the previous section. An application can use one or more Raft groups. Each Raft group is identified by a user-set 64-bit integer called ShardID, which is globally unique within the system.
- Cluster / Shard: Other terms for a Raft group.
- Node: A member replica in a Raft group. Each node is identified by a user-set 64-bit integer called ReplicaID, which is unique within the Raft group.
- Initial Member: A member of the original member set with which a Raft group first comes into existence.
- Leader: The node playing the leader role defined by the Raft protocol. Each Raft group should have one leader node, and reads and writes against the group can only be performed once the leader node has been determined.
- Snapshot: The data obtained by fully saving the state of a state machine at a specific point in time, which can be used to quickly restore the state machine's state.

State Machine: The state machine is the core of a user's application, which implements the business logic of the user's application. For example, if you want to build an in-memory key-value database like Redis, your state machine would be such a KV database. The state machine is also a major interface between the user's application and Dragonboat. The Dragonboat library interacts with the state machine through the IStateMachine or IOnDiskStateMachine interfaces implemented by the state machine to perform operations such as updating and querying the state machine's state.

IStateMachine: An ordinary state machine typically stores its data in memory, so the total amount of data it manages is relatively small. The state of an ordinary state machine is completely reset after each node restart: any data it has itself stored on disk must be cleaned up or ignored, and the state is then restored from the persisted Raft Log and Snapshots. Its characteristics are that the amount of data is limited by memory size, the implementation is simple, and throughput can reach millions of operations per second, but the extra CPU, IO, and disk-space costs of periodic snapshotting and of rebuilding the state after every restart are relatively large. The most common example of this type of state machine is an in-memory key-value database like Redis. This is the type of state machine usually described in the Raft paper.

Users implement this type of ordinary state machine by implementing the IStateMachine interface in the statemachine package. The state of such a state machine must be empty after each restart; Dragonboat is responsible for restoring it through the Update and RecoverFromSnapshot methods. Note that although the state of an ordinary state machine is reset on restart, no data is lost: all accepted user data is contained in the persisted Snapshots and Raft Log.
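For illustration, here is a minimal sketch of such an ordinary state machine: an in-memory counter. It assumes the v4 module path (github.com/lni/dragonboat/v4) and the Entry-based Update signature of recent releases; the statemachine package godoc is authoritative for the exact interface.

```go
package example

import (
	"encoding/binary"
	"io"

	sm "github.com/lni/dragonboat/v4/statemachine"
)

// CounterStateMachine is a minimal in-memory state machine whose entire
// state is a single uint64 counter.
type CounterStateMachine struct {
	count uint64
}

// NewCounterStateMachine matches the factory signature expected by
// NodeHost.StartReplica.
func NewCounterStateMachine(shardID uint64, replicaID uint64) sm.IStateMachine {
	return &CounterStateMachine{}
}

// Update applies one committed proposal; here every proposal simply
// increments the counter. It must be deterministic on every replica.
func (s *CounterStateMachine) Update(e sm.Entry) (sm.Result, error) {
	s.count++
	return sm.Result{Value: s.count}, nil
}

// Lookup serves read queries; it must not modify the state.
func (s *CounterStateMachine) Lookup(query interface{}) (interface{}, error) {
	return s.count, nil
}

// SaveSnapshot writes the full state machine state to w.
func (s *CounterStateMachine) SaveSnapshot(w io.Writer,
	fc sm.ISnapshotFileCollection, done <-chan struct{}) error {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, s.count)
	_, err := w.Write(buf)
	return err
}

// RecoverFromSnapshot restores the state machine state from r.
func (s *CounterStateMachine) RecoverFromSnapshot(r io.Reader,
	files []sm.SnapshotFile, done <-chan struct{}) error {
	buf, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	if len(buf) < 8 {
		return io.ErrUnexpectedEOF
	}
	s.count = binary.LittleEndian.Uint64(buf)
	return nil
}

// Close releases resources held by the state machine; nothing to do here.
func (s *CounterStateMachine) Close() error { return nil }
```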

IOnDiskStateMachine: The main data of a disk-based state machine is stored on disk, so its total data volume can be relatively large. The state machine's state is always persisted on disk and is unaffected by node restarts. Dragonboat still saves snapshots periodically, but such snapshots contain only a small amount of metadata, so their creation cost is extremely small. Compared with the ordinary state machine described above, its extra IO overhead for saving snapshots is tiny, and there is no need to rebuild the state after a restart. A common example of this type of state machine is a strongly consistent multi-replica distributed KV database built on RocksDB or LevelDB. It can be considered the special type of state machine mentioned in Section 5.2 of the Raft paper.

Users implement a disk-based state machine by implementing the IOnDiskStateMachine interface in the statemachine package. The user is responsible for the persistent storage of the state machine's state and for supporting concurrent reads and writes against it, and must also record the index value of the most recently applied log entry. Dragonboat is only responsible for opening the saved state machine after each restart.
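For reference, a rough restatement of the shape of that interface, with signatures as recalled from recent v4 releases (the statemachine package godoc is authoritative): Open reports the index of the last applied entry, and Update receives entries in batches.

```go
package example

import (
	"io"

	sm "github.com/lni/dragonboat/v4/statemachine"
)

// onDiskShape restates the shape of sm.IOnDiskStateMachine for reference.
type onDiskShape interface {
	// Open opens the existing on-disk state and returns the index of
	// the last Raft log entry that has already been applied.
	Open(stopc <-chan struct{}) (uint64, error)
	// Update applies a batch of committed entries; the state machine
	// itself must persist the index of the last applied entry.
	Update(entries []sm.Entry) ([]sm.Entry, error)
	// Lookup performs a read-only query against the state machine.
	Lookup(query interface{}) (interface{}, error)
	// Sync persists all in-core state to stable storage.
	Sync() error
	// PrepareSnapshot captures a consistent view of the state so that
	// SaveSnapshot can run concurrently with Update.
	PrepareSnapshot() (interface{}, error)
	// SaveSnapshot streams the state identified by the view returned
	// from PrepareSnapshot into w.
	SaveSnapshot(ctx interface{}, w io.Writer, done <-chan struct{}) error
	// RecoverFromSnapshot rebuilds the on-disk state from r.
	RecoverFromSnapshot(r io.Reader, done <-chan struct{}) error
	// Close releases resources held by the state machine.
	Close() error
}
```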

Selection of the Two Types of State Machines: The most important criterion for choosing between the two kinds of state machine is the total amount of data managed by the state machine. When all state machine data fits in memory, say within tens of gigabytes, an in-memory state machine is recommended. The disk-based state machine can be regarded as an optimization that avoids the extra overhead described above when a state machine manages large amounts of data.

Node Startup: Before a node can be used, it must be started so that it is loaded and managed by the NodeHost. The StartReplica, StartConcurrentReplica, and StartOnDiskReplica methods of the NodeHost are used to start the corresponding kinds of nodes.

When the initial members of a Raft shard are started for the first time, the user must provide the complete initial membership of that shard, and every replica must be started with exactly the same initial member information. This initial membership ensures that all replicas evolve from one consistent member list through whatever membership changes the user later requests. When a replica is not an initial member but was added later through a membership change (for example via SyncRequestAddReplica), no initial member information is needed the first time it is started; only the join parameter needs to be set to true.

When a node restarts, whether it is an initial node or a node added later through a member change, it does not need to provide initial member information again, nor does it need to set the join parameter to true.
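A sketch of the startup flow under these rules, assuming the v4 API and reusing the CounterStateMachine factory from the earlier sketch; all addresses, paths, and ID values below are made up for illustration.

```go
package example

import (
	"github.com/lni/dragonboat/v4"
	"github.com/lni/dragonboat/v4/config"
)

func startNode() error {
	nhc := config.NodeHostConfig{
		NodeHostDir:    "/tmp/dragonboat-example", // hypothetical path
		RTTMillisecond: 200,
		RaftAddress:    "node1.example.com:63000", // hypothetical address
	}
	nh, err := dragonboat.NewNodeHost(nhc)
	if err != nil {
		return err
	}
	rc := config.Config{
		ShardID:            100, // user-chosen, globally unique Raft group ID
		ReplicaID:          1,   // unique within the Raft group
		ElectionRTT:        10,
		HeartbeatRTT:       1,
		CheckQuorum:        true,
		SnapshotEntries:    100000, // snapshot every N applied entries
		CompactionOverhead: 10000,
	}
	// First start of an initial member: every initial replica must be
	// given exactly the same initialMembers map.
	initialMembers := map[uint64]dragonboat.Target{
		1: "node1.example.com:63000",
		2: "node2.example.com:63000",
		3: "node3.example.com:63000",
	}
	return nh.StartReplica(initialMembers, false, NewCounterStateMachine, rc)
	// A replica added later via membership change is instead started as
	//   nh.StartReplica(nil, true, NewCounterStateMachine, rc)
	// and a restarting replica (initial member or not) as
	//   nh.StartReplica(nil, false, NewCounterStateMachine, rc)
}
```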

kolinfluence commented 1 year ago

Node Stop: Users can stop the locally managed replica of a specified Raft shard using the StopShard method of the NodeHost. Once stopped, the node no longer responds to read or write requests, but it can be restarted using the node startup methods described above.

After a replica has been asked to stop via StopShard, if it is in the middle of creating or restoring a snapshot, it may not stop immediately; it may have to wait until the snapshot creation or restoration completes. To avoid such long waits, the user-implemented snapshot creation and restoration methods receive a parameter of type <-chan struct{} that is closed when the node is requested to stop. The snapshot creation and restoration code can watch it to decide whether to abandon the current snapshot work and respond quickly to the stop request.
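A sketch of a SaveSnapshot implementation that honors this stop channel. The LargeStateMachine type and its chunked serialization are hypothetical; sm.ErrSnapshotStopped is the statemachine package's sentinel for a deliberately abandoned snapshot.

```go
package example

import (
	"io"

	sm "github.com/lni/dragonboat/v4/statemachine"
)

// LargeStateMachine is a hypothetical state machine whose snapshot is
// written out in multiple chunks.
type LargeStateMachine struct {
	chunks [][]byte
}

// SaveSnapshot checks the done channel between chunks and abandons the
// snapshot once the node has been asked to stop. Returning
// sm.ErrSnapshotStopped tells Dragonboat the snapshot was aborted
// deliberately rather than having failed.
func (s *LargeStateMachine) SaveSnapshot(w io.Writer,
	fc sm.ISnapshotFileCollection, done <-chan struct{}) error {
	for _, chunk := range s.chunks {
		select {
		case <-done:
			return sm.ErrSnapshotStopped
		default:
		}
		if _, err := w.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}
```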

Write Operations: A write operation on the state machine is called a proposal. Users can initiate asynchronous or synchronous proposals using NodeHost's Propose and SyncPropose methods.

As is standard for consensus algorithms, a proposal submitted by a user can only be accepted, and eventually delivered to all replicas of the state machine for execution, if a majority of the members of the Raft group are online and can exchange network messages with each other. Dragonboat hands each accepted proposal to the state machine interface's Update method for sequential execution.

NodeHost's Propose method starts an asynchronous proposal and returns immediately; on success it provides a RequestState object, which the user can use to wait for the proposal's outcome and obtain its result status. A successfully initiated proposal can end in success, timeout, or failure. Success means that the proposal has been accepted and applied on the current node of this NodeHost; all subsequent read operations will observe the result of this proposal or a newer result. Timeout means the whole proposal process could not complete within the specified time and the proposal's status is unknown, the typical three-state situation in distributed systems. Failure means that the parameters given to Propose were invalid or that the node was closed before the proposal process could finish.

NodeHost's SyncPropose method starts a synchronous proposal, and the caller's goroutine is suspended until SyncPropose returns a clear success, timeout, or error result. It is currently a wrapper for the Propose method, and the difference between the two implementations can be seen by examining the source code.
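A sketch of both styles, assuming the v4 API (RequestState's ResultC channel and the Completed/Timeout/Terminated accessors of RequestResult, as recalled from recent releases); the command payloads are placeholders.

```go
package example

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

func proposeExample(nh *dragonboat.NodeHost, shardID uint64) error {
	// A NO-OP session carries no idempotence support; it only
	// identifies the target shard (see the session discussion below).
	session := nh.GetNoOPSession(shardID)

	// Asynchronous proposal: Propose returns immediately with a
	// RequestState that the caller waits on.
	rs, err := nh.Propose(session, []byte("cmd"), 3*time.Second)
	if err != nil {
		return err // failure: bad arguments or the node is closed
	}
	defer rs.Release()
	r := <-rs.ResultC()
	switch {
	case r.Completed():
		_ = r.GetResult() // the sm.Result produced by the state machine's Update
	case r.Timeout():
		// three-state outcome: the proposal may or may not have been applied
	case r.Terminated():
		// the node was closed before the proposal completed
	}

	// Synchronous proposal: SyncPropose blocks until a definite
	// success, timeout, or error.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_, err = nh.SyncPropose(ctx, session, []byte("cmd"))
	return err
}
```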

Because a write-operation timeout leads to the three-state situation above, when retrying a timed-out write request users must take care that, if the timed-out operation did in fact complete successfully, re-executing the write will not have a negative impact on the system. This property is called idempotence. Users can of course implement their own idempotence handling in their state machine design; alternatively, for regular state machines they can use Dragonboat's built-in idempotence support, which is based on the scheme described in the Raft paper.

Simply put, the Propose and SyncPropose methods both take a parameter called a Session, a session used by the client when writing to the state machine. When the user chooses not to use Dragonboat's built-in idempotence feature, the GetNoOPSession method of NodeHost returns a NO-OP session, which merely indicates the ShardID of the Raft group the write targets. If built-in idempotence support is wanted, GetNewSession can be used to obtain a real Session object, which the client then passes on every Propose or SyncPropose call. After a Propose or SyncPropose succeeds, the Session's ProposalCompleted method must be called; if a Propose or SyncPropose times out, ProposalCompleted must not be called, and Propose or SyncPropose is simply called again to retry the timed-out proposal.

Disk-based state machines do not support this built-in idempotence feature and must use a NOOP session. If necessary, users must implement their own idempotence support within the state machine.
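A sketch of the session-based retry pattern described above. Recent releases expose session creation and closing as the synchronous SyncGetSession and SyncCloseSession (the GetNewSession name above comes from older releases); treat the exact method names as assumptions to verify against the godoc.

```go
package example

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

func sessionExample(nh *dragonboat.NodeHost, shardID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	session, err := nh.SyncGetSession(ctx, shardID)
	if err != nil {
		return err
	}
	defer func() {
		cctx, ccancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer ccancel()
		_ = nh.SyncCloseSession(cctx, session)
	}()
	for _, cmd := range [][]byte{[]byte("a"), []byte("b")} {
		for {
			pctx, pcancel := context.WithTimeout(context.Background(), 3*time.Second)
			_, err := nh.SyncPropose(pctx, session, cmd)
			pcancel()
			if err == dragonboat.ErrTimeout {
				// On timeout, retry the SAME proposal without calling
				// ProposalCompleted; the session lets the shard
				// deduplicate it if it was in fact already applied.
				continue
			}
			if err != nil {
				return err
			}
			// Only after a definite success is the proposal marked done.
			session.ProposalCompleted()
			break
		}
	}
	return nil
}
```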

The user's input data for a write operation should be the only source of input for updating the state machine. When the Update method is called to update the state machine's state, non-deterministic inputs such as the system time, random numbers, or the current process ID must not be used on any replica.

By default, a Proposal only notifies the client after it has been confirmed to be accepted and successfully applied to the state machine. For applications with special needs, Dragonboat can be configured through the NotifyCommit parameter of NodeHostConfig to notify the client separately after a Proposal has been accepted.

Read Operations: Read operations on the state machine are used to query the state machine's contents and status; a read does not change the state machine's state. Reads must go through a protocol that ensures consistency, rather than directly reading the contents of the state machine. Dragonboat implements efficient consistent reads using the ReadIndex protocol described in the Raft paper. Users can read the state machine either with NodeHost's SyncRead or with the combination of ReadIndex and ReadLocalNode; the former is synchronous, the latter asynchronous.

As with write operations, the ReadIndex protocol that reads depend on can only complete successfully if more than half of the members of the Raft group are online and able to exchange network messages normally. Dragonboat executes the user's query by calling the state machine interface's Lookup method and returns the query result.

The ReadIndex method of NodeHost starts an asynchronous run of the ReadIndex protocol and immediately returns a RequestState object, which the user can use to wait for the protocol to complete. Once it has completed successfully, the user can immediately call the ReadLocalNode method to query the state machine on the same node. As with the Propose method described earlier, a successfully initiated ReadIndex run has three possible outcomes: success, timeout, and failure, with timeout and failure meaning the same as for Propose. ReadLocalNode may only be called after a successful ReadIndex; calling ReadLocalNode on its own destroys the consistency guarantee of the returned results.

The SyncRead method of NodeHost performs a synchronous read operation. This method only returns after SyncRead has returned a clear query result or confirmed a timeout or failure. It is a wrapper for the ReadIndex and ReadLocalNode methods, and the specific implementation can be found in the source code.
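A sketch of both read paths, assuming the v4 signatures of SyncRead, ReadIndex, and ReadLocalNode as recalled; the query value is a placeholder.

```go
package example

import (
	"context"
	"errors"
	"time"

	"github.com/lni/dragonboat/v4"
)

// syncReadExample: the synchronous path. SyncRead runs the ReadIndex
// protocol and then queries the local state machine via its Lookup method.
func syncReadExample(nh *dragonboat.NodeHost, shardID uint64) (interface{}, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	return nh.SyncRead(ctx, shardID, "query")
}

// asyncReadExample: ReadIndex followed by ReadLocalNode.
func asyncReadExample(nh *dragonboat.NodeHost, shardID uint64) (interface{}, error) {
	rs, err := nh.ReadIndex(shardID, 3*time.Second)
	if err != nil {
		return nil, err
	}
	defer rs.Release()
	// ReadLocalNode may only be called after ReadIndex has completed
	// successfully; otherwise the consistency guarantee is lost.
	if r := <-rs.ResultC(); !r.Completed() {
		return nil, errors.New("readindex timed out or failed")
	}
	return nh.ReadLocalNode(rs, "query")
}
```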

Because reading operations do not change the state machine state, there is no problem of duplicate writing updates after a reading operation timeout, and therefore no consideration of idempotent operations is needed.

The aforementioned read operations use interface{} as the input and output data, which causes additional heap allocations. For read-intensive applications, users can implement the optional IExtended interface on the state machine and perform reads with NAReadLocalNode, which uses []byte for input and output and avoids the extra heap allocations.

NodeHost also provides a method called StaleRead which, as the name suggests, provides no consistency guarantee and simply reads the current state of the local state machine.

Membership Change: For a Raft group, a membership change modifies the composition of its member nodes, for example adding a new node or removing a failed one. Membership changes are transparent to the user's state machine and require no action inside it. The user only needs to call NodeHost's RequestAddReplica and RequestDeleteReplica methods to add or remove nodes.

The following are the notes on membership change:

- A membership change is itself a proposal under the Raft protocol. Like an ordinary user proposal, an adopted membership change is persistently stored and replicated to the replica nodes.
- Nodes that have been deleted are not allowed to be added back to the Raft group.
- A membership change only updates the system's record of the Raft group's membership; the added or removed nodes must still actually be started or stopped by the user.
- For each Raft group, membership changes must be performed one at a time; at any moment only one pending membership change request may exist.
- Membership change operations executed one by one by the same thread are idempotent. For example, if a request to add a node times out and is retried, no unexpected result will occur.
- After a node has been removed through a membership change, NodeHost's RemoveData method can be used to delete all of the node's data and free disk space. Use this with caution: calling RemoveData on a node that has not been removed through a membership change causes irreparable damage to the Raft group.
- In most cases the application executes membership changes one by one from a single thread, and they are then idempotent as described above. When this cannot be guaranteed, before every membership change request (including retries), the GetShardMembership method can be used to obtain the Raft group's current Membership record; after deciding on a change based on that membership, pass the ConfigChangeID value it returned when calling the membership change API, which keeps membership changes idempotent even when issued concurrently from multiple threads (see the sketch below).
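A sketch of that ConfigChangeID-guarded flow, using the synchronous variants (SyncGetShardMembership, SyncRequestAddReplica) as found in recent v4 releases; exact method names should be verified against the godoc, and the target address is hypothetical.

```go
package example

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

// addReplicaExample adds a replica in a way that stays safe even when
// several threads may issue membership changes concurrently.
func addReplicaExample(nh *dragonboat.NodeHost, shardID uint64,
	newReplicaID uint64, target string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// Read the current membership first.
	m, err := nh.SyncGetShardMembership(ctx, shardID)
	if err != nil {
		return err
	}
	// Decide based on m.Nodes whether the change still makes sense,
	// then submit it together with the observed ConfigChangeID.
	if err := nh.SyncRequestAddReplica(ctx, shardID,
		newReplicaID, target, m.ConfigChangeID); err != nil {
		return err
	}
	// The membership record is now updated, but the new replica still
	// has to be started by the user, with join set to true:
	//   nh.StartReplica(nil, true, NewCounterStateMachine, rc)
	return nil
}
```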

Snapshot: A snapshot is the saved state of a node at a specific point in time. A snapshot contains the following information:

- the index of the proposals that have been executed,
- the current membership information of the Raft group,
- the currently active user session information,
- the current state of the user state machine.

Disk-based state machines do not support user sessions, and their current state always lives on disk; therefore the last two items above are not included in the snapshots generated periodically or requested by users through the RequestSnapshot interface. A disk-based state machine's snapshot only contains the complete state machine state when the snapshot is streamed to a replica that has fallen behind, or when a snapshot is exported at the user's request.

Snapshots can be generated in the following ways:

- Periodically, at a frequency set through the Config object, where "periodically" means after every N state machine updates.
- On request, by calling the RequestSnapshot method of NodeHost.

Snapshots serve the following purposes:

- After a regular state machine restarts, its in-memory data is lost and its state is reset. A snapshot can quickly restore the state machine's state without replaying a large number of historical Raft Log records one by one, which usually saves execution time and disk read bandwidth.
- When a node in the Raft group falls significantly behind the leader, the leader can send it a snapshot over the network instead of replicating each proposal one by one, which usually saves transmission time and bandwidth.
- After a regular state machine's state has been saved in a snapshot, all log history before that point can be cleared to release disk storage space.
- Exported snapshots can be used to repair Raft groups in which a majority of nodes have been permanently lost.

The SaveSnapshot method of the state machine interface saves the state machine state to an io.Writer object, and the RecoverFromSnapshot method restores the snapshot data provided in the form of an io.Reader. Dragonboat itself maintains and applies the other information in the snapshot, such as the current membership and session information.

As stated in the description of state machine consistency, after executing a given proposal all replicas should be in the same state. The same requirement applies to snapshot generation: a snapshot is generated after the state machine has executed a specific proposal, and the snapshots generated by different replicas after executing the same proposal should be strictly identical.

Users can set how often to save a snapshot by specifying the SnapshotEntries field in the Config instance provided when starting each node. Typically:

- For a regular state machine, saving 1-2 snapshots per day is recommended.
- For a disk-based state machine, saving a snapshot every hour is recommended.

Users can also use the RequestSnapshot method of the NodeHost to request the creation of a snapshot for a specific Raft group (see the sketch below). A snapshot created with the DefaultSnapshotOption is a regular snapshot managed by the system. If the Exported field of the SnapshotOption is set to true, the created snapshot is exported to the directory specified by ExportPath; exported snapshots can be used for backups and for the future restoration of a Raft group that has lost a majority of its nodes. The user is solely responsible for saving, transferring, and releasing/cleaning up snapshots exported to the specified directory; the system no longer manages them. For requested non-exported snapshots, users can control how much log data is cleaned up after the snapshot is generated, to release disk space, by setting the OverrideCompactionOverhead and CompactionOverhead values of the SnapshotOption.
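A sketch of requesting both kinds of snapshot, using the synchronous SyncRequestSnapshot wrapper as recalled from recent releases; the export directory and the CompactionOverhead value are arbitrary illustrations.

```go
package example

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

func snapshotExample(nh *dragonboat.NodeHost, shardID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// A regular, system-managed snapshot, with the amount of log kept
	// after compaction overridden for this particular request.
	if _, err := nh.SyncRequestSnapshot(ctx, shardID, dragonboat.SnapshotOption{
		OverrideCompactionOverhead: true,
		CompactionOverhead:         5000, // log entries to keep below the snapshot index
	}); err != nil {
		return err
	}

	// An exported snapshot: written to ExportPath and from then on
	// entirely managed by the user (e.g. for backups or ImportSnapshot).
	_, err := nh.SyncRequestSnapshot(ctx, shardID, dragonboat.SnapshotOption{
		Exported:   true,
		ExportPath: "/backup/snapshots", // hypothetical directory
	})
	return err
}
```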

The exported snapshots mentioned above can be used, via the ImportSnapshot method in the tools package, to repair a Raft group that has become unusable after a majority of its nodes permanently failed. At that point data loss has already occurred, and the repair is a lossy operation. Users should avoid such data loss by choosing a reasonable number of replicas and by strengthening server monitoring and maintenance so that a majority of nodes never fail permanently. Note that recoverable failures of a majority of nodes, such as restarts or brief network outages, do not cause data loss; permanent failure means faults such as disk damage or permanent unavailability on a majority of the servers. Please refer to the godoc documentation for the specific usage of ImportSnapshot.

Gossip: By default, when a replica of a Raft group is added to the system, the RaftAddress of the NodeHost it lives on must be explicitly specified by the user so that Raft messages can be delivered to it. This approach is simple and direct, but it requires the RaftAddress to stay fixed, which means using a fixed IP address or a user-maintained DNS name. When that requirement cannot be met, the gossip feature can be used instead.

Starting from version v3.3, each NodeHost instance is randomly assigned a permanent, immutable NodeHostID value of the form nhid-1234567890, which is returned by the NodeHost's ID method. When the AddressByNodeHostID option of NodeHostConfig is set to true, every newly created replica must be specified by its corresponding NodeHostID value. From then on, the RaftAddress used for communication between NodeHost instances may change arbitrarily each time a NodeHost instance starts, and the correspondence between each NodeHost's RaftAddress and its NodeHostID is maintained dynamically by a background gossip service. When a Raft message needs to travel between two replicas, the ReplicaID is first resolved to the corresponding NodeHostID, and the RaftAddress for that NodeHostID is then obtained from the gossip service so the message can be delivered.

The gossip service itself is a fully distributed network service, and users only need to set its relevant address parameters through the NodeHostConfig.Gossip item.
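A sketch of a gossip-enabled NodeHostConfig; field names follow v3.3-era and v4 releases as described above, and every address below is hypothetical. With AddressByNodeHostID set to true, replicas are then started with NodeHostID strings (as returned by the NodeHost's ID method) as their targets instead of RaftAddress values.

```go
package example

import "github.com/lni/dragonboat/v4/config"

// gossipConfig returns a NodeHostConfig that uses NodeHostID-based
// addressing backed by the built-in gossip service.
func gossipConfig() config.NodeHostConfig {
	return config.NodeHostConfig{
		NodeHostDir:         "/data/dragonboat",
		RTTMillisecond:      200,
		RaftAddress:         "10.0.0.1:63000", // may change across restarts
		AddressByNodeHostID: true,
		Gossip: config.GossipConfig{
			BindAddress:      "10.0.0.1:7100",
			AdvertiseAddress: "10.0.0.1:7100",
			// Seed nodes used to join the gossip network; a few known
			// NodeHost gossip addresses are enough.
			Seed: []string{"10.0.0.2:7100", "10.0.0.3:7100"},
		},
	}
}
```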

Other features: Dragonboat provides the following commonly used features through NodeHost:

- Non-voting nodes. Non-voting (observer) nodes do not take part in leader election or in counting votes for the acceptance of proposals; they only receive and execute the proposals that the Raft group has accepted. A non-voting node's state machine is the same as a regular node's and normally holds the same complete state machine state, so it can serve as an additional read-only node from which users can read the state machine state with full consistency guarantees. The other major role of a non-voting node is to let a new node join the Raft group as a non-voting member and gradually acquire the full state machine state before being promoted to a regular node: if a SyncRead or GetShardMembership issued on the NodeHost hosting the non-voting node succeeds, a round of the ReadIndex protocol has completed, which indicates that the node has caught up with almost all of the log entries and is ready to be promoted (see the sketch below).
- Leader transfer. Normally the leader is elected transparently to the user program. Users can call the RequestLeaderTransfer method of NodeHost to attempt to transfer leadership to a specified node.
- NodeHost also provides the GetNodeHostInfo and GetShardMembership methods for querying information about the Raft groups each NodeHost manages.
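A sketch combining the two features above: adding a replica as a non-voting member so it can catch up, then asking for a leader transfer. SyncRequestAddNonVoting is the v4 name for what older releases called adding an observer; the target address and ID values are made up, and RequestLeaderTransfer only initiates the transfer without guaranteeing it succeeds.

```go
package example

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v4"
)

func nonVotingExample(nh *dragonboat.NodeHost, shardID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// Add replica 4 as a non-voting member; passing 0 skips the
	// ConfigChangeID check (single-threaded membership changes).
	if err := nh.SyncRequestAddNonVoting(ctx, shardID, 4,
		"node4.example.com:63000", 0); err != nil {
		return err
	}
	// ...once the non-voting node has caught up, promote it with
	// SyncRequestAddReplica, then optionally hand it leadership.
	return nh.RequestLeaderTransfer(shardID, 4)
}
```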

kolinfluence commented 1 year ago

@lni My attempt at translating https://github.com/lni/dragonboat/blob/master/docs/overview.CHS.md

to English. Please update the docs, or I can help you maintain the English version too.

thx for the great work!