pgEdge / spock

Logical Multi-master Replication
https://github.com/pgedge/pgedge
Other
163 stars 14 forks source link

Spock

Spockbench tests

Multi-Master Replication with Conflict Resolution & Avoidance

This SPOCK extension provides multi-master replication for PostgreSQL 14+. We originally leveraged the pgLogical and BDR2 projects as a solid foundation to build upon for this enterprise-class extension.

Version 4.0 is our current version under active development. It presently includes the following important enhancements beyond v3.3:

Our current production version is v3.3 and includes the following enhancements over v3.2:

Our previous production version was v3.2 and includes the following important enhancements beyond Spock v3.1:

Our initial production version was 3.1 and included the following:

Our beta version was 3.0 and includes the following important enhancements beyond its bdr/pg_logical base:

We use the following terms, borrowed from Jan's well known Slony project, to describe data streams between nodes:

Use cases supported are:

Architectural details:

Major New Features

Snowflake Sequences

Snowflake Sequences is a PostgreSQL extension providing an int8 and sequence based unique ID solution to optionally replace the PostgreSQL built-in bigserial data type. This extension allows Snowflake IDs that are unique within one sequence across multiple PostgreSQL instances in a distributed cluster.

Automatic Replication of DDL

DDL statements can now be automatically replicated. This feature can be enabled by setting the following to on: spock.enable_ddl_replication, spock.include_ddl_repset, and spock.allow_ddl_from_functions. It is recommended to set these to on only when the database schema matches exactly on all nodes- either when all databases have no objects, or when all databases have exactly the same objects and all tables are added to replication sets.

By default, these settings are set to off. When these settings are on, it is recommended that DDL statements dangerous for replication be executed in a maintenance window to avoid errors that will impact replication.

spock.enable_ddl_replication will enable replication of ddl statements through the default replication set. Some DDL statements are intentionally not replicated (ie. CREATE DATABASE), and some are replicated but could cause issues in two ways. Some DDL statements could lead to inconsistent data (ie. CREATE TABLE... AS...) since the DDL statement is replicated before the table is added to the replication set. Some DDL statements are replicated, but are potentially an issue in a 3+ node cluster (ie. DROP TABLE).

spock.include_ddl_repset will enable spock to automatically add tables to replication sets at the time they are created on each node. Tables with Primary Keys will be added to the default replication set, and tables without Primary Keys will be added to the default_insert_only replication set. Altering a table to add or remove a Primary Key will make the correct adjustment to which replication set the table is part of. Setting a table to unlogged will remove it from replication. Detaching a partition will not remove it from replication.

spock.allow_ddl_from_functions will enable spock to automatically replicate DDL statements that are called within functions to also be automatically replicated. This can be turned off if these functions are expected to run on every node.When this is set to off statements replicated from functions adhere to the same rule previously described for 'include_ddl_repset.' If a table possesses a defined primary key, it will be added into the 'default' replication set; alternatively, they will be added to the 'default_insert_only' replication set.

During the auto replication process, various messages are generated to provide information about the execution. Here are the descriptions for each message:

Replication of Partitioned Tables

Partitioned tables can now be replicated. By default, when adding a partitioned table to a replication set, it will include all of its present partitions. The later partitions can be added using the partition_add function. The DDL for the partitioned and partitions should be present on the subscriber nodes (same as for normal tables).

Similarly, when removing a partitioned table from a replication set, by default, the partitions of the table will also be removed.

The replication of partitioned tables is a bit different from normal tables. When doing initial synchronization, we query the partitioned table (or parent) to get all the rows for synchronization purposes and don't synchronize the individual partitions. However, after the initial sync of data, the normal operations resume i.e. the partitions start replicating like normal tables.

It's possible to add individual partitions to the replication set in which case they will be replicated like regular tables (to the table of the same name as the partition on the subscriber). This has performance advantages when partitioning definition is the same on both provider and subscriber, as the partitioning logic does not have to be executed.

Note: There is an exception to individual partition replication, which is, the individual partitions won't sync up the existing data. It's equivalent to setting synchronize_data = false.

When partitions are replicated through a partitioned table, the exception is the TRUNCATE command which always replicates with the list of affected tables or partitions.

Additionally, row_filter can also be used with partitioned tables, as well as with individual partitions.

Conflict-Free Delta-Apply Columns (Conflict Avoidance)

Logical Multi-Master replication can get itself into trouble on running sums (such as a YTD balance). Unlike other solutions, we do NOT have a special data type for this. Any numeric data type will do (including numeric, float, double precision, int4, int8, etc).

Suppose that a running bank account sum contains a balance of $1,000. Two transactions "conflict" because they overlap with each from two different multi-master nodes. Transaction A is a $1,000 withdrawal from the account. Transaction B is also a $1,000 withdrawal from the account. The correct balance is $-1,000. Our Delta-Apply algorithm fixes this problem and highly conflicting workloads with this scenario (like a tpc-c like benchmark) now run correctly at lightning speeds.

This feature is powerful AND simple in its implementation as follows:

Note that on a conflicting transaction, the delta column will get correctly calculated and applied. The configured conflict resolution strategy applies to non-delta columns (normally last-update-wins).

As a special safety-valve feature. If the user ever needs to re-set a log_old_value column you can temporarily alter the column to "log_old_value" is false.

Conflicts Overview

In case the node is subscribed to multiple providers, or when local writes happen on a subscriber, conflicts can arise for the incoming changes. These are automatically detected and can be acted on depending on the configuration.

The configuration of the conflicts resolver is done via the spock.conflict_resolution setting.

The resolved conflicts are logged using the log level set using spock.conflict_log_level. This parameter defaults to LOG. If set to lower level than log_min_messages the resolved conflicts won't appear in the server log.

Conflict Configuration options

Some aspects of Spock can be configured using configuration options that can be either set in postgresql.conf or via ALTER SYSTEM SET.

Requirements

The spock extension must be installed on both provider and subscriber. You must CREATE EXTENSION spock on both. For major version upgrades, the old node can be running a recent version of pgLogical2 before it is upgraded to become a Spock node.

Tables on the provider and subscriber must have the same names and be in the same schema. Future revisions may add mapping features.

Tables on the provider and subscriber must have the same columns, with the same data types in each column. CHECK constraints, NOT NULL constraints, etc., must be the same or weaker (more permissive) on the subscriber than the provider.

Tables must have the same PRIMARY KEYs. It is not recommended to add additional UNIQUE constraints other than the PRIMARY KEY (see below).

Some additional requirements are covered in Limitations and Restrictions.

Usage

This section describes basic usage of the Spock replication extension.
It should be noted the pgEdge, when you install the Spock extension, does this quick setup for you (and more).

Quick setup

First the PostgreSQL server has to be properly configured to support logical decoding:

wal_level = 'logical'
max_worker_processes = 10   # one per database needed on provider node
                            # one per node needed on subscriber node
max_replication_slots = 10  # one per node needed on provider node
max_wal_senders = 10        # one per node needed on provider node
shared_preload_libraries = 'spock'
track_commit_timestamp = on # needed for conflict resolution

pg_hba.conf has to allow logical replication connections from localhost. Logical replication connections are treated by pg_hba.conf as regular connections to the provider database.

Next the spock extension has to be installed on all nodes in the database to be replicated:

CREATE EXTENSION spock;

Now create the provider node:

SELECT spock.node_create(
    node_name := 'provider1',
    dsn := 'host=providerhost port=5432 dbname=db'
);

Add all tables in public schema to the default replication set.

SELECT spock.repset_add_all_tables('default', ARRAY['public']);

Optionally you can also create additional replication sets and add tables to them (see Replication sets).

It's usually better to create replication sets before subscribing so that all tables are synchronized during initial replication setup in a single initial transaction. However, users of bigger databases may instead wish to create them incrementally for better control.

Once the provider node is setup, subscribers can be subscribed to it. First the subscriber node must be created:

SELECT spock.node_create(
    node_name := 'subscriber1',
    dsn := 'host=thishost port=5432 dbname=db'
);

And finally on the subscriber node you can create the subscription which will start synchronization and replication process in the background:

SELECT spock.sub_create(
    subscription_name := 'subscription1',
    provider_dsn := 'host=providerhost port=5432 dbname=db'
);

SELECT spock.sub_wait_for_sync('subscription1');

Creating subscriber nodes with base backups

In addition to the SQL-level node and subscription creation, spock also supports creating a subscriber by cloning the provider with pg_basebackup and starting it up as a spock subscriber. This is done with the spock_create_subscriber tool; see the --help output.

Unlike spock.sub_create's data sync options, this clone ignores replication sets and copies all tables on all databases. However, it's often much faster, especially over high-bandwidth links.

Node management

Nodes can be added and removed dynamically using the SQL interfaces.

spock-node-create

spock-node-drop

spock-node-add-interface

spock-node-drop-interface

Subscription management

spock-sub-create

spock-sub-drop

spock-sub-disable

spock-sub-enable

spock-sub-alter-interface

spock-sub-sync

spock-sub-resync-table

spock-sub-wait-for-sync

spock-sub-wait-table-sync

spock-sub-show-status

spock-sub-show-table

spock-sub-add-repset

spock-sub-remove-repset

There is also a postgresql.conf parameter, spock.extra_connection_options, that may be set to assign connection options that apply to all connections made by spock. This can be a useful place to set up custom keepalive options, etc.

spock defaults to enabling TCP keepalives to ensure that it notices when the upstream server disappears unexpectedly. To disable them add keepalives = 0 to spock.extra_connection_options.

Replication sets

Replication sets provide a mechanism to control which tables in the database will be replicated and which actions on those tables will be replicated.

Each replicated set can specify individually if INSERTs, UPDATEs, DELETEs and TRUNCATEs on the set are replicated. Every table can be in multiple replication sets and every subscriber can subscribe to multiple replication sets as well. The resulting set of tables and actions replicated is the union of the sets the table is in. The tables are not replicated until they are added into a replication set.

There are three preexisting replication sets named "default", "default_insert_only" and "ddl_sql". The "default" replication set is defined to replicate all changes to tables in it. The "default_insert_only" only replicates INSERTs and is meant for tables that don't have primary key (see Limitations section for details). The "ddl_sql" replication set is defined to replicate schema changes specified by spock.replicate_ddl

The following functions are provided for managing the replication sets:

spock-repset-create

spock-repset-alter

spock-repset-drop

spock-repset-add-table

spock-repset-add-all-tables

spock-repset-remove-table

spock-repset-add-seq

Warning: For a multi master system, adding sequences to replication sets is not recomended. Use our new Snowflake Sequences instead.

spock-repset-add-all-seqs

Warning: For a multi master system, adding sequences to replication sets is not recomended. Use our new Snowflake Sequences instead.

spock-repset-remove-seq

You can view the information about which table is in which set by querying the spock.tables view.

Automatic assignment of replication sets for new tables

The event trigger facility can be used for describing rules which define replication sets for newly created tables.

Example:

CREATE OR REPLACE FUNCTION spock_assign_repset()
RETURNS event_trigger AS $$
DECLARE obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        IF obj.object_type = 'table' THEN
            IF obj.schema_name = 'config' THEN
                PERFORM spock.repset_add_table('configuration', obj.objid);
            ELSIF NOT obj.in_extension THEN
                PERFORM spock.repset_add_table('default', obj.objid);
            END IF;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER spock_assign_repset_trg
    ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE', 'CREATE TABLE AS')
    EXECUTE PROCEDURE spock_assign_repset();

The above example will put all new tables created in schema config into replication set configuration and all other new tables which are not created by extensions will go to default replication set.

Additional functions

spock-replicate-ddl

spock-seq-sync

Row Filtering

Spock allows row based filtering both on provider side and the subscriber side.

Row Filtering on Provider

On the provider the row filtering can be done by specifying row_filter parameter for the spock.repset_add_table function. The row_filter is normal PostgreSQL expression which has the same limitations on what's allowed as the CHECK constraint.

Simple row_filter would look something like row_filter := 'id > 0' which would ensure that only rows where values of id column is bigger than zero will be replicated.

It's allowed to use volatile function inside row_filter but caution must be exercised with regard to writes as any expression which will do writes will throw error and stop replication.

It's also worth noting that the row_filter is running inside the replication session so session specific expressions such as CURRENT_USER will have values of the replication session and not the session which did the writes.

Row Filtering on Subscriber

On the subscriber the row based filtering can be implemented using standard BEFORE TRIGGER mechanism.

It is required to mark any such triggers as either ENABLE REPLICA or ENABLE ALWAYS otherwise they will not be executed by the replication process.

Synchronous Replication

Synchronous replication is supported using same standard mechanism provided by PostgreSQL for physical replication.

The synchronous_commit and synchronous_standby_names settings will affect when COMMIT command reports success to client if spock subscription name is used in synchronous_standby_names. Refer to PostgreSQL documentation for more info about how to configure these two variables.

The batch inserts will improve replication performance of transactions that did many inserts into one table. Spock will switch to batch mode when transaction did more than 5 INSERTs.

It's only possible to switch to batch mode when there are no INSTEAD OF INSERT and BEFORE INSERT triggers on the table and when there are no defaults with volatile expressions for columns of the table. Also the batch mode will only work when spock.conflict_resolution is set to error.

The default is true.

Limitations and restrictions

Superuser is required

Currently spock replication and administration requires superuser privileges. It may be later extended to more granular privileges.

UNLOGGED and TEMPORARY not replicated

UNLOGGED and TEMPORARY tables will not and cannot be replicated, much like with physical streaming replication.

One database at a time

To replicate multiple databases you must set up individual provider/subscriber relationships for each. There is no way to configure replication for all databases in a PostgreSQL install at once.

PRIMARY KEY or REPLICA IDENTITY required

UPDATEs and DELETEs cannot be replicated for tables that lack a PRIMARY KEY or other valid replica identity such as using an index, which must be unique, not partial, not deferrable, and include only columns marked NOT NULL. Replication has no way to find the tuple that should be updated/deleted since there is no unique identifier. REPLICA IDENTITY FULL is not supported yet.

Only one unique index/constraint/PK

If more than one upstream is configured or the downstream accepts local writes then only one UNIQUE index should be present on downstream replicated tables. Conflict resolution can only use one index at a time so conflicting rows may ERROR if a row satisfies the PRIMARY KEY but violates a UNIQUE constraint on the downstream side. This will stop replication until the downstream table is modified to remove the violation.

It's fine to have extra unique constraints on an upstream if the downstream only gets writes from that upstream and nowhere else. The rule is that the downstream constraints must not be more restrictive than those on the upstream(s).

Partial secondary unique indexes are permitted, but will be ignored for conflict resolution purposes.

Unique constraints must not be deferrable

On the downstream end spock does not support index-based constraints defined as DEFERRABLE. It will emit the error

ERROR: spock doesn't support index rechecks needed for deferrable indexes
DETAIL: relation "public"."test_relation" has deferrable indexes: "index1", "index2"

if such an index is present when it attempts to apply changes to a table.

DDL

Automatic DDL replication is not supported. Managing DDL so that the provider and subscriber database(s) remain compatible is the responsibility of the user.

spock provides the spock.replicate_ddl function to allow DDL to be run on the provider and subscriber at a consistent point.

No replication queue flush

There's no support for freezing transactions on the master and waiting until all pending queued xacts are replayed from slots. Support for making the upstream read-only for this will be added in a future release.

This means that care must be taken when applying table structure changes. If there are committed transactions that aren't yet replicated and the table structure of the provider and subscriber are changed at the same time in a way that makes the subscriber table incompatible with the queued transactions replication will stop.

Administrators should either ensure that writes to the master are stopped before making schema changes, or use the spock.replicate_ddl function to queue schema changes so they're replayed at a consistent point on the replica.

Once multi-master replication support is added then using spock.replicate_ddl will not be enough, as the subscriber may be generating new xacts with the old structure after the schema change is committed on the publisher. Users will have to ensure writes are stopped on all nodes and all slots are caught up before making schema changes.

FOREIGN KEYS

Foreign keys constraints are not enforced for the replication process - what succeeds on provider side gets applied to subscriber even if the FOREIGN KEY would be violated.

TRUNCATE

Using TRUNCATE ... CASCADE will only apply the CASCADE option on the provider side.

(Properly handling this would probably require the addition of ON TRUNCATE CASCADE support for foreign keys in PostgreSQL).

TRUNCATE ... RESTART IDENTITY is not supported. The identity restart step is not replicated to the replica.

Sequences

We strongly recommend that you use our new Snowflake Sequences rather than using the legacy sequences described below.

The state of sequences added to replication sets is replicated periodically and not in real-time. Dynamic buffer is used for the value being replicated so that the subscribers actually receive future state of the sequence. This minimizes the chance of subscriber's notion of sequence's last_value falling behind but does not completely eliminate the possibility.

It might be desirable to call sync_sequence to ensure all subscribers have up to date information about given sequence after "big events" in the database such as data loading or during the online upgrade.

It's generally recommended to use bigserial and bigint types for sequences on multi-node systems as smaller sequences might reach end of the sequence space fast.

Users who want to have independent sequences on provider and subscriber can avoid adding sequences to replication sets and create sequences with step interval equal to or greater than the number of nodes. And then setting a different offset on each node. Use the INCREMENT BY option for CREATE SEQUENCE or ALTER SEQUENCE, and use setval(...) to set the start point.

Triggers

Apply process and the initial COPY process both run with session_replication_role set to replica which means that ENABLE REPLICA and ENABLE ALWAYS triggers will be fired.

PostgreSQL Version differences

Spock can replicate across PostgreSQL major versions. Despite that, long term cross-version replication is not considered a design target, though it may often work. Issues where changes are valid on the provider but not on the subscriber are more likely to arise when replicating across versions.

It is safer to replicate from an old version to a newer version since PostgreSQL maintains solid backward compatibility but only limited forward compatibility. Initial schema synchronization is only supported when replicating between same version of PostgreSQL or from lower version to higher version.

Replicating between different minor versions makes no difference at all.

Database encoding differences

Spock does not support replication between databases with different encoding. We recommend using UTF-8 encoding in all replicated databases.

Large objects

PostgreSQL's logical decoding facility does not support decoding changes to large objects, so spock cannot replicate large objects.

Also any DDL limitations apply so extra care need to be taken when using replicate_ddl_command().

Spock Read Only

Spock supports enabling a cluster to be operated in read-only mode.

The read-only status is managed using a GUC (Grand Unified Configuration) parameter named spock.readonly. This parameter can be set to enable or disable the read-only mode. The read-only mode restricts non-superusers to read-only operations, while superusers can still perform both read and write operations regardless of the setting.

The flag is at cluster level: either all databases are read-only or all databases are read-write (the usual setting).

The read-only mode is implemented by filtering SQL statements:

This means that the databases are in read-only mode at SQL level: however, the checkpointer, background writer, walwriter, and the autovacuum launcher are still running; this means that the database files are not read-only and that in some cases the database may still write to disk.

Cluster Read-Only Mode

The cluster read-only mode can now be controlled using the GUC parameter spock.readonly. This configuration parameter allows you to set the cluster to read-only mode. Note that only a superuser can change this setting. When the cluster is set to read-only mode, non-superusers will be restricted to read-only operations, while superusers will still be able to perform read and write operations regardless of the setting.

Setting Read-Only Mode

This value can be changed using the ALTER SYSTEM command.

ALTER SYSTEM SET spock.readonly = 'on';
SELECT pg_reload_conf();

To set the cluster to read-only mode for a session, use the SET command. Here are the steps:

SET spock.readonly TO on;

To query the current status of the cluster, you can use the following SQL command:

SHOW spock.readonly;

This command will return on if the cluster is in read-only mode and off if it is not.

Notes

Spock is licensed under the pgEdge Community License v1.0