
# PostgreSQL to ClickHouse

Continuous data transfer from PostgreSQL to ClickHouse using logical replication mechanism.

## Status of the project

Currently the pg2ch tool is in an active testing stage; it is not yet intended for production use.

## Getting and running

Get:

    go get -u github.com/mkabilov/pg2ch

Run:

    pg2ch --config {path to the config file (default config.yaml)}

## Config file

tables:
    {postgresql table name}:
        main_table: {clickhouse table name}
        buffer_table: {clickhouse buffer table name} # optional; if not specified, data is inserted directly into the main table
        buffer_row_id: {clickhouse buffer table column name for row id} 
        init_sync_skip: {skip initial copy of the data}
        init_sync_skip_buffer_table: {if true, bypass the buffer_table and write directly to the main_table during the initial sync copy}
                                     # makes sense in case of huge tables
        init_sync_skip_truncate: {skip truncating the main_table during the initial sync}
        engine: {clickhouse table engine: MergeTree, ReplacingMergeTree or CollapsingMergeTree}
        max_buffer_length: {number of DML (insert/update/delete) commands to store in memory before flushing to the buffer/main table}
        merge_threshold: {if a buffer table is specified, number of buffer flushes before moving data from the buffer to the main table}
        columns: # postgresql to clickhouse column name mapping;
                 # if not specified, all columns are expected to exist on the clickhouse side with exactly the same names
            {postgresql column name}: {clickhouse column name}
        is_deleted_column: # for the ReplacingMergeTree engine: 1 will be stored in the {is_deleted_column} to mark deleted rows
        sign_column: {clickhouse sign column name, for the CollapsingMergeTree engine only, default "sign"}
        ver_column: {clickhouse version column name for the ReplacingMergeTree engine, default "ver"}

inactivity_merge_timeout: {interval, default 1 min} # merge buffered data after that timeout

clickhouse: # clickhouse tcp protocol connection params
    host: {clickhouse host, default 127.0.0.1}
    port: {tcp port, default 9000}
    database: {database name}
    username: {username}
    password: {password}
    params:
        {extra param name}:{extra param value}
        ...

postgres: # postgresql connection params
    host: {host name, default 127.0.0.1}
    port: {port, default 5432}
    database: {database name}
    user: {user}
    replication_slot_name: {logical replication slot name}
    publication_name: {postgresql publication name}

db_path: {path to the persistent storage dir where table lsn positions will be stored}
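
To make the table-level options more concrete, here is a sketch of ClickHouse DDL that such a configuration could map onto, using the `ReplacingMergeTree` engine and a buffer table. The table and column names (`items`, `items_buf`, `payload`, `row_id`, ...) are hypothetical and only illustrate which columns the `engine`, `ver_column`, `is_deleted_column` and `buffer_row_id` settings refer to; they are not required by pg2ch:

```sql
-- Hypothetical ClickHouse DDL illustrating the config options above;
-- table and column names are examples only.

-- Main table for engine: ReplacingMergeTree
CREATE TABLE items
(
    id         Int32,
    payload    String,
    ver        UInt64,   -- named by ver_column (default "ver")
    is_deleted UInt8     -- named by is_deleted_column; set to 1 for deleted rows
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY id;

-- Optional buffer table (buffer_table) that gets flushed into the main table;
-- it must carry the extra row-id column named by buffer_row_id.
CREATE TABLE items_buf
(
    id         Int32,
    payload    String,
    ver        UInt64,
    is_deleted UInt8,
    row_id     UInt64    -- named by buffer_row_id
)
ENGINE = Memory();
```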

### Sample setup:

- create a buffer table on the ClickHouse side:

        CREATE TABLE pgbench_accounts_buf (aid Int32, abalance Int32, sign Int8, row_id UInt64) ENGINE = Memory() -- will be used as a buffer table

- create `config.yaml` file with the following content:
```yaml
tables:
    pgbench_accounts:
        main_table: pgbench_accounts
        buffer_table: pgbench_accounts_buf
        buffer_row_id: row_id
        engine: CollapsingMergeTree
        max_buffer_length: 1000
        merge_threshold: 4
        columns:
            aid: aid
            abalance: abalance
        sign_column: sign

inactivity_merge_timeout: '10s'

clickhouse:
    host: localhost
    port: 9000
    database: default
    username: default
postgres:
    host: localhost
    port: 5432
    database: pg2ch_test
    user: postgres
    replication_slot_name: pg2ch_slot
    publication_name: pg2ch_pub

db_path: db
```
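
The sample config above refers to a ClickHouse main table `pgbench_accounts` and to the PostgreSQL replication slot `pg2ch_slot` and publication `pg2ch_pub`, none of which are created by the snippets shown here. A minimal sketch of those missing objects is below; it assumes the standard `pgoutput` logical decoding plugin and the modern `ORDER BY` MergeTree syntax, so adjust it to your environment (PostgreSQL must also run with `wal_level = logical` for logical replication to work):

```sql
-- ClickHouse: main table matching the sample config
-- (CollapsingMergeTree with the "sign" column, same data columns as the buffer table)
CREATE TABLE pgbench_accounts (aid Int32, abalance Int32, sign Int8)
ENGINE = CollapsingMergeTree(sign) ORDER BY aid;

-- PostgreSQL: publication and logical replication slot named in the config;
-- 'pgoutput' is an assumption here, check which plugin your pg2ch build expects
CREATE PUBLICATION pg2ch_pub FOR TABLE pgbench_accounts;
SELECT pg_create_logical_replication_slot('pg2ch_slot', 'pgoutput');
```

Once these objects exist, starting the tool with `pg2ch --config config.yaml` should perform the initial sync and then keep streaming changes.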