mkabilov / pg2ch

Data streaming from postgresql to clickhouse via logical replication mechanism
MIT License
195 stars 35 forks source link

Clickhouse table doesn't have same counts as Postgres table (following readme example) #35

Closed mattTruebill closed 3 years ago

mattTruebill commented 3 years ago

After following the readme example, I end up with:

In Clickhouse:

SELECT SUM(abalance * sign), SUM(sign), count(0) FROM pgbench_accounts;

SELECT
    SUM(abalance * sign),
    SUM(sign),
    count(0)
FROM pgbench_accounts

┌─SUM(multiply(abalance, sign))─┬─SUM(sign)─┬─count(0)─┐
│                       -931490 │         0 │    49744 │
└───────────────────────────────┴───────────┴──────────┘

And in Postgres:

pg2ch_test=# SELECT SUM(abalance), COUNT(*) FROM pgbench_accounts;
   sum   | count  
---------+--------
 -931490 | 100000
(1 row)

I see that the sums are the same but the counts are different. Is this because only changes are replicated and, thus, only rows that underwent a change are represented in Clickhouse?

mkabilov commented 3 years ago

not really. pg2ch creates an initial copy first, and then applies changes

mattTruebill commented 3 years ago

Thanks for the quick reply, @mkabilov! TL;DR, I've run through the readme example a few times and it seems consistently the case that the initial copy isn't taking place. Could you let me know your thoughts on why this seems to be the case or if I'm missing something? Thanks so much! Details follow.

Versions

Ubuntu 20.04.1 LTS
postgres 12.5
Clickhouse server version 20.10.3 revision 54441

Setup & Test

PG_CONN=postgres://postgres:password@localhost:5432
cd go/src/github.com/mkabilov/pg2ch
go build

# PG & CH reset reset
psql $PG_CONN/postgres -c "DROP DATABASE pg2ch_test;"
psql $PG_CONN/postgres -c "select pg_drop_replication_slot('pg2ch_slot');"
clickhouse-client --password=password --query "drop table pgbench_accounts;"
clickhouse-client --password=password --query "drop table pgbench_accounts_buf;"

# PG setup
psql $PG_CONN/postgres -c "CREATE DATABASE pg2ch_test;"
pgbench -U postgres -d pg2ch_test -i
psql $PG_CONN/pg2ch_test -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;"
psql $PG_CONN/pg2ch_test -c "CREATE PUBLICATION pg2ch_pub FOR TABLE pgbench_accounts;"
psql $PG_CONN/pg2ch_test -c "SELECT * FROM pg_create_logical_replication_slot('pg2ch_slot', 'pgoutput');"

# Clickhouse setup
clickhouse-client --password=password --query="CREATE TABLE pgbench_accounts (aid Int32, abalance Int32, sign Int8) ENGINE = CollapsingMergeTree(sign) ORDER BY aid"
clickhouse-client --password=password --query="CREATE TABLE pgbench_accounts_buf (aid Int32, abalance Int32, sign Int8, row_id UInt64) ENGINE = Memory()"

# Begin pg2ch
./pg2ch --config config.yaml > log.txt 2> log_err.txt &

# Make changes
pgbench -U postgres -d pg2ch_test --time 30 --client 10
sleep 15

# Check CH & PG
psql $PG_CONN/pg2ch_test -c "SELECT SUM(abalance), COUNT(*) FROM pgbench_accounts" > pg_result.txt
clickhouse-client --password=password --query="SELECT SUM(abalance * sign), SUM(sign) FROM pgbench_accounts" > ch_results.txt

Results

Here's the stderr output of pg2ch (stdout output is empty):

2021/01/20 15:22:48 consuming changes for table pgbench_accounts starting from 0/1560E890 lsn position
2021/01/20 15:22:48 generation_id: 51209
2021/01/20 15:22:48 Starting from 0/1560E890 lsn
2021/01/20 15:23:48 FlushToMainTable for pgbench_accounts pg table processed in 0s (rows: 51928)

Here are the contents of pg_result.txt:

   sum   | count  
---------+--------
 -303129 | 100000
(1 row)

Here are the contents of ch_result.txt:

-303129 0   51928
mattTruebill commented 3 years ago

Just wanted to follow-up with a naive tracing of what's going on... I could totally be wrong and appreciate your help in figuring out if this is user-error on my part or a bug or something else :).

It looks like an initial copy of a table will only occur if there's no tableLSN for that table (see here) Sure enough, every time I start pg2ch, there's always a value for tableLSN for the pgbench_accounts table. This would seem to explain the mechanism by which the initial copy seems to not occur?

All that said, I don't know what's expected; is there supposed to be no tableLSN value there until after the initial copy? Something else?

On a separate note, some results that would seem to support that everything besides the initial copy is working as expected follow. I ran pgbench a few times. After each run, I compared results between PG and CH. PG and CH both showed always the same sum. PG count was always 100000. CH counts increased between each run, presumably because different rows were being updated on each run and so those records would get replicated over to CH.

To wit:

#After 1st run of pgbench:
pg_sum: -195275
ch_sum: -195275
pg_count: 100000
ch_count: 20430

#After 2nd run of pgbench:
pg_sum: -55540
ch_sum: -55540
pg_count: 100000
ch_count: 38968

#After 3rd run of pgbench:
pg_sum: -221687
ch_sum: -221687
pg_count: 100000
ch_count: 57610

Probably worth noting that those sums are only matching because the value of abalance is initially 0 for all records in pgbench_accounts. To test that hypothesis, I ran pgbench prior to creating the replication slot to get some non-0 values for abalance. After then turning on replication and running pgbench, I got different sums for PG and CH as I suspected would happen.

mattTruebill commented 3 years ago

Hi @mkabilov,

I'm happy to report I got this working. It looks like rm -rf the pg2ch directory and doing a clean install fixed it. Might there be some stateful directory contents that make killing the process insufficient if one wants to start replication "from 0"?

mattTruebill commented 3 years ago

Meant to close w the preceding comment. Thanks @mkabilov !