transferwise / pipelinewise

Data Pipeline Framework using the singer.io spec
https://transferwise.github.io/pipelinewise
Apache License 2.0
629 stars 116 forks

Log based mode performance #972

Open podviaznikov opened 2 years ago

podviaznikov commented 2 years ago

This one is a follow up on https://github.com/transferwise/pipelinewise/issues/971 but a different case.

I was also testing LOG_BASED replication and it works, but it feels like it could be faster.

E.g. I've set batch_size_rows to 100000 and I see in the logs that syncing each batch of 100K takes 1.5 minutes, which looks a bit slow. Is there any way to speed this up? (I did try fastsync_parallelism.)
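For reference, a minimal sketch of where these knobs live in a PipelineWise tap YAML (values and ids are illustrative, not recommendations; note that fastsync_parallelism generally only affects the initial fastsync/full resync, not ongoing LOG_BASED replication):

```yaml
# Sketch of a PipelineWise tap config (e.g. tap_postgres.yml).
# Ids, hosts and values are placeholders.
id: "postgres_source"
name: "Postgres source database"
type: "tap-postgres"
db_conn:
  host: "<host>"
  port: 5432
  user: "<user>"
  password: "<password>"
  dbname: "<dbname>"
target: "snowflake"            # id of a target defined elsewhere
batch_size_rows: 100000        # rows buffered before flushing to the target
fastsync_parallelism: 8        # parallel workers for the initial fastsync only
schemas:
  - source_schema: "public"
    target_schema: "analytics"
    tables:
      - table_name: "orders"
        replication_method: "LOG_BASED"
```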

Samira-El commented 2 years ago

What do you wanna speed up exactly? There are many moving parts here.

podviaznikov commented 2 years ago

General speed, I think. If I have more than 100K inserts/updates/deletes per 1.5-minute sync, the job will start to back up.

Samira-El commented 2 years ago

Just a disclaimer: there will always be some lag. This will not do real-time native replication unless you're willing to provision powerful machines or Snowflake warehouses.

You need to investigate where your bottleneck is; the replication is both CPU- and IO-bound:

Is the tap processing insert/update/delete events fast enough for your needs? If not, then there is no config to change here; you have to bump the CPU of the machines on which PipelineWise runs.

But if the tap is fast while the Snowflake warehouse you're using is small, or has other workloads running on it, then there may be queues in the warehouse: batches will take longer to flush, and the pipeline will sit idle in the meantime, regardless of how fast the tap consumes change logs. If the warehouse is well provisioned, you could try a smaller batch size or time-based batch flushing (check out the pipelinewise-target-snowflake README).
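As a sketch of the time-based flushing idea (assuming your pipelinewise-target-snowflake version supports batch_wait_limit_seconds; check its README for the exact key and semantics), the tap YAML could cap both the batch size and the wait time:

```yaml
# Sketch: batch flush tuning in a PipelineWise tap YAML.
# Values are illustrative only.
batch_size_rows: 20000         # flush when the buffer reaches 20K rows...
batch_wait_limit_seconds: 300  # ...or after 5 minutes, whichever comes first
```

Whichever limit is hit first triggers a flush, so a quiet change stream no longer waits indefinitely for a full batch.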

Tolsto commented 2 years ago

The main issue is that most of the CPU-intensive operations for log-based replication, in both the tap and target components, are single-threaded and cannot easily be scaled. Your only options there are to use multiple replication slots with multiple instances of PipelineWise, each syncing different tables, or to use CPU cores with a higher clock speed or IPC. For better scalability we'd need process-based parallelism when processing the log in the tap, and probably also when building the batches in the target.

If the Snowflake operations slow you down, then I'd suggest using larger batch sizes, as this will decrease your Snowflake overhead substantially. If memory management becomes a problem with larger batches, have a look at this PR.
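One way to read the multiple-slots suggestion: define two independent taps against the same database, each owning a disjoint set of tables, so each tap gets its own replication slot and its own single-threaded pipeline. A sketch (ids and table names are made up; tap-postgres typically derives the slot name from the tap/database identity, so the taps must have distinct ids):

```yaml
# Sketch: tap_postgres_a.yml -- first PipelineWise instance/tap
id: "postgres_a"               # distinct id -> distinct replication slot
type: "tap-postgres"
target: "snowflake"
schemas:
  - source_schema: "public"
    target_schema: "analytics"
    tables:
      - table_name: "orders"       # high-volume table isolated here
        replication_method: "LOG_BASED"

# A second file, tap_postgres_b.yml, would have id "postgres_b" and the
# remaining tables, so the two change streams are processed in parallel.
```

The trade-off is that each extra slot makes Postgres retain WAL independently, so a stalled tap can cause WAL to accumulate on the source.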

If you have hstore or array type columns in your tables then this problem here will also seriously slow you down.