slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
399 stars · 27 forks

Clickhouse Target High Memory Usage #312

Closed max-yan closed 3 months ago

max-yan commented 4 months ago

Issue Description

With --tgt-conn clickhouse, all rows are inserted at the last moment. Memory consumption does not allow loading large tables. While the job is in progress, the _tmp table is empty, so it's not a memory leak.

Same result with --src-conn mysql and --src-conn postgres. This works as expected: --src-conn postgres --tgt-conn mysql (with any use_bulk option).

I'm not a Go developer, but I tried to find the problem. In ClickhouseConn.BulkImportStream only one element of ds.BatchChan is received, and I don't understand how --tgt-conn mysql works, given that BatchChan is filled independently of the target connection.

/usr/local/bin/sling run --src-conn mysql --src-stream "SELECT * FROM src_table_202404" --tgt-conn clickhouse --tgt-object dst_table_20240430 --mode full-refresh -d
2024-05-29 13:07:44 DBG Sling version: 1.2.10 (linux amd64)
2024-05-29 13:07:44 DBG type is db-db
2024-05-29 13:07:44 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-05-29 13:07:44 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-05-29 13:07:44 INF connecting to source database (mysql)
2024-05-29 13:07:44 DBG opened "mysql" connection (conn-mysql-DGF)
2024-05-29 13:07:44 INF connecting to target database (clickhouse)
2024-05-29 13:07:44 DBG opened "clickhouse" connection (conn-clickhouse-Jo4)
2024-05-29 13:07:44 INF reading from source database
2024-05-29 13:07:44 DBG SELECT * FROM src_table_202404
2024-05-29 13:07:44 INF writing to target database [mode: full-refresh]
2024-05-29 13:07:44 DBG drop table if exists `db1`.`dst_table_20240430_tmp`
2024-05-29 13:07:44 DBG table `db1`.`dst_table_20240430_tmp` dropped
2024-05-29 13:07:44 DBG create table `db1`.`dst_table_20240430_tmp` (`link_key` Nullable(Int64),
`src` Nullable(Int32),
`address` Nullable(String),
`load_date` Nullable(DateTime64(6))) engine=MergeTree  ORDER BY tuple()
2024-05-29 13:07:44 INF streaming data
2024-05-29 13:07:44 DBG use `db1`
9s 1,119,157 124952 r/s 137 MB | 19% MEM | 7% CPU 2024-05-29 13:07:54 DBG could not select execution from local .sling.db. record not found
19s 2,387,592 126099 r/s 292 MB | 19% MEM | 4% CPU 2024-05-29 13:08:04 DBG could not select execution from local .sling.db. record not found
29s 3,534,503 119697 r/s 434 MB | 20% MEM | 4% CPU 2024-05-29 13:08:14 DBG could not select execution from local .sling.db. record not found
39s 4,870,173 126203 r/s 608 MB | 20% MEM | 6% CPU 2024-05-29 13:08:24 DBG could not select execution from local .sling.db. record not found
49s 6,256,819 132619 r/s 789 MB | 21% MEM | 4% CPU 2024-05-29 13:08:34 DBG could not select execution from local .sling.db. record not found
59s 7,434,366 123818 r/s 942 MB | 21% MEM | 4% CPU 2024-05-29 13:08:44 DBG could not select execution from local .sling.db. record not found
1m9s 8,741,617 126764 r/s 1.1 GB | 22% MEM | 6% CPU 2024-05-29 13:08:54 DBG could not select execution from local .sling.db. record not found
1m19s 10,164,877 133411 r/s 1.3 GB | 22% MEM | 4% CPU 2024-05-29 13:09:04 DBG could not select execution from local .sling.db. record not found
1m29s 11,405,509 129408 r/s 1.4 GB | 22% MEM | 4% CPU 2024-05-29 13:09:14 DBG could not select execution from local .sling.db. record not found
1m39s 12,797,446 134755 r/s 1.6 GB | 23% MEM | 6% CPU 2024-05-29 13:09:24 DBG could not select execution from local .sling.db. record not found
1m49s 13,976,436 127018 r/s 1.7 GB | 23% MEM | 4% CPU 2024-05-29 13:09:34 DBG could not select execution from local .sling.db. record not found
1m59s 15,333,729 131470 r/s 1.9 GB | 24% MEM | 4% CPU 2024-05-29 13:09:44 DBG could not select execution from local .sling.db. record not found
2m9s 16,588,465 128459 r/s 2.1 GB | 24% MEM | 5% CPU 2024-05-29 13:09:54 DBG could not select execution from local .sling.db. record not found
2m19s 17,923,016 131190 r/s 2.2 GB | 24% MEM | 5% CPU 2024-05-29 13:10:04 DBG could not select execution from local .sling.db. record not found
2m29s 19,279,551 131652 r/s 2.4 GB | 25% MEM | 6% CPU 2024-05-29 13:10:14 DBG could not select execution from local .sling.db. record not found
2m39s 20,664,185 134413 r/s 2.6 GB | 25% MEM | 4% CPU 2024-05-29 13:10:24 DBG could not select execution from local .sling.db. record not found
2m49s 21,480,548 102907 r/s 2.7 GB | 28% MEM | 5% CPU 2024-05-29 13:10:34 DBG could not select execution from local .sling.db. record not found
2m58s 21,480,548 56464 r/s 2.7 GB | 30% MEM | 2% CPU 2024-05-29 13:10:43 DBG COPY 21480548 ROWS
2m58s 21,480,548 121167 r/s 2.7 GB | 28% MEM | 2% CPU
2024-05-29 13:10:43 DBG select count(*) cnt from `db1`.`dst_table_20240430_tmp`
2024-05-29 13:10:43 DBG drop table if exists `db1`.`dst_table_20240430`
2024-05-29 13:10:43 DBG table `db1`.`dst_table_20240430` dropped
2024-05-29 13:10:43 DBG create table `db1`.`dst_table_20240430` (`link_key` Nullable(Int64),
`src` Nullable(Int32),
`address` Nullable(String),
`load_date` Nullable(DateTime64(6))) engine=MergeTree  ORDER BY tuple()
2024-05-29 13:10:43 INF created table `db1`.`dst_table_20240430`
2024-05-29 13:10:43 DBG insert into `db1`.`dst_table_20240430` (`link_key`, `src`, `address`, `load_date`) select `link_key`, `src`, `address`, `load_date` from `db1`.`dst_table_20240430_tmp`
2024-05-29 13:10:52 DBG inserted rows into db1.`dst_table_20240430` from temp table `db1`.`dst_table_20240430_tmp`
2024-05-29 13:10:52 INF inserted 21480548 rows into db1.`dst_table_20240430` in 188 secs [114,187 r/s] [2.7 GB]
2024-05-29 13:10:52 DBG drop table if exists `db1`.`dst_table_20240430_tmp`
2024-05-29 13:10:52 DBG table `db1`.`dst_table_20240430_tmp` dropped
2024-05-29 13:10:52 DBG closed "clickhouse" connection (conn-clickhouse-Jo4)
2024-05-29 13:10:52 DBG closed "mysql" connection (conn-mysql-DGF)
2024-05-29 13:10:52 DBG could not select execution from local .sling.db. record not found
2024-05-29 13:10:52 INF execution succeeded
2024-05-29 13:10:52 DBG could not select execution from local .sling.db. record not found
flarco commented 4 months ago

Hi, I'm not sure I understand the problem.

Memory consumption does not allow to load large tables

Are you facing OOM errors? Is the process being killed?

The way sling works is by inserting everything into a temp table inside a transaction, then inserting into the final table from the temp table. So only when the transaction closes would you be able to see the data.

max-yan commented 4 months ago

Are you facing OOM errors? is the process being killed?

I killed the sling process before OOM. Memory usage was more than 100 GB. The _tmp table was empty.

flarco commented 4 months ago

Ah yes, I opened an issue here for this: https://github.com/ClickHouse/clickhouse-go/issues/1293

This is a problem in the third-party ClickHouse driver that Sling uses.

flarco commented 4 months ago

What you could do is use source_options.limit, or the CLI flag --limit with --incremental. I'll see if there is a good way to flush the records so that memory is released.

max-yan commented 4 months ago

For ClickHouse, sling does not attempt multiple inserts, but writes the whole result at the end. I'm talking about the _tmp table, not the final one.

flarco commented 4 months ago

@max-yan can you compile the binary from branch v1.2.11 and try? I added this line: https://github.com/slingdata-io/sling-cli/pull/303/commits/cc6d22b6bdeed89d7c4e7076af992619a10ad940

max-yan commented 4 months ago

@flarco I can see inserts now, but a SIGSEGV too:

[signal SIGSEGV: segmentation violation code=0x2 addr=0xc05d800000 pc=0x561872]

goroutine 162 gp=0xc001c816c0 m=32 mp=0xc000d2a808 [running]:
runtime.throw({0x346e1ba?, 0x4c4?})
        /usr/local/go/src/runtime/panic.go:1023 +0x5c fp=0xc0000f0d60 sp=0xc0000f0d30 pc=0x4478bc
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:895 +0x285 fp=0xc0000f0dc0 sp=0xc0000f0d60 pc=0x460445
strings.IndexAny({0xc05151d304?, 0xc05151d303?}, {0x346c985?, 0x0?})
        /usr/local/go/src/strings/strings.go:161 +0x172 fp=0xc0000f0e38 sp=0xc0000f0dc0 pc=0x561872
github.com/flarco/g/csv.(*Writer).Write(0xc00220fa70, {0xc0048be360?, 0x9, 0xc0048be360?})
        /home/maxim/go/pkg/mod/github.com/flarco/g@v0.1.97/csv/writer.go:84 +0x334 fp=0xc0000f0ee0 sp=0xc0000f0e38 pc=0xa640b4
github.com/slingdata-io/sling-cli/core/dbio/iop.(*Datastream).writeBwCsv(0xc00194a680, {0xc0048be360?, 0x9?, 0x10?})
        /home/maxim/soft_src/sling-cli/core/dbio/iop/datastream.go:282 +0x27 fp=0xc0000f0f10 sp=0xc0000f0ee0 pc=0x1187407
github.com/slingdata-io/sling-cli/core/dbio/iop.(*Datastream).processBwRows.func1()
        /home/maxim/soft_src/sling-cli/core/dbio/iop/datastream.go:210 +0xc5 fp=0xc0000f0f80 sp=0xc0000f0f10 pc=0x1186be5
github.com/slingdata-io/sling-cli/core/dbio/iop.(*Datastream).processBwRows(0xc00194a680)
        /home/maxim/soft_src/sling-cli/core/dbio/iop/datastream.go:219 +0x91 fp=0xc0000f0fc8 sp=0xc0000f0f80 pc=0x1186af1
github.com/slingdata-io/sling-cli/core/dbio/iop.(*Datastream).Start.gowrap1()
        /home/maxim/soft_src/sling-cli/core/dbio/iop/datastream.go:774 +0x25 fp=0xc0000f0fe0 sp=0xc0000f0fc8 pc=0x118cb05
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000f0fe8 sp=0xc0000f0fe0 pc=0x482a41
created by github.com/slingdata-io/sling-cli/core/dbio/iop.(*Datastream).Start in goroutine 151
        /home/maxim/soft_src/sling-cli/core/dbio/iop/datastream.go:774 +0x121c
flarco commented 4 months ago

Try again with the env var SLING_PROCESS_BW=false? I have researched but don't know why the segmentation violation occurs... It happens randomly, non-deterministically.

flarco commented 4 months ago

@max-yan How is the memory usage? Did it improve?

max-yan commented 4 months ago

@flarco The SIGSEGV is not repeatable. I was able to load even without SLING_PROCESS_BW=false. I don't see memory leaks with a 20 GB CSV file (238,576,508 rows).

flarco commented 4 months ago

Great. @alisman FYI, it looks like batching the inserts works. Will be fixed in the next release, v1.2.11.

max-yan commented 4 months ago

I used batch.Limit = 2000000. An option for this is needed.

flarco commented 4 months ago

@max-yan yes agreed, will add. This was to test.

alisman commented 4 months ago

Excellent! Thanks all!

--Aaron


flarco commented 3 months ago

Added target_options.batch_limit. Closing.