slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

[bug][clickhouse] column name conflict after casting #417

Open yokofly opened 1 month ago

yokofly commented 1 month ago

Issue Description

First, create a ClickHouse table with two columns whose names differ only by letter casing:

Query id: 2fcd3457-fecc-4243-9dfd-93e78b4af3e8

┌─statement──────────────────────────────────────────────────────┐
1. │ CREATE TABLE default.vv ( id Int32, Id Int32 ) ENGINE = Memory │
   └─────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

Then I want to use sling to sync it into another table, but I hit a column naming conflict:

./sling run --src-conn CLICKHOUSE --src-stream 'vv' --tgt-conn CLICKHOUSE --tgt-object "vv2" --mode full-refresh -d


- Sling version (`sling --version`): 1.2.21, installed via `wget https://github.com/slingdata-io/sling-cli/releases/download/v1.2.21/sling_linux_amd64.tar.gz`

- Operating System (`linux`, `mac`, `windows`): linux

(base) ➜ ck export CLICKHOUSE='clickhouse://default@localhost:9000/default'
(base) ➜ ck ./sling run --src-conn CLICKHOUSE --src-stream 'vv' --tgt-conn CLICKHOUSE --tgt-object "vv2" --mode full-refresh -d
2024-10-23 07:18:21 DBG Sling version: 1.2.21 (linux amd64)
2024-10-23 07:18:21 DBG type is db-db
2024-10-23 07:18:21 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-10-23 07:18:21 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-10-23 07:18:21 DBG using target options: {"batch_limit":100000,"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-10-23 07:18:21 DBG opened "clickhouse" connection (conn-clickhouse-mlD)
2024-10-23 07:18:21 DBG opened "clickhouse" connection (conn-clickhouse-dFr)
2024-10-23 07:18:21 INF connecting to source database (clickhouse)
2024-10-23 07:18:21 INF connecting to target database (clickhouse)
2024-10-23 07:18:21 INF reading from source database
2024-10-23 07:18:21 DBG select * from default.vv
2024-10-23 07:18:21 INF writing to target database [mode: full-refresh]
2024-10-23 07:18:21 DBG drop table if exists default.vv2_tmp
2024-10-23 07:18:21 DBG table default.vv2_tmp dropped
2024-10-23 07:18:21 DBG create table default.vv2_tmp (id Nullable(Int64), Id Nullable(Int64)) engine=MergeTree ORDER BY tuple()
2024-10-23 07:18:21 INF created table default.vv2_tmp
2024-10-23 07:18:21 INF streaming data
2024-10-23 07:18:21 DBG use default
2024-10-23 07:18:21 DBG drop table if exists default.vv2_tmp
2024-10-23 07:18:21 DBG table default.vv2_tmp dropped
2024-10-23 07:18:21 DBG closed "clickhouse" connection (conn-clickhouse-dFr)
2024-10-23 07:18:21 DBG closed "clickhouse" connection (conn-clickhouse-mlD)
2024-10-23 07:18:21 INF execution failed
fatal:
--- proc.go:271 main ---
--- sling_cli.go:458 main ---
--- sling_cli.go:494 cliInit ---
--- cli.go:286 CliProcess ---
--- sling_run.go:225 processRun ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:396 runTask ---
--- task_run.go:155 Execute ---

--- task_run.go:116 func2 ---
--- task_run.go:559 runDbToDb ---
--- task_run_write.go:234 WriteToDb ---
--- database.go:2313 BulkImportFlow ---
~ could not bulk import
--- database.go:2300 func1 ---
~ could not copy data
--- database_clickhouse.go:261 BulkImportStream ---
~ could not prepare statement
--- database_clickhouse.go:171 func2 ---
~ could not prepare statement
--- database.go:1062 Prepare ---
~ could not prepare Tx: insert into default.vv2_tmp (Id, Id) values ($1, $2)
--- transaction.go:95 Prepare ---
code: 15, message: Column Id in table default.vv2_tmp (44a2af7b-757a-4ec1-a543-ff3bd745ce0a) specified more than once

--- task_run.go:116 func2 ---
~ Could not WriteToDb
--- task_run.go:559 runDbToDb ---
~ could not insert into default.vv2_tmp
--- task_run_write.go:240 WriteToDb ---
(base) ➜ ck ./clickhouse client -q "show create vv"
CREATE TABLE default.vv\n(\n id Int32,\n Id Int32\n)\nENGINE = Memory

yokofly commented 1 month ago

@flarco any comments? I don't know the context for why we need to cast column names to lower or upper case for some databases.

flarco commented 1 month ago

@yokofly I need to look into it. Column names have caused issues in the past, especially when sourcing from files. But slinging from db to db should keep the original names, agreed.

flarco commented 4 weeks ago

@yokofly actually, this is different. This is happening here, where Id and id resolve to the same Id column from the GetColumn function. This is tricky because, with casing conventions differing across database systems, lower-casing the key to normalize the lookup (in GetColumn) is necessary. Any ideas?
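As a minimal Go sketch of that collision (illustrative only, not sling's actual GetColumn code): keying a lookup map by the lower-cased column name collapses `id` and `Id` into a single entry, so one of them shadows the other.

```go
package main

import (
	"fmt"
	"strings"
)

// Column is a simplified stand-in for a column definition.
type Column struct {
	Name string
	Type string
}

func main() {
	cols := []Column{{"id", "Int32"}, {"Id", "Int32"}}

	// Normalize the key by lower-casing it, as a case-insensitive lookup would.
	byKey := map[string]Column{}
	for _, c := range cols {
		byKey[strings.ToLower(c.Name)] = c // "Id" overwrites the entry for "id"
	}

	fmt.Println(len(byKey)) // 1 -> both source columns resolve to the same entry
}
```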

yokofly commented 4 weeks ago

@flarco sorry, I'm still confused about why we can't keep the same naming convention. If we ensure the source db stream and the potential target db table use the same conversion, that should be enough.

The only issue I see is when we use custom SQL for the source stream to retrieve columns, since it's hard to maintain the conversion in that case.

yokofly commented 4 weeks ago

> Column names have caused issues in the past, especially when sourcing from files.

I am not familiar with the other DBMSs/drivers, but I think even for db-to-file or file-to-db we don't need the conversion either. Or are we worried about importing a file with incorrect headers into the db?

yokofly commented 4 weeks ago

I checked the code again; we use to_lower(name) in many places. It's probably not an easy change.

flarco commented 4 weeks ago

> It's probably not an easy change.

Exactly. The ValidateColumnNames function is important for building the insert statement correctly. Using lower-case as the key is the best way (that I can think of) to compare upper/lower/mixed-case column names across the various file/database systems... there is also the column-casing target option to consider (which is applied upstream in the logic sequence).

This would probably need some careful refactoring, which I don't think is worth it at the moment, since your situation is fairly rare: two columns in the same table with the same name, differing only in letter casing.
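A rough sketch of how that kind of lookup produces the failing statement (again, an assumed simplification rather than sling's ValidateColumnNames): resolving each stream column against the target table case-insensitively returns the same table column for both `id` and `Id`, which yields the duplicated `insert into ... (Id, Id)` seen in the logs.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	streamCols := []string{"id", "Id"} // columns coming from the source stream
	tableCols := []string{"id", "Id"}  // columns of the target temp table

	// Index target columns by lower-cased name; "Id" overwrites "id".
	lookup := map[string]string{}
	for _, c := range tableCols {
		lookup[strings.ToLower(c)] = c
	}

	// Resolve each stream column against the target table case-insensitively.
	insertCols := make([]string, 0, len(streamCols))
	for _, c := range streamCols {
		insertCols = append(insertCols, lookup[strings.ToLower(c)])
	}

	fmt.Printf("insert into vv2_tmp (%s) values ($1, $2)\n", strings.Join(insertCols, ", "))
	// Output: insert into vv2_tmp (Id, Id) values ($1, $2)
}
```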

flarco commented 4 weeks ago

@yokofly I was thinking about it, and I believe this change could solve your issue. Could you try the latest dev build?

yokofly commented 4 weeks ago

@flarco thanks, I gave it a quick try, but it doesn't seem to work.

(base) ➜  sling-cli git:(v1.2.22) ✗ bash scripts/build.sh &&  ./sling run --src-conn CLICKHOUSE  --src-stream 'v22' --tgt-conn CLICKHOUSE  --tgt-object "v232" --mode full-refresh -d 
2024-10-24 04:38:39 DBG Sling version: dev (linux amd64)
2024-10-24 04:38:39 DBG type is db-db
2024-10-24 04:38:39 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-10-24 04:38:39 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-10-24 04:38:39 DBG using target options: {"batch_limit":100000,"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-10-24 04:38:39 DBG opened "clickhouse" connection (conn-clickhouse-23n)
2024-10-24 04:38:39 DBG opened "clickhouse" connection (conn-clickhouse-boD)
2024-10-24 04:38:39 INF connecting to source database (clickhouse)
2024-10-24 04:38:39 INF connecting to target database (clickhouse)
2024-10-24 04:38:39 INF reading from source database
2024-10-24 04:38:39 DBG select * from `default`.`v22`
2024-10-24 04:38:39 INF writing to target database [mode: full-refresh]
2024-10-24 04:38:39 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 04:38:39 DBG table `default`.`v232_tmp` dropped
2024-10-24 04:38:39 DBG create table `default`.`v232_tmp` (`id` Nullable(Int64),
`Id` Nullable(Int64)) engine=MergeTree   ORDER BY tuple()
2024-10-24 04:38:39 INF created table `default`.`v232_tmp`
2024-10-24 04:38:39 INF streaming data
2024-10-24 04:38:39 DBG use `default`
2024-10-24 04:38:39 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 04:38:39 DBG table `default`.`v232_tmp` dropped
2024-10-24 04:38:39 DBG closed "clickhouse" connection (conn-clickhouse-boD)
2024-10-24 04:38:39 DBG closed "clickhouse" connection (conn-clickhouse-23n)
2024-10-24 04:38:39 INF execution failed
fatal:
--- proc.go:271 main ---
--- sling_cli.go:458 main ---
--- sling_cli.go:494 cliInit ---
--- cli.go:286 CliProcess ---
--- sling_run.go:225 processRun ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:396 runTask ---
--- task_run.go:155 Execute ---

--- task_run.go:116 func2 ---
--- task_run.go:559 runDbToDb ---
--- task_run_write.go:241 WriteToDb ---
--- database.go:2303 BulkImportFlow ---
~ could not bulk import
--- database.go:2290 func1 ---
~ could not copy data
--- database_clickhouse.go:261 BulkImportStream ---
~ could not prepare statement
--- database_clickhouse.go:171 func2 ---
~ could not prepare statement
--- database.go:1062 Prepare ---
~ could not prepare Tx: insert into `default`.`v232_tmp` (`Id`, `Id`) values  ($1, $2)
--- transaction.go:95 Prepare ---
code: 15, message: Column Id in table default.v232_tmp (0eb12dd3-3297-4531-95b7-fad720ac2d7e) specified more than once

--- task_run.go:116 func2 ---
~ Could not WriteToDb
--- task_run.go:559 runDbToDb ---
~ could not insert into `default`.`v232_tmp`
--- task_run_write.go:247 WriteToDb ---
(base) ➜  sling-cli git:(v1.2.22)

I spent several hours debugging the CSV-related change :laughing: By the way, I forked this repo a few weeks ago; since that was during the holiday weeks, my bugfixes/enhancements will be posted back to this repo later, and I need some time to gather the PRs.

My current code works like the commit below. Because I'm not sure whether it would break anything, I intend to extract a function that reports whether a database's naming is case-sensitive. https://github.com/timeplus-io/sling-cli/pull/35/commits/bd273cb7bf69249a0c9e60e2d7032e9f2cd084b4#diff-a06b343a2a892e7f04c5f6d6b1ee74d8c1bab25e2545304c95cbb0cbb11a2691R599-R607
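Roughly something like this, as a hypothetical sketch (the function name, the DbType values, and the dialect list are placeholders, not sling's actual API): dialects whose column naming is case-sensitive would keep the original names instead of going through the lower-cased lookup.

```go
package main

import "fmt"

// DbType is a placeholder for the connector type.
type DbType string

const (
	ClickhouseDbType DbType = "clickhouse"
	PostgresDbType   DbType = "postgres"
)

// namingIsCaseSensitive reports whether a dialect treats column names that
// differ only by letter casing as distinct columns. The list here is an
// assumption for illustration and would need to be verified per database.
func namingIsCaseSensitive(t DbType) bool {
	switch t {
	case ClickhouseDbType:
		return true // `id` and `Id` are distinct columns in ClickHouse
	default:
		return false
	}
}

func main() {
	fmt.Println(namingIsCaseSensitive(ClickhouseDbType)) // true
	fmt.Println(namingIsCaseSensitive(PostgresDbType))   // false
}
```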

yokofly commented 4 weeks ago

I agree this is a rare case; I can post a PR later.

flarco commented 4 weeks ago

OK, I made another change; I think that should do it. If not, it can be figured out later.

yokofly commented 4 weeks ago

Sorry, it still doesn't seem to work.

full log

```
(base) ➜ Downloads ./sling run --src-conn CLICKHOUSE --src-stream 'v22' --tgt-conn CLICKHOUSE --tgt-object "v232" --mode full-refresh -d
2024-10-24 07:49:06 DBG Sling version: 1.2.22 (linux amd64)
2024-10-24 07:49:06 DBG type is db-db
2024-10-24 07:49:06 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-10-24 07:49:06 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-10-24 07:49:06 DBG using target options: {"batch_limit":100000,"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-10-24 07:49:06 DBG opened "clickhouse" connection (conn-clickhouse-pVu)
2024-10-24 07:49:06 DBG opened "clickhouse" connection (conn-clickhouse-vSe)
2024-10-24 07:49:06 INF connecting to source database (clickhouse)
2024-10-24 07:49:06 INF connecting to target database (clickhouse)
2024-10-24 07:49:06 INF reading from source database
2024-10-24 07:49:06 DBG select * from `default`.`v22`
2024-10-24 07:49:06 INF writing to target database [mode: full-refresh]
2024-10-24 07:49:06 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 07:49:06 DBG table `default`.`v232_tmp` dropped
2024-10-24 07:49:06 DBG create table `default`.`v232_tmp` (`id` Nullable(Int64), `Id` Nullable(Int64)) engine=MergeTree ORDER BY tuple()
2024-10-24 07:49:06 INF created table `default`.`v232_tmp`
2024-10-24 07:49:06 INF streaming data
2024-10-24 07:49:06 DBG use `default`
2024-10-24 07:49:06 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 07:49:06 DBG table `default`.`v232_tmp` dropped
2024-10-24 07:49:06 DBG closed "clickhouse" connection (conn-clickhouse-vSe)
2024-10-24 07:49:06 DBG closed "clickhouse" connection (conn-clickhouse-pVu)
2024-10-24 07:49:06 INF execution failed
fatal:
--- proc.go:271 main ---
--- sling_cli.go:458 main ---
--- sling_cli.go:494 cliInit ---
--- cli.go:286 CliProcess ---
--- sling_run.go:225 processRun ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:396 runTask ---
--- task_run.go:155 Execute ---
--- task_run.go:116 func2 ---
--- task_run.go:559 runDbToDb ---
--- task_run_write.go:241 WriteToDb ---
--- database.go:2303 BulkImportFlow ---
~ could not bulk import
--- database.go:2290 func1 ---
~ could not copy data
--- database_clickhouse.go:261 BulkImportStream ---
~ could not prepare statement
--- database_clickhouse.go:171 func2 ---
~ could not prepare statement
--- database.go:1062 Prepare ---
~ could not prepare Tx: insert into `default`.`v232_tmp` (`Id`, `Id`) values ($1, $2)
--- transaction.go:95 Prepare ---
code: 15, message: Column Id in table default.v232_tmp (d50c4847-ffeb-4a28-80ba-f290e8d70e00) specified more than once
--- task_run.go:116 func2 ---
~ Could not WriteToDb
--- task_run.go:559 runDbToDb ---
~ could not insert into `default`.`v232_tmp`
--- task_run_write.go:247 WriteToDb ---
(base) ➜ Downloads
```