Open yokofly opened 1 month ago
@flarco any comments? I don't know the context for why we need to cast to lower or to upper case for some databases.
@yokofly I need to look into it. Column names have caused issues in the past, especially when sourcing from files. But slinging from db to db should keep the original name, agreed.
@yokofly actually, this is different. This is happening here, where `Id` and `id` resolve to the same `Id` column from the `GetColumn` function.
This is tricky because, when dealing with different casings across database systems, applying lower case on the key to normalize the lookup (in `GetColumn`) is necessary. Any ideas?
@flarco sorry, I'm still confused why we can't use the same naming convention. If we ensure the source db stream and potential target db table use the same conversion, that should be enough.
The only issue I see is when we use customized SQL for the source stream to retrieve columns, as it's hard to maintain the conversion in that case.
> Column names have caused issues in the past, especially when sourcing from files.
I'm not familiar with other DBMSs/drivers, but I think even for db-to-file or file-to-db we don't need the conversion either. Or are we worried about importing a file with incorrect headers into the database?
I checked the code again; we use `to_lower(name)` in many places. It's probably not an easy change.
> it's prob not an easy change.
Exactly. The `ValidateColumnNames` function is important for building the insert statement correctly. Using the lower-cased name as a key is the best way (that I can think of) to compare upper/lower/mixed-case column names across various file/database systems... there is also the `column_casing` target option to consider (which is applied upstream in the logic sequence).
This would probably need some careful refactoring, which I don't think is worth it at the moment, since your situation is fairly rare: two columns in the same table with the same name but different letter casing.
@yokofly I was thinking about it, and I believe this change could solve your issue. Could you try the latest dev build?
@flarco thanks, I gave it a quick try; it doesn't seem to work.
(base) ➜ sling-cli git:(v1.2.22) ✗ bash scripts/build.sh && ./sling run --src-conn CLICKHOUSE --src-stream 'v22' --tgt-conn CLICKHOUSE --tgt-object "v232" --mode full-refresh -d
2024-10-24 04:38:39 DBG Sling version: dev (linux amd64)
2024-10-24 04:38:39 DBG type is db-db
2024-10-24 04:38:39 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-10-24 04:38:39 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-10-24 04:38:39 DBG using target options: {"batch_limit":100000,"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-10-24 04:38:39 DBG opened "clickhouse" connection (conn-clickhouse-23n)
2024-10-24 04:38:39 DBG opened "clickhouse" connection (conn-clickhouse-boD)
2024-10-24 04:38:39 INF connecting to source database (clickhouse)
2024-10-24 04:38:39 INF connecting to target database (clickhouse)
2024-10-24 04:38:39 INF reading from source database
2024-10-24 04:38:39 DBG select * from `default`.`v22`
2024-10-24 04:38:39 INF writing to target database [mode: full-refresh]
2024-10-24 04:38:39 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 04:38:39 DBG table `default`.`v232_tmp` dropped
2024-10-24 04:38:39 DBG create table `default`.`v232_tmp` (`id` Nullable(Int64),
`Id` Nullable(Int64)) engine=MergeTree ORDER BY tuple()
2024-10-24 04:38:39 INF created table `default`.`v232_tmp`
2024-10-24 04:38:39 INF streaming data
2024-10-24 04:38:39 DBG use `default`
2024-10-24 04:38:39 DBG drop table if exists `default`.`v232_tmp`
2024-10-24 04:38:39 DBG table `default`.`v232_tmp` dropped
2024-10-24 04:38:39 DBG closed "clickhouse" connection (conn-clickhouse-boD)
2024-10-24 04:38:39 DBG closed "clickhouse" connection (conn-clickhouse-23n)
2024-10-24 04:38:39 INF execution failed
fatal:
--- proc.go:271 main ---
--- sling_cli.go:458 main ---
--- sling_cli.go:494 cliInit ---
--- cli.go:286 CliProcess ---
--- sling_run.go:225 processRun ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:396 runTask ---
--- task_run.go:155 Execute ---
--- task_run.go:116 func2 ---
--- task_run.go:559 runDbToDb ---
--- task_run_write.go:241 WriteToDb ---
--- database.go:2303 BulkImportFlow ---
~ could not bulk import
--- database.go:2290 func1 ---
~ could not copy data
--- database_clickhouse.go:261 BulkImportStream ---
~ could not prepare statement
--- database_clickhouse.go:171 func2 ---
~ could not prepare statement
--- database.go:1062 Prepare ---
~ could not prepare Tx: insert into `default`.`v232_tmp` (`Id`, `Id`) values ($1, $2)
--- transaction.go:95 Prepare ---
code: 15, message: Column Id in table default.v232_tmp (0eb12dd3-3297-4531-95b7-fad720ac2d7e) specified more than once
--- task_run.go:116 func2 ---
~ Could not WriteToDb
--- task_run.go:559 runDbToDb ---
~ could not insert into `default`.`v232_tmp`
--- task_run_write.go:247 WriteToDb ---
(base) ➜ sling-cli git:(v1.2.22)
It took me several hours to debug the CSV-related change :laughing: By the way, I forked this repo a few weeks ago, since during the holiday weeks bugfixes/enhancements could only be posted back to this repo later; I need some time to gather the PRs.
My current code works like the change below. Because I'm not sure whether it would break things, I intend to extract a function that reports whether a database's naming is case-sensitive. https://github.com/timeplus-io/sling-cli/pull/35/commits/bd273cb7bf69249a0c9e60e2d7032e9f2cd084b4#diff-a06b343a2a892e7f04c5f6d6b1ee74d8c1bab25e2545304c95cbb0cbb11a2691R599-R607
I agree this is a rare case; I can post a PR later.
OK made another change, I think that should do it. If not, can be figured out later.
sorry, it still doesn't seem to work.
Issue Description
Query id: 2fcd3457-fecc-4243-9dfd-93e78b4af3e8

┌─statement─────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.vv ( `id` Int32, `Id` Int32 ) ENGINE = Memory    │
└───────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.
./sling run --src-conn CLICKHOUSE --src-stream 'vv' --tgt-conn CLICKHOUSE --tgt-object "vv2" --mode full-refresh -d
(base) ➜ ck export CLICKHOUSE='clickhouse://default@localhost:9000/default'
(base) ➜ ck ./sling run --src-conn CLICKHOUSE --src-stream 'vv' --tgt-conn CLICKHOUSE --tgt-object "vv2" --mode full-refresh -d
2024-10-23 07:18:21 DBG Sling version: 1.2.21 (linux amd64)
2024-10-23 07:18:21 DBG type is db-db
2024-10-23 07:18:21 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-10-23 07:18:21 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":11}
2024-10-23 07:18:21 DBG using target options: {"batch_limit":100000,"datetime_format":"auto","file_max_rows":0,"max_decimals":11,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-10-23 07:18:21 DBG opened "clickhouse" connection (conn-clickhouse-mlD)
2024-10-23 07:18:21 DBG opened "clickhouse" connection (conn-clickhouse-dFr)
2024-10-23 07:18:21 INF connecting to source database (clickhouse)
2024-10-23 07:18:21 INF connecting to target database (clickhouse)
2024-10-23 07:18:21 INF reading from source database
2024-10-23 07:18:21 DBG select * from `default`.`vv`
2024-10-23 07:18:21 INF writing to target database [mode: full-refresh]
2024-10-23 07:18:21 DBG drop table if exists `default`.`vv2_tmp`
2024-10-23 07:18:21 DBG table `default`.`vv2_tmp` dropped
2024-10-23 07:18:21 DBG create table `default`.`vv2_tmp` (`id` Nullable(Int64),
`Id` Nullable(Int64)) engine=MergeTree ORDER BY tuple()
2024-10-23 07:18:21 INF created table `default`.`vv2_tmp`
2024-10-23 07:18:21 INF streaming data
2024-10-23 07:18:21 DBG use `default`
2024-10-23 07:18:21 DBG drop table if exists `default`.`vv2_tmp`
2024-10-23 07:18:21 DBG table `default`.`vv2_tmp` dropped
2024-10-23 07:18:21 DBG closed "clickhouse" connection (conn-clickhouse-dFr)
2024-10-23 07:18:21 DBG closed "clickhouse" connection (conn-clickhouse-mlD)
2024-10-23 07:18:21 INF execution failed
fatal:
--- proc.go:271 main ---
--- sling_cli.go:458 main ---
--- sling_cli.go:494 cliInit ---
--- cli.go:286 CliProcess ---
--- sling_run.go:225 processRun ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:396 runTask ---
--- task_run.go:155 Execute ---
--- task_run.go:116 func2 ---
--- task_run.go:559 runDbToDb ---
--- task_run_write.go:234 WriteToDb ---
--- database.go:2313 BulkImportFlow ---
~ could not bulk import
--- database.go:2300 func1 ---
~ could not copy data
--- database_clickhouse.go:261 BulkImportStream ---
~ could not prepare statement
--- database_clickhouse.go:171 func2 ---
~ could not prepare statement
--- database.go:1062 Prepare ---
~ could not prepare Tx: insert into `default`.`vv2_tmp` (`Id`, `Id`) values ($1, $2)
--- transaction.go:95 Prepare ---
code: 15, message: Column Id in table default.vv2_tmp (44a2af7b-757a-4ec1-a543-ff3bd745ce0a) specified more than once
--- task_run.go:116 func2 ---
~ Could not WriteToDb
--- task_run.go:559 runDbToDb ---
~ could not insert into `default`.`vv2_tmp`
--- task_run_write.go:240 WriteToDb ---
(base) ➜ ck ./clickhouse client -q "show create vv"
CREATE TABLE default.vv\n(\n    `id` Int32,\n    `Id` Int32\n)\nENGINE = Memory