slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
301 stars 16 forks source link

Add Escaping to CSV Reader #250

Closed rishabhkaushal07 closed 2 months ago

rishabhkaushal07 commented 2 months ago

Issue Description

sling doesn't automatically consider backslashed quotes \" that are inside quotes when loading a csv file. For instance, consider this example: column1_data|"column2 data with \"space\""|column3_data.

export STARROCKS='{ type: starrocks, url: "starrocks://root@<host_ip>:9030/db", fe_url: "http://<host_ip>:8030"}'

./sling run \
    --src-stream file:///SFU_Fact_Screening_2019_quotes_preprocessed4.csv \
    --src-options '{"format": "csv", "options": {"delimiter": "|", "header": true}}' \
    --tgt-conn STARROCKS \
    --tgt-object db.SFU_Fact_Screening_2019  \
    --mode full-refresh \
    --debug
39m12s 50,243,563 21133 r/s 32 GB | 24% MEM | 35% CPU 2024-04-01 23:32:44 DBG stream-load completed for /tmp/starrocks/db/SFU_Fact_Screening_2016_tmp/2024-04-01T225331.105/part.01.0201.csv => {
    "TxnId": 2067,
    "Label": "06486c9f-eb7f-4b32-9abb-60a928fafdab",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 250000,
    "NumberLoadedRows": 250000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 161816798,
    "LoadTimeMs": 1286,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 820,
    "WriteDataTimeMs": 1264,
    "CommitAndPublishTimeMs": 20
}
39m13s 50,264,679 21132 r/s 32 GB | 24% MEM | 34%39m14s 50,285,881 21137 r/s 32 GB | 24% MEM | 30%39m15s 50,307,326 21157 r/s 32 GB | 23% MEM | 30%39m16s 50,328,840 21180 r/s 32 GB | 23% MEM | 29%39m17s 50,350,427 21206 r/s 32 GB | 22% MEM | 31%39m18s 50,371,967 21228 r/s 32 GB | 22% MEM | 29%39m19s 50,393,599 21254 r/s 32 GB | 22% MEM | 28%39m20s 50,415,289 21282 r/s 32 GB | 22% MEM | 28%39m21s 50,436,768 21294 r/s 33 GB | 23% MEM | 42%39m22s 50,457,489 21257 r/s 33 GB | 24% MEM | 63%39m23s 50,476,640 21121 r/s 33 GB | 24% MEM | 38% CPU 2024-04-01 23:32:54 DBG loading /tmp/starrocks/db/SFU_Fact_Screening_2016_tmp/2024-04-01T225331.105/part.01.0202.csv [165 MB] ds.1712037210845.LfE-0
2024-04-01 23:32:55 DBG loading /tmp/starrocks/db/SFU_Fact_Screening_2016_tmp/2024-04-01T225331.105/part.01.0203.csv [5.3 MB] ds.1712037210845.LfE-0
39m24s 50,497,742 21374 r/s 33 GB | 24% MEM | 34% CPU 
2024-04-01 23:32:55 DBG drop table if exists `db`.`SFU_Fact_Screening_2016_tmp`
2024-04-01 23:32:55 DBG table `db`.`SFU_Fact_Screening_2016_tmp` dropped
2024-04-01 23:32:55 DBG closed "starrocks" connection (conn-starrocks-nw3)
2024-04-01 23:32:55 INF execution failed

Perhaps setting the delimiter (source_options.delimiter) would help? See https://docs.slingdata.io/sling-cli/run/configuration#source

2024-04-01 23:32:55 DBG stream-load completed for /tmp/starrocks/db/SFU_Fact_Screening_2016_tmp/2024-04-01T225331.105/part.01.0203.csv => {
    "TxnId": -1,
    "Label": "9215d99c-c373-4552-8676-4e5877a60de1",
    "Status": "Fail",
    "Message": "unknown table \"db.SFU_Fact_Screening_2016_tmp\"",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 0,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 0,
    "CommitAndPublishTimeMs": 0
}
fatal:
--- sling_cli.go:418 func1 ---
--- sling_cli.go:474 cliInit ---
--- cli.go:284 CliProcess ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_logic.go:224 processRun ---
--- sling_logic.go:371 runTask ---
~ execution failed
--- task_run.go:138 Execute ---

--- fs.go:756 func1 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- datastream.go:787 func9 ---
~ error during iteration
--- datastream.go:700 2 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

context canceled

--- database_starrocks.go:430 func2 ---
~ error writing dataflow to local storage: /tmp/starrocks/db/SFU_Fact_Screening_2016_tmp/2024-04-01T225331.105
--- fs.go:795 WriteDataflowReady ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- datastream.go:787 func9 ---
~ error during iteration
--- datastream.go:700 2 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- fs.go:756 func1 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- datastream.go:787 func9 ---
~ error during iteration
--- datastream.go:700 2 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- task_run.go:97 func1 ---
~ could not write to database
--- task_run.go:387 runFileToDB ---
~ could not insert into `db`.`SFU_Fact_Screening_2016_tmp`.
--- task_run_write.go:307 WriteToDb ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields

context canceled

--- datastream.go:787 func9 ---
~ error during iteration
--- datastream.go:700 2 ---

--- datastream.go:705 func9 ---
--- datastream.go:2040 next ---
~ Error reading file
--- datastream.go:910 func1 ---
record on line 50507895: wrong number of fields
flarco commented 2 months ago

Thanks. Will take a look.

flarco commented 2 months ago

Done, please use 1.2.4

flarco commented 2 months ago

FYI You have to specify source_options.escape: "\"