slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

Parquet invalid encoding / not valid UTF8 #435

Open OneCyrus opened 1 day ago

OneCyrus commented 1 day ago

Issue Description

Running "select * from output.parquet" with the latest DuckDB against a Parquet file written by Sling fails with: Invalid Input Error: Invalid string encoding found in Parquet file: value "\x00\x00\x00\x00\xA8Y\xE2w" is not valid UTF8!
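To illustrate why DuckDB rejects this value, here is a minimal Python sketch that decodes the exact bytes from the error message and reports where UTF-8 decoding breaks. The helper name `first_invalid_utf8_offset` is hypothetical and is not part of Sling or DuckDB. The value is 8 bytes long, which would be consistent with a binary column (SQL Server `rowversion` values are 8 bytes, for example) being written as a Parquet string, but that is only a guess; nothing in this thread confirms which column is affected.

```python
# Bytes copied from the DuckDB error message in this issue.
raw = b"\x00\x00\x00\x00\xA8Y\xE2w"


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data is valid UTF-8. (Hypothetical helper for illustration.)"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # The four leading NUL bytes are valid UTF-8; decoding fails at 0xA8,
    # which is a continuation byte with no lead byte before it.
    print(len(raw), first_invalid_utf8_offset(raw))
```

Running the same check over the values of a suspect column before export is one way to narrow down which source column carries binary data.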

export MSSQL='sqlserver://user:pw@server:1433?database=mytable'
sling run --src-conn MSSQL --src-stream 'SELECT * FROM dbo.[ConfigurationItem]' --tgt-object 'file:///runner/project/output.parquet' -d
2024-11-06 15:01:15 DBG Sling version: 1.2.22 (linux amd64)
2024-11-06 15:01:15 DBG type is db-file
2024-11-06 15:01:15 DBG using: {"columns":null,"mode":"full-refresh","transforms":null}
2024-11-06 15:01:15 DBG using source options: {"empty_as_null":false,"null_if":"NULL","datetime_format":"AUTO","max_decimals":-1}
2024-11-06 15:01:15 DBG using target options: {"header":true,"compression":"auto","concurrency":7,"datetime_format":"auto","delimiter":",","file_max_rows":0,"file_max_bytes":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-11-06 15:01:15 DBG opened "sqlserver" connection (conn-sqlserver-nU9)
2024-11-06 15:01:15 INF connecting to source database (sqlserver)
2024-11-06 15:01:15 INF reading from source database
2024-11-06 15:01:15 DBG SELECT * FROM dbo.[ConfigurationItem]
2024-11-06 15:01:16 INF writing to target file system (file)
2024-11-06 15:01:16 DBG opened "file" connection (conn-file-DLa)
2024-11-06 15:01:16 DBG writing to file:///runner/project/output.parquet [fileRowLimit=0 fileBytesLimit=0 compression=auto concurrency=7 useBufferedStream=false fileFormat=parquet singleFile=true]
2024-11-06 15:05:47 DBG wrote 138 MB: 467182 rows [1,714 r/s]
4m29s 466,602 1737 r/s 1.4 GB | 58% MEM | 86% CPU
2024-11-06 15:05:47 INF wrote 467182 rows [1,714 r/s] to file:///runner/project/output.parquet
2024-11-06 15:05:47 DBG closed "sqlserver" connection (conn-sqlserver-nU9)
2024-11-06 15:05:47 INF execution succeeded

flarco commented 1 day ago

Yes, Sling will soon use DuckDB under the hood to read/write Parquet files. The Go driver (github.com/apache/arrow/go) is unfortunately not of great quality and has caused many issues. Stay tuned.

OneCyrus commented 15 hours ago

good to know. thanks!