Closed cellsummer closed 2 months ago
Hi: can you read the file with pyarrow.parquet? Use this to check: https://github.com/slingdata-io/sling-cli/blob/main/core/dbio/scripts/check_parquet.py
I just released 1.2.2, and did a last minute upgrade for github.com/apache/arrow/go/v16 to its latest version. Can you try it?
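For anyone following along, the pyarrow.parquet check suggested above could look roughly like the sketch below. This is not the linked check_parquet.py script (its contents are not shown in this thread); the function name `check_parquet` and the returned dictionary keys are made up for illustration. It assumes pyarrow is installed (`pip install pyarrow`).

```python
def check_parquet(path):
    """Try to fully read a Parquet file with pyarrow and summarise the result.

    Hypothetical helper for illustration; not part of Sling or pyarrow.
    """
    try:
        import pyarrow.parquet as pq  # third-party dependency

        pf = pq.ParquetFile(path)
        meta = pf.metadata
        table = pf.read()  # decodes every row group, not just the footer
        return {
            "ok": True,
            "rows": table.num_rows,
            "row_groups": meta.num_row_groups,
            "created_by": meta.created_by,
            "columns": pf.schema_arrow.names,
        }
    except Exception as exc:
        # Covers unreadable files as well as a missing pyarrow install.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Calling `check_parquet("data.parquet")` on a file that Sling fails on and on one that it reads fine, then comparing `created_by` and `columns`, may help show whether the file itself is the problem or only the Go reader.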
The issue seems to be fixed in 1.2.2 when I run a single file. But now there seems to be another issue in replication mode (it hangs at the read data step):
2024-03-31 17:44:23 DBG processing wildcards for S3
2024-03-31 17:44:24 INF Sling Replication [9 streams] | S3 -> LOCAL
2024-03-31 17:44:24 INF [1 / 9] running stream s3://<my-bucket>/data.parquet
2024-03-31 17:44:24 DBG Sling version: 1.2.2 (windows amd64)
2024-03-31 17:44:24 DBG type is file-file
2024-03-31 17:44:24 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"compression":"AUTO","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"delimiter":",","max_decimals":-1}
2024-03-31 17:44:24 DBG using target options: {"header":true,"compression":"AUTO","concurrency":7,"datetime_format":"auto","delimiter":",","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-03-31 17:44:24 INF reading from source file system (s3)
2024-03-31 17:44:24 DBG reading datastream from s3://<my-bucket>/data.parquet [format=parquet]
2024-03-31 17:44:24 DBG downloading to temp file on disk: C:/<my-user>/AppData/Local/Temp/parquet.temp.1711921464764.6EI.parquet
2024-03-31 17:44:25 DBG wrote 133667 bytes to C:/<my-user>AppData/Local/Temp/parquet.temp.1711921464764.6EI.parquet
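Since the log simply stops after the temp file is written, one way to tell a real hang from slow progress is to wrap the run in a timeout. The sketch below is a generic wrapper, not anything Sling provides; `run_with_timeout` is a hypothetical name and the 120-second limit in the usage note is an arbitrary choice.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, exit_code).

    finished is False if the process did not exit within timeout_s seconds.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False, None
```

For example, `run_with_timeout(["sling", "run", "-r", "replication.yaml", "-d"], 120)` would return `(False, None)` if the replication is still stuck after two minutes (the file name `replication.yaml` is a placeholder).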
Ah, do you mind sharing your replication yaml?
Works fine for me on mac & windows:
source: aws_s3
target: local

defaults:
  mode: full-refresh
  object: 'file:///tmp/test1.parquet'

streams:
  test1.parquet:
2024-03-31 19:56:36 INF Sling Replication [1 streams] | aws_s3 -> local
2024-03-31 19:56:36 INF [1 / 1] running stream test1.parquet
2024-03-31 19:56:36 DBG Sling version: 1.2.2 (darwin arm64)
2024-03-31 19:56:36 DBG type is file-file
2024-03-31 19:56:36 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"compression":"AUTO","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"delimiter":",","max_decimals":-1}
2024-03-31 19:56:36 DBG using target options: {"header":true,"compression":"AUTO","concurrency":7,"datetime_format":"auto","delimiter":",","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-03-31 19:56:36 INF reading from source file system (s3)
2024-03-31 19:56:36 DBG reading datastream from s3://bucket/test1.parquet [format=parquet]
2024-03-31 19:56:36 DBG downloading to temp file on disk: /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/parquet.temp.1711925796949.oZb.parquet
2024-03-31 19:56:37 DBG wrote 48871 bytes to /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/parquet.temp.1711925796949.oZb.parquet
2024-03-31 19:56:38 INF writing to target file system (file)
2024-03-31 19:56:38 DBG writing to file:///tmp/test1.parquet [fileRowLimit=0 fileBytesLimit=0 compression=AUTO concurrency=7 useBufferedStream=false fileFormat=parquet]
2024-03-31 19:56:38 DBG wrote 49 kB: 1000 rows [493 r/s]
2024-03-31 19:56:38 INF wrote 1000 rows to file:///tmp/test1.parquet [493 r/s]
2024-03-31 19:56:38 INF execution succeeded
2024-03-31 19:56:38 INF Sling Replication Completed in 2s | aws_s3 -> local | 1 Successes | 0 Failures
Thanks. I will try to get more details tomorrow. Here is my replication yaml:
source: S3
target: DUCKDB

# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: '{stream_file_folder}_{stream_file_name}'

streams:
  "s3://<my-bucket>/*.parquet":
Closing for now. Re-open if needed.
The following errors happen when reading certain parquet sources. I have a bunch of parquet files, all written by the same application. It's strange that Sling can read some of them but gets the following error for the rest.
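When only some files from the same writer fail, a quick first triage is to rule out truncated or otherwise malformed files before looking at the reader. The sketch below uses only the standard library and relies on the Parquet layout: a file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The function name `parquet_footer_info` is hypothetical, and passing this check only shows the file is structurally plausible, not that every page decodes.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of a (non-encrypted-footer) Parquet file


def parquet_footer_info(path):
    """Check the Parquet magic bytes and footer length of a local file."""
    size = os.path.getsize(path)
    if size < 12:  # leading magic (4) + footer length (4) + trailing magic (4)
        return {"ok": False, "reason": "file too small"}
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        return {"ok": False, "reason": "missing PAR1 magic"}
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len > size - 12:
        return {"ok": False, "reason": "footer length exceeds file size"}
    return {"ok": True, "size": size, "footer_len": footer_len}
```

Running this over the good and the failing files gives a cheap signal: if the failing ones pass here too, the difference is more likely in the encoded content (types, encodings, compression) than in the file framing.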