slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

runtime error when reading parquet files #243

Closed: cellsummer closed this issue 2 months ago

cellsummer commented 3 months ago

The following error happens when reading certain parquet sources. I have a bunch of parquet files, all written by the same application. Strangely, Sling can read some of them but fails with the error below for the rest.

panic: runtime error: index out of range [0] with length 0

goroutine 95 [running]:
github.com/apache/arrow/go/v16/parquet/internal/utils.unpack32Avx2({0x7ff71458d0a0, 0xc001a4b080}, {0xc00174e000, 0x7?, 0x1?}, 0x0)
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/utils/bit_packing_avx2_amd64.go:48 +0x228
github.com/apache/arrow/go/v16/parquet/internal/utils.(*BitReader).GetBatchIndex(0xc001d37900, 0x0, {0xc00174e000, 0x80, 0x400})
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/utils/bit_reader.go:230 +0x1be
github.com/apache/arrow/go/v16/parquet/internal/utils.(*RleDecoder).GetBatchWithDictByteArray(0xc0017f26e0, {0x7ff7145b6958, 0xc001d24ec0}, {0xc001a75800, 0x80, 0x80})
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/utils/typed_rle_dict.gen.go:1163 +0x136
github.com/apache/arrow/go/v16/parquet/internal/utils.(*RleDecoder).GetBatchWithDict(0x7ff710e6eefd?, {0x7ff7145b6958?, 0xc001d24ec0?}, {0x7ff713597a00?, 0xc00174d9c0?})
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/utils/rle.go:423 +0x169
github.com/apache/arrow/go/v16/parquet/internal/encoding.(*dictDecoder).decode(...)
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/encoding/decoder.go:146
github.com/apache/arrow/go/v16/parquet/internal/encoding.(*DictByteArrayDecoder).Decode(0xc001122200, {0xc001a75800?, 0x7ff7139e0b40?, 0x7ff71458cf01?})
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/internal/encoding/typed_encoder.gen.go:1369 +0x67
github.com/apache/arrow/go/v16/parquet/file.(*ByteArrayColumnChunkReader).ReadBatch.func1(0x0, 0x80)
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/file/column_reader_types.gen.go:263 +0xab
github.com/apache/arrow/go/v16/parquet/file.(*columnChunkReader).readBatch(0xc001a58d80, 0x80, {0xc000782300, 0x80, 0x80}, {0xc000782400, 0x80, 0x80}, 0xc00174db20)
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/file/column_reader.go:514 +0xc2
github.com/apache/arrow/go/v16/parquet/file.(*ByteArrayColumnChunkReader).ReadBatch(0x7ff7118950ad?, 0xc001a58d80?, {0xc001a75800?, 0xc000aff680?, 0xe9?}, {0xc000782300?, 0x7ff7118b4bf8?, 0xc001a58d80?}, {0xc000782400, 0x80, ...})
    C:/Users/runneradmin/go/pkg/mod/github.com/apache/arrow/go/v16@v16.0.0-20240215131144-a03d957b5b8d/parquet/file/column_reader_types.gen.go:262 +0x7d
github.com/slingdata-io/sling-cli/core/dbio/iop.(*ParquetArrowDumper).readNextBatch(0xc001dbc240)
    D:/a/sling-cli/sling-cli/core/dbio/iop/parquet_arrow.go:374 +0x34a
github.com/slingdata-io/sling-cli/core/dbio/iop.(*ParquetArrowDumper).Next(0xc001dbc240)
    D:/a/sling-cli/sling-cli/core/dbio/iop/parquet_arrow.go:393 +0x52
github.com/slingdata-io/sling-cli/core/dbio/iop.(*ParquetArrowReader).readRowsLoop(0xc00141b0e0)
    D:/a/sling-cli/sling-cli/core/dbio/iop/parquet_arrow.go:232 +0x63b
created by github.com/slingdata-io/sling-cli/core/dbio/iop.NewParquetArrowReader in goroutine 94
    D:/a/sling-cli/sling-cli/core/dbio/iop/parquet_arrow.go:74 +0x305
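Note that the panic happens inside the reader goroutine created by NewParquetArrowReader, and an unrecovered panic in any goroutine crashes the whole process. A minimal, self-contained sketch of the general recover-in-goroutine pattern (this is illustrative, not Sling's actual code; readRows and the simulated decode step are hypothetical):

```go
package main

import "fmt"

// readRows simulates a reader loop that may panic deep inside a
// decoding library, as the arrow parquet decoder does in the trace
// above. Wrapping the goroutine body with recover() converts the
// panic into an error the consumer can handle, instead of killing
// the entire CLI process.
func readRows(rows []int) error {
	errCh := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errCh <- fmt.Errorf("reader panicked: %v", r)
				return
			}
			errCh <- nil
		}()
		// Simulated decode step: indexing an empty slice panics with
		// "index out of range [0] with length 0", like the trace above.
		_ = rows[0]
	}()
	return <-errCh
}

func main() {
	fmt.Println(readRows([]int{}))
}
```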

flarco commented 3 months ago

Hi:

flarco commented 3 months ago

I just released 1.2.2 and did a last-minute upgrade of github.com/apache/arrow/go/v16 to its latest version. Can you try it?

cellsummer commented 3 months ago

The issue seems to be fixed in 1.2.2 when I run a single file, but now there seems to be another issue in replication mode: it hangs at the read-data step.

2024-03-31 17:44:23 DBG processing wildcards for S3
2024-03-31 17:44:24 INF Sling Replication [9 streams] | S3 -> LOCAL

2024-03-31 17:44:24 INF [1 / 9] running stream s3://<my-bucket>/data.parquet
2024-03-31 17:44:24 DBG Sling version: 1.2.2 (windows amd64)
2024-03-31 17:44:24 DBG type is file-file
2024-03-31 17:44:24 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"compression":"AUTO","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"delimiter":",","max_decimals":-1}
2024-03-31 17:44:24 DBG using target options: {"header":true,"compression":"AUTO","concurrency":7,"datetime_format":"auto","delimiter":",","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-03-31 17:44:24 INF reading from source file system (s3)
2024-03-31 17:44:24 DBG reading datastream from s3://<my-bucket>/data.parquet [format=parquet]
2024-03-31 17:44:24 DBG downloading to temp file on disk: C:/<my-user>/AppData/Local/Temp/parquet.temp.1711921464764.6EI.parquet
2024-03-31 17:44:25 DBG wrote 133667 bytes to C:/<my-user>/AppData/Local/Temp/parquet.temp.1711921464764.6EI.parquet

flarco commented 3 months ago

Ah, do you mind sharing your replication yaml?


flarco commented 3 months ago

Works fine for me on Mac & Windows:

source: aws_s3
target: local

defaults:
  mode: full-refresh
  object: 'file:///tmp/test1.parquet'

streams:
  test1.parquet:

2024-03-31 19:56:36 INF Sling Replication [1 streams] | aws_s3 -> local

2024-03-31 19:56:36 INF [1 / 1] running stream test1.parquet
2024-03-31 19:56:36 DBG Sling version: 1.2.2 (darwin arm64)
2024-03-31 19:56:36 DBG type is file-file
2024-03-31 19:56:36 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"compression":"AUTO","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"delimiter":",","max_decimals":-1}
2024-03-31 19:56:36 DBG using target options: {"header":true,"compression":"AUTO","concurrency":7,"datetime_format":"auto","delimiter":",","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-03-31 19:56:36 INF reading from source file system (s3)
2024-03-31 19:56:36 DBG reading datastream from s3://bucket/test1.parquet [format=parquet]
2024-03-31 19:56:36 DBG downloading to temp file on disk: /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/parquet.temp.1711925796949.oZb.parquet
2024-03-31 19:56:37 DBG wrote 48871 bytes to /var/folders/49/1zc24t595j79t5mw7_t9gtxr0000gn/T/parquet.temp.1711925796949.oZb.parquet
2024-03-31 19:56:38 INF writing to target file system (file)
2024-03-31 19:56:38 DBG writing to file:///tmp/test1.parquet [fileRowLimit=0 fileBytesLimit=0 compression=AUTO concurrency=7 useBufferedStream=false fileFormat=parquet]
2024-03-31 19:56:38 DBG wrote 49 kB: 1000 rows [493 r/s]
2024-03-31 19:56:38 INF wrote 1000 rows to file:///tmp/test1.parquet [493 r/s]
2024-03-31 19:56:38 INF execution succeeded

2024-03-31 19:56:38 INF Sling Replication Completed in 2s | aws_s3 -> local | 1 Successes | 0 Failures

cellsummer commented 3 months ago

Thanks, I will try to get more details tomorrow. Here is my replication yaml:

source: S3
target: DUCKDB

# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: '{stream_file_folder}_{stream_file_name}'

streams:
  "s3://<my-bucket>/*.parquet":
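The log lines "processing wildcards for S3" and "Sling Replication [9 streams]" show that the wildcard stream above is expanded into one stream per matching file before anything is read. The matching itself can be sketched with the standard library's path.Match; this is illustrative only, and expandStreams is a hypothetical helper rather than Sling's actual code:

```go
package main

import (
	"fmt"
	"path"
)

// expandStreams filters object keys against a glob such as
// "*.parquet", mirroring how a wildcard stream like
// "s3://bucket/*.parquet" fans out into one stream per matching file.
// Only the base name is matched, so keys in subfolders still qualify.
func expandStreams(keys []string, pattern string) []string {
	var matched []string
	for _, k := range keys {
		if ok, _ := path.Match(pattern, path.Base(k)); ok {
			matched = append(matched, k)
		}
	}
	return matched
}

func main() {
	keys := []string{"data/a.parquet", "data/b.parquet", "data/notes.csv"}
	fmt.Println(expandStreams(keys, "*.parquet"))
}
```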

flarco commented 2 months ago

Closing for now. Re-open if needed.