risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7.06k stars 581 forks source link

Athena cannot query the iceberg files sinked by Risingwave #19459

Closed lmatz closed 2 days ago

lmatz commented 2 days ago

Describe the bug

The file can be read by Risignwave's iceberg source. But it cannot be queried by Athena on AWS

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

v2.0.3

Additional context

No response

darkcofy commented 2 days ago

steps to reproduce:

CREATE SOURCE iceberg_sink_source (
     seq_id bigint,
     user_id bigint,
     user_name varchar)
WITH (
     connector = 'datagen',
     fields.seq_id.kind = 'sequence',
     fields.seq_id.start = '1',
     fields.seq_id.end = '10000000',
     fields.user_id.kind = 'random',
     fields.user_id.min = '1',
     fields.user_id.max = '10000000',
     fields.user_name.kind = 'random',
     fields.user_name.length = '10',
     datagen.rows.per.second = '20000'
 ) FORMAT PLAIN ENCODE JSON;

CREATE TABLE iceberg_sink_table (
     seq_id bigint,
     user_id bigint,
     user_name varchar)
WITH (
     connector = 'datagen',
     fields.seq_id.kind = 'sequence',
     fields.seq_id.start = '1',
     fields.seq_id.end = '10000000',
     fields.user_id.kind = 'random',
     fields.user_id.min = '1',
     fields.user_id.max = '10000000',
     fields.user_name.kind = 'random',
     fields.user_name.length = '10',
     datagen.rows.per.second = '20000'
 ) FORMAT PLAIN ENCODE JSON;

CREATE SINK risingwave_iceberg_sink_test FROM iceberg_sink_table
with (
      type='upsert',
      primary_key='seq_id',
      connector = 'iceberg',
      catalog.type = 'glue',
      catalog.name = 'awsdatacatalog',
      warehouse.path = 's3://hw-data-platform/',
      s3.access.key = 'xxx',
      s3.secret.key = 'xxxx',
      s3.region = 'eu-west-1',
      database.name='risingwave_sink',
      table.name='iceberg_sink_test',
      create_table_if_not_exists=TRUE
  );

error in athena is :

ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split s3://hw-data-platform/data/00000-0-75d35ec1-55ad-419b-99cd-229203daa761-00003.parquet (offset=47993391, length=2189420235): Range [-4, -4 + 4) out of bounds for length 0