Parquet example:
❯ parquet meta ~/Downloads/tokens_parquet2_ethereum_tokens_000000000000.parquet
File path: /Users/alinaglumova/Downloads/tokens_parquet2_ethereum_tokens_000000000000.parquet
Created by: parquet-cpp-arrow version 13.0.0
Properties: (none)
Schema:
message schema {
  required binary address (STRING);
  optional binary symbol (STRING);
  optional binary name (STRING);
  optional binary decimals (STRING);
  optional binary total_supply (STRING);
  required int64 block_timestamp (TIMESTAMP(MICROS,false));
  required int64 block_number;
  required binary block_hash (STRING);
}
Row group 0: count: 1068 159.64 B records start: 4 total(compressed): 166.502 kB total(uncompressed): 166.502 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
address BINARY _ _ R 1068 47.51 B 0 "0x005c97569a24303e9ba6de6..." / "0xffffe5b9cb42b4996997c92..."
symbol BINARY _ _ R 1068 6.21 B 10 "" / "��"
name BINARY _ _ R 1068 10.27 B 10 "" / "����������"
decimals BINARY _ _ R 1068 0.47 B 65 "0" / "9"
total_supply BINARY _ _ R 1068 6.24 B 9 "0" / "9999999999999999999900000..."
block_timestamp INT64 _ _ R 1068 9.32 B 0 "2024-02-08T15:50:47.000000" / "2024-09-11T06:57:23.000000"
block_number INT64 _ _ R 1068 9.32 B 0 "19184445" / "20725728"
block_hash BINARY _ _ R 1068 70.31 B 0 "0x0002376d87ff1bbe5310679..." / "0xffae2542617a1ee9204fb27..."
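
For reference, the same metadata can be inspected programmatically. A minimal sketch using pyarrow (illustrative only; it assumes pyarrow is installed and the file has been downloaded locally, with the path taken from the output above):

import pyarrow.parquet as pq

# Path taken from the `parquet meta` output above (adjust as needed).
path = "/Users/alinaglumova/Downloads/tokens_parquet2_ethereum_tokens_000000000000.parquet"

pf = pq.ParquetFile(path)
print(pf.metadata)        # created_by, num_rows, num_row_groups, ...
print(pf.schema_arrow)    # Arrow view of the schema shown above

# Per-row-group row counts and sizes; relevant since the issue concerns large Parquet files.
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes uncompressed")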
Hi there,
We've released a new version today, v8.1.10, with some major changes to the Parquet handling to resolve this issue with large Parquet files.
https://github.com/lensesio/stream-reactor/releases/tag/8.1.10
If this is still useful to you, please would you give it a go and report back?
Kind regards
David.
What version of the Stream Reactor are you reporting this issue for?
Release 8.1.4
Are you running the correct version of Kafka/Confluent for the Stream Reactor release?
I am running Aiven for Apache Kafka 3.8.0. My Kafka Connect is deployed using Strimzi on Kubernetes.
Do you have a supported version of the data source/sink, i.e. Cassandra 3.0.9?
Yes, I am using GCS (Google Cloud Storage) as the data source and Kafka as the sink.
Have you read the docs?
Yes, I have read the documentation.
What is the expected behaviour?
I expect the connector to transfer Parquet files from GCS to a Kafka topic.
What was observed?
I encountered the following error:
java.io.EOFException: Reached the end of stream with 8861 bytes left to read
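
As a sanity check, a minimal pyarrow sketch (illustrative only, not part of the original report) can read the same object end to end after downloading it from the bucket; if that succeeds, the file itself is intact and the EOFException points at how the connector streams the object from GCS rather than at the file:

import pyarrow.parquet as pq

# Hypothetical local copy of the object the connector failed on.
path = "tokens_parquet2_ethereum_tokens_000000000000.parquet"

table = pq.read_table(path)   # reads every row group; raises if the file is truncated or corrupt
print(table.num_rows, "rows read successfully")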
What is your Connect cluster configuration (connect-avro-distributed.properties)?
What is your connector properties configuration (my-connector.properties)?
Please provide full log files (redact any sensitive information)