Open marcinsiennicki95 opened 4 months ago
Pinging code owners:
exporter/file: @atingchen
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@jmacd Is this possible with the current state of Arrow? I ask because I found this in the documentation:
https://github.com/open-telemetry/otel-arrow
@marcinsiennicki95 there is a connection between Arrow and Parquet, but it is not an automatic translation. The way we have structured the OTel-Arrow data stream, there are multiple logical tables being exchanged within an Arrow IPC payload, both because of varying schemas within the telemetry and because of shared data references. These multiple logical tables would naturally translate into multiple Parquet files.
When writing tables of shared data across an OTel-Arrow stream, the OTel-Arrow components repeat shared data once per stream, whereas in a database system it would be possible to refer to past data in the system. The tradeoffs involved between writing across the network and constructing a database are large, so to make progress on this issue we would have to settle on what the Parquet schema looks like.
cc/ @lquerel
(Teaser: I've been playing around with a Parquet-first telemetry data store, and it has helped me come to concrete opinions about this problem. https://github.com/jmacd/duckpond)
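To make the "multiple logical tables" point above concrete, here is a minimal sketch in plain Python. It is not the OTel-Arrow schema or any collector API: the record fields, the `split_into_tables` helper, and the `resource_id` column are all hypothetical, chosen only to show how records with varying schemas and shared resource data could end up as several separate tables.

```python
from collections import defaultdict

# Hypothetical flattened telemetry records; field names are illustrative only.
records = [
    {"kind": "span", "name": "GET /a", "duration_ms": 12,
     "resource": {"service.name": "api"}},
    {"kind": "span", "name": "GET /b", "duration_ms": 7,
     "resource": {"service.name": "api"}},
    {"kind": "log", "body": "started", "severity": "INFO",
     "resource": {"service.name": "worker"}},
]


def split_into_tables(records):
    """Group records into one table per schema and factor shared
    resource attributes out into their own table (assumed layout)."""
    resources = {}               # frozen resource attributes -> resource id
    tables = defaultdict(list)   # schema (sorted column names) -> rows
    for rec in records:
        key = tuple(sorted(rec["resource"].items()))
        # Reuse the id if this resource was already seen, else assign the next one.
        rid = resources.setdefault(key, len(resources))
        row = {k: v for k, v in rec.items() if k != "resource"}
        row["resource_id"] = rid
        tables[tuple(sorted(row))].append(row)
    resource_table = [
        {"resource_id": rid, **dict(key)} for key, rid in resources.items()
    ]
    return resource_table, dict(tables)


resource_table, tables = split_into_tables(records)
print(len(tables), "telemetry tables;", len(resource_table), "shared resources")
```

In this sketch each entry in `tables`, plus `resource_table`, would become its own Parquet file. Which columns exist and how shared data is referenced across files is exactly the schema question that would have to be settled first.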
Thanks for the answer. I had a conversation on the OpenTelemetry Slack channel and found out that @atoulme was working on the Parquet format.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Not anymore. As noted, the parquetexporter was not adopted, and we are working on Apache Arrow instead.
Component(s)
exporter/file
Is your feature request related to a problem? Please describe.
Parquet Format: Parquet is a columnar storage file format optimized for big data processing frameworks. It provides efficient data compression and encoding schemes, enhancing performance and reducing storage costs. Telemetry data written in Parquet format is stored in columns, making it faster to read and query specific fields.
Delta Format: Delta Lake is an open-source storage layer that brings ACID transactions to big data workloads. Delta format combines the reliability of data lakes with the performance of data warehouses. Writing telemetry data in Delta format allows for scalable and reliable data processing, supporting complex data pipelines and real-time analytics.
Describe the solution you'd like
Ability to write in Parquet or Delta format
Describe alternatives you've considered
No response
Additional context
No response