open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/file] Add possibility to write telemetry in Parquet or Delta format #33807

Open marcinsiennicki95 opened 4 months ago

marcinsiennicki95 commented 4 months ago

Component(s)

exporter/file

Is your feature request related to a problem? Please describe.

Parquet Format: Parquet is a columnar storage file format optimized for big data processing frameworks. It provides efficient data compression and encoding schemes, enhancing performance and reducing storage costs. Telemetry data written in Parquet format is stored in columns, making it faster to read and query specific fields.

Delta Format: Delta Lake is an open-source storage layer that brings ACID transactions to big data workloads. Delta format combines the reliability of data lakes with the performance of data warehouses. Writing telemetry data in Delta format allows for scalable and reliable data processing, supporting complex data pipelines and real-time analytics.

Describe the solution you'd like

Ability to write in Parquet or Delta format

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 4 months ago

Pinging code owners:

marcinsiennicki95 commented 4 months ago

@jmacd Is this possible with the current state of Arrow? I found this in the documentation:

https://github.com/open-telemetry/otel-arrow

  1. Output OpenTelemetry data to the Parquet file format, part of the Apache Arrow ecosystem
jmacd commented 4 months ago

@marcinsiennicki95 there is a connection between Arrow and Parquet, but it is not an automatic translation. The way we have structured the OTel-Arrow data stream, there are multiple logical tables being exchanged within an Arrow IPC payload, both because of varying schemas within the telemetry and because of shared data references. These multiple logical tables would naturally translate into multiple Parquet files.

When writing tables of shared data across an OTel-Arrow stream, the OTel-Arrow components will repeat shared data once per stream - while in a database system it would be possible to refer to past data in the system. The tradeoffs involved between writing across the network and constructing a database are large, so to make progress on this issue we would have to settle on what the Parquet schema looks like.

cc/ @lquerel

jmacd commented 4 months ago

(Teaser: I've been playing around with a Parquet-first telemetry data store; it has helped me come to concrete opinions about this problem. https://github.com/jmacd/duckpond)

marcinsiennicki95 commented 4 months ago

Thanks for the answer. I had a conversation on the OpenTelemetry Slack channel and found out that @atoulme was working on the Parquet format.

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 1 month ago

Not anymore. As noted, the parquetexporter was not adopted, and we are working on Apache Arrow instead.