trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Write Delta Lake "operationMetrics" Transaction Log Field #12005

Open homar opened 2 years ago

homar commented 2 years ago

Delta Lake has a commit field called operationMetrics that holds some statistics, such as the number of rows deleted. It's not in the protocol definition, but it could be useful to include. See DeltaLakeMetadata

findepi commented 2 years ago

> It's not in the protocol definition

what should go into this field then?

cc @vkorukanti

vkorukanti commented 2 years ago

@findepi These are the operation metrics for each operation. Let me get back to you on whether these should be part of the Protocol.

findepi commented 2 years ago

> ... whether these should be part of the Protocol.

cc @claudiusli

also cc @alexjo2144 @ilfrin

alexjo2144 commented 2 years ago

@vkorukanti any new thoughts on this w.r.t. https://databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html ?

vkorukanti commented 2 years ago

Apologies for not getting back on time. The Delta-on-Spark open-source project already has metrics defined here, written as part of the commit. Regarding whether they should be part of the protocol: ideally they should be, but we haven't documented them yet. They evolve frequently based on need. Also, these metrics are currently a bag of JSON fields, so any implementation is expected to handle missing or extra fields.
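To make the "bag of JSON fields" point concrete, here is a minimal sketch of a reader that tolerates missing and extra metrics. It is not Trino's or Delta's implementation. The helper name `read_operation_metrics`, the sample commit line, the metric names, and the assumption that metric values arrive as strings are all illustrative and should be checked against real commit files.

```python
import json


def read_operation_metrics(commit_json_line, wanted=("numDeletedRows", "numRemovedFiles")):
    """Return the requested metrics as ints, or None where absent or unparsable.

    Hypothetical helper: unknown metrics are ignored and missing ones do not fail.
    """
    action = json.loads(commit_json_line)
    # A log line may hold a different action (add, remove, ...), so default to empty.
    commit_info = action.get("commitInfo") or {}
    metrics = commit_info.get("operationMetrics") or {}

    result = {}
    for name in wanted:
        raw = metrics.get(name)
        try:
            # Assumption: values are numeric strings such as "3".
            result[name] = int(raw)
        except (TypeError, ValueError):
            # Missing (None) or non-numeric value: report as unknown.
            result[name] = None
    return result


# Illustrative commit line, not copied from a real table.
sample = json.dumps({
    "commitInfo": {
        "operation": "DELETE",
        "operationMetrics": {"numDeletedRows": "3", "someFutureMetric": "x"},
    }
})
metrics = read_operation_metrics(sample)
```

Here the extra `someFutureMetric` entry is ignored and the absent `numRemovedFiles` comes back as `None` instead of raising, which is the behavior a reader needs if the set of metrics keeps changing.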

findinpath commented 1 year ago

The operation metrics are also listed in the $history metadata table

https://trino.io/docs/current/connector/delta-lake.html#history-table

https://github.com/trinodb/trino/blob/160400af85d2bef4e93d746fbb2abbf249581b2c/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeHistoryTable.java#L69