trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
9.83k stars 2.85k forks source link

Create completed query JOIN metadata for QueryCompleteEvent #12433

Open mosiac1 opened 2 years ago

mosiac1 commented 2 years ago

The aim of this enhancement is to provide more powerful metadata from Trino query events. For my particular use case, I plan on ingesting this into OpenMetadata.

I'm interested in gathering data about how tables in Trino are joined together in a query, more generally, how data from 2 different tables is combined. This can be from joins (LEFT JOIN, RIGHT JOIN, INNER JOIN ON ...), semi-joins, index-joins, unions, selects from multiple tables, etc. This data already exists in Trino in the form of the StageInfo and ParseNode trees describing how a query will be processed.

QueryCompleteEvent.QueryMetadata.payload hold the StageInfo for a query in json serialized format. This data can be logged using the event listener. From this, I built a prototype in python that uses a visitor to gather joins between data sources (TableScanNodes) and translate them to joins between the fully qualified table and column names. This works as a POC, but it's not usable for production because of the following main limitations:

  1. TableScan.assignments is of type Map<Symbol, ColumnHandle>; because each connector implements ColumnHandle there is no way of getting a column's name (outside of guessing)
  2. The script is dependent on the format of StageInfo and all other classes used by it, so changes to Trino can easily break it.

From my point of view, this seems like something that should live inside Trino because of how tied it is to Trino's inner working.

The JOIN data can be held by QueryMetadata and the processing (QueryJoin createQueryJoins(QueryInfo)) can live inside QueryMonitor.

What does the community think of a feature like this? I'm more than glad to work on this.

findepi commented 2 years ago

cc @sopel39

sopel39 commented 2 years ago

We are planning to expose anonymized query plan to even listener, see https://github.com/trinodb/trino/pull/12345, cc @gaurav8297 @raunaqmorarka

mosiac1 commented 2 years ago

Thanks for the response!

The idea is excellent but it doesn't quite apply to the use case I'm thinking of. All info exposed by anonymized query plans is already available in the json dump of QueryInfo#getOutputStage(), but more complex and along with other (maybe unwanted) data ( I may be wrong, please correct me ). What I'm looking for is having Trino distill down query plans into easy to understand metadata about how tables relate to each other; abstracting away Trino's internal query planning

sopel39 commented 2 years ago

What I'm looking for is having Trino distill down query plans into easy to understand metadata about how tables relate to each other; abstracting away Trino's internal query planning

Yes, we want anonymized query plans to be in JSON representation, so it should be possible to easily parse them