Open mosiac1 opened 2 years ago
cc @sopel39
We are planning to expose anonymized query plan to even listener, see https://github.com/trinodb/trino/pull/12345, cc @gaurav8297 @raunaqmorarka
Thanks for the response!
The idea is excellent but it doesn't quite apply to the use case I'm thinking of. All info exposed by anonymized query plans is already available in the json dump of QueryInfo#getOutputStage()
, but more complex and along with other (maybe unwanted) data ( I may be wrong, please correct me ). What I'm looking for is having Trino distill down query plans into easy to understand metadata about how tables relate to each other; abstracting away Trino's internal query planning
What I'm looking for is having Trino distill down query plans into easy to understand metadata about how tables relate to each other; abstracting away Trino's internal query planning
Yes, we want anonymized query plans to be in JSON representation, so it should be possible to easily parse them
The aim of this enhancement is to provide more powerful metadata from Trino query events. For my particular use case, I plan on ingesting this into OpenMetadata.
I'm interested in gathering data about how tables in Trino are joined together in a query, more generally, how data from 2 different tables is combined. This can be from joins (
LEFT JOIN
,RIGHT JOIN
,INNER JOIN ON ...
), semi-joins, index-joins, unions, selects from multiple tables, etc. This data already exists in Trino in the form of theStageInfo
andParseNode
trees describing how a query will be processed.QueryCompleteEvent.QueryMetadata.payload
hold theStageInfo
for a query in json serialized format. This data can be logged using the event listener. From this, I built a prototype in python that uses a visitor to gather joins between data sources (TableScanNode
s) and translate them to joins between the fully qualified table and column names. This works as a POC, but it's not usable for production because of the following main limitations:TableScan.assignments
is of typeMap<Symbol, ColumnHandle>
; because each connector implementsColumnHandle
there is no way of getting a column's name (outside of guessing)StageInfo
and all other classes used by it, so changes to Trino can easily break it.From my point of view, this seems like something that should live inside Trino because of how tied it is to Trino's inner working.
The JOIN data can be held by
QueryMetadata
and the processing (QueryJoin createQueryJoins(QueryInfo)
) can live insideQueryMonitor
.What does the community think of a feature like this? I'm more than glad to work on this.