trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Enable a pub-sub mechanism for cross-plugin, plugin-core communication #13298

Open Riddle4045 opened 2 years ago

Riddle4045 commented 2 years ago

Scenario

We have a plugin that logs QueryCompleted, QueryCreated and SplitCompleted events to HDFS as ORC files. We mount external tables and views on top of this data, which is very useful for Ops management.

We are in the process of writing a RangerSystemAccessControl plugin and want to reuse the plugin mentioned earlier to log audit events to HDFS. Currently the only solution is to bundle and ship the two plugins together.

It would be great if the EventListener framework in the core engine were a pub-sub mechanism, which would allow broadcasting, multicasting, or even peer-to-peer communication between plugins and the core engine.
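
To make the request concrete, here is a minimal in-process sketch of a topic-based pub-sub bus. `PluginEventBus`, its methods, and the topic names are hypothetical; nothing like this exists in Trino's SPI today, and a real design would also have to deal with plugin isolation, ordering, and failure handling.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical sketch: this type is NOT part of Trino's SPI.
final class PluginEventBus {
    // Topic name -> handlers subscribed to that topic.
    private final Map<String, List<Consumer<Object>>> subscribers = new ConcurrentHashMap<>();

    // Register a handler for one topic. Several plugins may subscribe to the same topic.
    void subscribe(String topic, Consumer<Object> handler) {
        subscribers.computeIfAbsent(topic, key -> new CopyOnWriteArrayList<>()).add(handler);
    }

    // Deliver the event synchronously to every subscriber of the topic.
    // Returns the number of handlers that were invoked.
    int publish(String topic, Object event) {
        List<Consumer<Object>> handlers = subscribers.getOrDefault(topic, List.of());
        for (Consumer<Object> handler : handlers) {
            handler.accept(event);
        }
        return handlers.size();
    }
}
```

In this sketch, the HDFS logging plugin would call `subscribe("audit", ...)` and the access-control plugin would call `publish("audit", event)`, so neither needs to be bundled with the other.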

Slack thread - https://trinodb.slack.com/archives/CP1MUNEUX/p1658428579937089

Riddle4045 commented 2 years ago

Related reviews: #13297

mosiac1 commented 2 years ago

Hello!

I built the http-event-listener (https://github.com/trinodb/trino/tree/master/plugin/trino-http-event-listener).

We are using it inside Bloomberg to collect usage analytics for our Trino clusters. The pipeline goes:

Trino (through the event listener) -> (HTTP POST) https://github.com/bloomberg/datalake-query-ingester -> (PUBLISH) Kafka Pipe -> (CONSUME) https://github.com/bloomberg/datalake-query-ingester -> Postgres.

The events (only QueryCompleted events are supported) are processed so that they fit a relational database, so we end up with data that is easy to understand and use, even for people who are not Trino admins.

In your case, you could use another service in place of datalake-query-ingester, or skip the Kafka pipe entirely (we use it mostly for reliability and for the option of having more services hook into it).

The system is designed to be loosely coupled, so that it can easily hook into existing systems and be extended with pieces that need to stay internal.

From this, we get a full "audit" because we can easily track every column access for every user - reads, writes, executes. This also gets us a lot of usage metrics and metadata.
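
As a rough illustration of the "fit a relational database" step, the sketch below flattens one query event into one row per accessed column. The class name, the simplified inputs (query ID, user, and a table-to-columns map), and the row layout are all assumptions made for this example; they are not the actual datalake-query-ingester schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the inputs and row layout are illustrative assumptions only.
final class ColumnAccessFlattener {
    // Produces one {queryId, user, table, column} row per accessed column,
    // which maps directly onto a single relational table.
    static List<String[]> flatten(String queryId, String user, Map<String, List<String>> columnsByTable) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : columnsByTable.entrySet()) {
            for (String column : entry.getValue()) {
                rows.add(new String[] {queryId, user, entry.getKey(), column});
            }
        }
        return rows;
    }
}
```

With rows in this shape, "which users read column X" becomes a plain `SELECT` with a `WHERE` clause.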

The EventListener interface could get new handlers for dedicated audit events, which could be more detailed and originate from the actual SystemAccessControl. This would be very useful for tracking access denials or monitoring sensitive data, and with this pipeline I imagine it would not be too difficult to connect to existing ticketing and alerting systems.
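
A sketch of what such a handler might look like is below. Trino's EventListener declares its handlers (`queryCreated`, `queryCompleted`, `splitCompleted`) as default methods, so a new handler could in principle be added the same way without breaking existing listeners. `AccessDeniedEvent`, `AuditEventListener`, and `accessDenied` are hypothetical names defined locally for this example; they are not part of the Trino SPI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event payload; not part of Trino's SPI.
final class AccessDeniedEvent {
    final String user;
    final String action;
    final String resource;

    AccessDeniedEvent(String user, String action, String resource) {
        this.user = user;
        this.action = action;
        this.resource = resource;
    }
}

// Hypothetical stand-in for an EventListener-style interface with an added audit handler.
interface AuditEventListener {
    // Default no-op, so listeners that do not care about audits need no changes.
    default void accessDenied(AccessDeniedEvent event) {}
}

// Example listener that keeps denials in memory; a real one might forward them over HTTP or to HDFS.
final class RecordingAuditListener implements AuditEventListener {
    final List<String> lines = new ArrayList<>();

    @Override
    public void accessDenied(AccessDeniedEvent event) {
        lines.add(event.user + " denied " + event.action + " on " + event.resource);
    }
}
```

The access-control implementation would be the caller here: each time it rejects a request, it would build an `AccessDeniedEvent` and pass it to the registered listeners.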

Riddle4045 commented 2 years ago

@mosiac1 Thanks for the details! I am trying to address a more generic problem; SystemAccessControl is just an example scenario that demonstrates the gap.

I am interested in a framework that enables two-way communication between the engine and plugins, and across plugins. See my latest comment in the Slack thread linked in the description.