Trino Vault demo needs to demonstrate waterfall capability

os-climate / data-platform-demo

Apache License 2.0


Trino Vault demo needs to demonstrate waterfall capability #37

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

Users value the ability to select which providers will supply what data, and to compare and contrast the results of using one data source in preference to another. This example will be elaborated as needed.

The Trino vault demo should make it possible for users to select, from among three providers A, B, and C, the data for metrics U, V, and W used to compute results X and Y. Think of the waterfall as a lambda function that selects the "best" source (for some definition of best) from among a set of available choices.
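As a rough illustration of the "waterfall as a lambda" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample values, the names `PROVIDER_DATA`, `waterfall`, and `compute_results`, the choice of "best" as "first provider in the user's priority list that has a value", and the placeholder formulas for X and Y.

```python
from typing import Optional

# Hypothetical sample data: provider -> metric -> value (None means "not supplied").
PROVIDER_DATA: dict[str, dict[str, Optional[float]]] = {
    "A": {"U": 10.0, "V": None, "W": None},
    "B": {"U": 12.0, "V": 5.0, "W": None},
    "C": {"U": 11.0, "V": 6.0, "W": 2.0},
}


def waterfall(metric, priority, data=PROVIDER_DATA):
    """Return (value, provider) from the first provider in `priority` that supplies `metric`.

    "Best" is assumed here to mean "highest-priority provider with a non-null value".
    """
    for provider in priority:
        value = data.get(provider, {}).get(metric)
        if value is not None:
            return value, provider
    return None, None


def compute_results(priority):
    """Compute placeholder results X and Y and record which provider fed each metric."""
    picks = {metric: waterfall(metric, priority) for metric in ("U", "V", "W")}
    u, v, w = (picks[metric][0] for metric in ("U", "V", "W"))
    # X = U + V and Y = U * W are stand-in formulas, not anything defined in this issue.
    # The sample data always yields a value; real code would need to handle None here.
    return {
        "X": u + v,
        "Y": u * w,
        "sources": {metric: picks[metric][1] for metric in picks},
    }


prefer_a = compute_results(["A", "B", "C"])
prefer_c = compute_results(["C", "B", "A"])
```

Running both priority orders side by side is what lets a user compare and contrast the results of preferring one source over another; the `sources` entry records which provider actually supplied each metric.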

This should build on https://github.com/os-climate/data-platform-demo/issues/36. The goal is to show what it would look like for data providers (A, B, and C) to provision data to be used in a waterfall prioritization, what it would be like for the user to specify a preference, and how the platform performs the computation while enforcing data access management and preserving data lineage.

erikerlandson commented 2 years ago

I suspect that this kind of business logic will be implemented either in data + SQL, for example: https://trino.io/docs/current/functions/conditional.html#case

or in the surrounding code (e.g. Python).

However, this will have some interaction with data-vault concepts, since which db values are available during a query may be a function of which columns or rows the user can see.

Note that row-level access will tend to "fail" silently, by returning fewer results or none. Attempting to access a column that one does not have permission to read will result in a SQL failure exception. So there is some asymmetry in the logic depending on whether one is restricted along the row axis or the column axis.
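The asymmetry can be sketched in Python. This is only a simulation under stated assumptions: SQLite (via the standard-library `sqlite3` module) stands in for Trino, and two hypothetical views, `metrics_rows_b` and `metrics_no_v`, stand in for row-level and column-level access rules. The exact exception type and message in Trino will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics (company TEXT, provider TEXT, u REAL, v REAL);
    INSERT INTO metrics VALUES ('acme', 'A', 10.0, 1.0), ('acme', 'B', 12.0, 2.0);
    -- Simulated row-level restriction: this user may only see provider B's rows.
    CREATE VIEW metrics_rows_b AS SELECT * FROM metrics WHERE provider = 'B';
    -- Simulated column-level restriction: this user may not see column v.
    CREATE VIEW metrics_no_v AS SELECT company, provider, u FROM metrics;
""")


def query(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Row axis: asking for rows the user cannot see succeeds but returns nothing.
row_result = query("SELECT u FROM metrics_rows_b WHERE provider = 'A'")

# Column axis: asking for a column the user cannot see raises an exception.
try:
    query("SELECT v FROM metrics_no_v")
    column_error = None
except sqlite3.OperationalError as exc:
    column_error = str(exc)
```

The practical consequence for waterfall code is that a row restriction has to be detected by checking for empty or missing results, while a column restriction has to be handled with exception handling around the query.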

MichaelTiemannOSC commented 2 years ago

For row-level data: COALESCE (ftw)

For column-level data, you are right. But I also think that it's the row-level case that makes the most sense for the waterfall.
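A minimal sketch of the row-level COALESCE waterfall, driven from Python. The assumptions are loud ones: SQLite stands in for Trino so the example is runnable, and the tables `companies`, `provider_a`, `provider_b`, `provider_c` and their contents are hypothetical. The query itself uses only `LEFT JOIN` and `COALESCE`, both of which Trino also supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (company TEXT PRIMARY KEY);
    CREATE TABLE provider_a (company TEXT, u REAL);
    CREATE TABLE provider_b (company TEXT, u REAL);
    CREATE TABLE provider_c (company TEXT, u REAL);
    INSERT INTO companies VALUES ('acme'), ('globex'), ('initech');
    INSERT INTO provider_a VALUES ('acme', 10.0);
    INSERT INTO provider_b VALUES ('acme', 12.0), ('globex', 7.0);
    INSERT INTO provider_c VALUES ('globex', 8.0), ('initech', 3.0);
""")

# The argument order of COALESCE is the waterfall priority: A, then B, then C.
# A row that is missing (or hidden by row-level access) becomes NULL after the
# LEFT JOIN, so COALESCE falls through to the next provider.
WATERFALL_SQL = """
    SELECT co.company, COALESCE(a.u, b.u, c.u) AS u
    FROM companies co
    LEFT JOIN provider_a a ON a.company = co.company
    LEFT JOIN provider_b b ON b.company = co.company
    LEFT JOIN provider_c c ON c.company = co.company
    ORDER BY co.company
"""

rows = conn.execute(WATERFALL_SQL).fetchall()
```

Changing the user's preference amounts to reordering the arguments to `COALESCE`, which is part of why the row-level case fits the waterfall so naturally.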

MichaelTiemannOSC commented 2 years ago

To Erik's point that the implementation of a waterfall really is something that happens more on the user front than the data provider front, the logical facility would be to create MATERIALIZED VIEWS that do the COALESCE operations. But not all connectors support materialized views: https://github.com/os-climate/data-platform-demo/issues/35

MichaelTiemannOSC commented 2 years ago

Thinking more about this in the context of the data pipeline architecture and the concept of precomputed data, it occurs to me that perhaps the most prominent use case would be to effect the waterfall as part of the initial data processing phase, post ingest and pre-analysis. A user would stipulate a particular waterfall prioritization and the system would then build and validate the data for that waterfall. If the user wants to change the prioritization, a new dataset would be built and validated according to the new rules. What would NOT happen is a delay in selecting data until the moment the analytic tool needs a specific piece of data--that's too late for validation and lineage maintenance.
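One way to picture the build-and-validate step is the Python sketch below. It is not the prototype mentioned in this thread; the `Record` type, the `RAW` sample data, and the `build_dataset` and `validate` helpers are all hypothetical, and the validation rules shown (one record per key, provider must be in the stated priority list) are just examples of checks that could run before analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    company: str
    metric: str
    value: float
    provider: str  # lineage: which provider the value came from


# Hypothetical ingested data: provider -> (company, metric) -> value.
RAW = {
    "A": {("acme", "U"): 10.0},
    "B": {("acme", "U"): 12.0, ("acme", "V"): 5.0},
}


def build_dataset(priority, raw=RAW):
    """Resolve the waterfall once, up front, keeping the chosen provider as lineage."""
    keys = sorted({key for provider in priority for key in raw.get(provider, {})})
    dataset = []
    for company, metric in keys:
        for provider in priority:
            value = raw.get(provider, {}).get((company, metric))
            if value is not None:
                dataset.append(Record(company, metric, value, provider))
                break
    return dataset


def validate(dataset, priority):
    """Example checks: each key appears once and every provider is in the priority list."""
    seen = set()
    for rec in dataset:
        key = (rec.company, rec.metric)
        if key in seen:
            raise ValueError(f"duplicate record for {key}")
        if rec.provider not in priority:
            raise ValueError(f"unexpected provider {rec.provider!r} for {key}")
        seen.add(key)
    return True


# A change in prioritization produces a new dataset that is built and validated again.
dataset_ab = build_dataset(["A", "B"])
dataset_ba = build_dataset(["B", "A"])
valid_ab = validate(dataset_ab, ["A", "B"])
valid_ba = validate(dataset_ba, ["B", "A"])
```

Because each record carries its provider, the analytic tool never has to choose a source at read time; it reads an already-validated dataset whose lineage is explicit.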

If the above is agreed, I can make that a next step in the prototype I'm presently building.

@caldeirav