substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

[Proposal] Substrait-to-Atlas: a tool to generate Atlas entities and relations representing data lineage information in Substrait plans #208

Closed ashvina closed 1 year ago

ashvina commented 2 years ago

Background:

  1. Data lineage encodes information that connects datasets to the workflows that generate them. Put differently, data lineage is a graph that connects input data objects to output data objects. This information is crucial for data governance platforms like Apache Atlas.
  2. Apache Atlas is the most popular platform providing open metadata management and governance capabilities. Atlas relies on query processors to generate and publish lineage information through its REST API. For example, SAC (the Spark Atlas Connector) is a popular tool that listens to Spark query execution events, extracts lineage, and sends the extracted information to Atlas. The now “standard” API is even supported by commercial offerings like Microsoft Purview and Google Catalog.

Why Substrait-to-Atlas tool:

Substrait provides a well-defined, cross-language specification for data compute operations, with the goal of consistent interpretation of their semantics. This standardization can offer a wide range of “bonus benefits” through a library of tools that integrate a data engine with other platforms in the data ecosystem. Out-of-the-box integration with Atlas would be a good candidate. In other words, any data engine that communicates using Substrait plans could easily be integrated with the Atlas governance platform. Data engines could then avoid building a custom Atlas hook and duplicating the effort.

How:

I’ve been studying the Substrait specification and playing around with the substrait-java code. My observation is that the information about tables and columns accessed (input-data-entities) and affected (output-data-entities) by the query (process-entity) execution is readily available in Substrait plans. For instance, the “directReference” field in expressions and measures corresponds to the input data entities from a lineage perspective. This information can be used to construct Atlas entities and their input and output relations.

  1. An initial version of the tool could be developed in Java and could demonstrate results on TPCH queries (some of the current unit tests are based on TPCH). After that, it can be made smarter incrementally to cover other benchmarks, and implementations in other languages can be provided.
  2. To that end, I’ve started building a prototype that depends on the substrait-java module. BTW, I have prior experience building lineage extractors, and I strongly believe this tool will be useful. If there is interest and a similar effort has not been started elsewhere, I can start working on a design doc and begin incremental development of the tool.
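
To make the idea concrete, here is a minimal sketch in Python (not the Java prototype mentioned above) of table-level lineage extraction. It rests on several assumptions: the plan is available as a JSON-like dict in which `read` and `write` relations carry a `namedTable.names` list, the sample plan is hand-written and heavily simplified, and the output uses the generic Atlas `DataSet` and `Process` types in the shape of a bulk entity payload. The helper names (`iter_nodes`, `table_names`, `build_atlas_payload`) are hypothetical. Column-level lineage via “directReference” is left out, since it would require resolving field indexes against each relation's schema.

```python
from typing import Any, Dict, Iterator, List


def iter_nodes(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict nested anywhere inside a JSON-like plan."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def table_names(plan: Dict[str, Any], rel_key: str) -> List[str]:
    """Collect dotted table names from relations stored under rel_key.

    Assumes each such relation has the shape {"namedTable": {"names": [...]}}.
    """
    names: List[str] = []
    for node in iter_nodes(plan):
        rel = node.get(rel_key)
        if isinstance(rel, dict) and "namedTable" in rel:
            qualified = ".".join(rel["namedTable"]["names"])
            if qualified not in names:
                names.append(qualified)
    return names


def dataset_ref(qualified_name: str) -> Dict[str, Any]:
    """Reference a dataset entity by its unique qualifiedName."""
    return {
        "typeName": "DataSet",
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_atlas_payload(plan: Dict[str, Any], process_name: str) -> Dict[str, Any]:
    """Build one process entity plus the dataset entities it reads and writes."""
    inputs = table_names(plan, "read")
    outputs = table_names(plan, "write")
    datasets = [
        {
            "typeName": "DataSet",
            "attributes": {"qualifiedName": n, "name": n.split(".")[-1]},
        }
        for n in dict.fromkeys(inputs + outputs)  # de-duplicate, keep order
    ]
    process = {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": process_name,
            "name": process_name,
            "inputs": [dataset_ref(n) for n in inputs],
            "outputs": [dataset_ref(n) for n in outputs],
        },
    }
    return {"entities": datasets + [process]}


# Hand-written, simplified plan (an assumption, not real producer output):
# an aggregate over tpch.lineitem written to warehouse.order_totals.
sample_plan = {
    "relations": [{"root": {"input": {"write": {
        "namedTable": {"names": ["warehouse", "order_totals"]},
        "input": {"aggregate": {"input": {"read": {
            "namedTable": {"names": ["tpch", "lineitem"]},
        }}}},
    }}}}]
}

payload = build_atlas_payload(sample_plan, "query-1")
```

In this sketch `payload` holds two dataset entities and one process entity whose `inputs` and `outputs` point at them. Publishing it to an Atlas server is left out; a real tool would also need to map tables to the concrete Atlas types used by the deployment rather than the generic ones assumed here.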

I am still new to the project, so let me know if there's a preference on where the contribution should be committed.

GavinRay97 commented 2 years ago

I've never heard of Apache Atlas before, but this sounds useful in my naive understanding.

Could you explain for folks like me that don't have experience with these tools roughly how the process for extracting and using this information would go?

Like, how does Atlas interact with Substrait plans to create this lineage information?

curino commented 2 years ago

@ashvina is the best person to discuss the details. My two cents: this is very useful. Atlas is very popular for governance and lineage tracking. This integration would make it particularly easy for any system that integrates with Substrait to get high-quality lineage extraction cheaply. As we work on making Substrait widely accepted, I think this type of tooling and integration is very important for increasing the overall "why do I pick Substrait" value prop.

jacques-n commented 2 years ago

This is an exciting proposal. Any update on progress?

westonpace commented 1 year ago

I'm closing this as I think the initial discussion has died down. Feel free to reopen in the future if there are updates or further questions. Or, if a tool is developed, a PR can be opened for the tool itself.