Closed ashvina closed 1 year ago
I've never heard of Apache Atlas before, but this sounds useful in my naive understanding.
Could you explain, for folks like me who don't have experience with these tools, roughly how the process for extracting and using this information would go?
Like, how does Atlas interact with Substrait plans to create this lineage information?
@ashvina is the best person to discuss the details. My 2 cents is that this is very useful. Atlas is very popular for governance/lineage tracking. This integration would make it particularly easy for any system that integrates with Substrait to get high-quality lineage extraction for cheap. As we work on making Substrait widely accepted, I think this type of tooling/integration is very important to increase the overall "why do I pick Substrait" value prop.
This is an exciting proposal. Any update on progress?
I'm closing this as I think the initial discussion has died down. Feel free to reopen in the future if there are updates or further questions. Or, if a tool is developed, a PR can be opened for the tool itself.
Background:
Why a Substrait-to-Atlas tool:
Substrait provides a well-defined, cross-language specification for data compute operations, with the goal of getting a consistent interpretation of their semantics. This standardization can offer a wide range of “bonus benefits” through a library of tools that integrate a data engine with other platforms in the data ecosystem. Out-of-the-box integration with Atlas would be a good candidate. In other words, any data engine that communicates using Substrait plans could easily be integrated with the Atlas governance platform. Such engines would then avoid building a custom Atlas hook and duplicating that effort.
How:
I’ve been studying the Substrait specification and playing around with the substrait-java code. My observation is that the information about the tables and columns accessed (input data entities) and affected (output data entities) by the execution of a query (process entity) is readily available in Substrait plans. For instance, the “directReference” fields in expressions and measures correspond to the input data entities from a lineage perspective. This information can be used to construct Atlas entities and their input and output relations.
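To make the idea concrete, here is a minimal sketch in Python rather than substrait-java. It assumes the plan is available as a dict in Substrait's protobuf JSON form (keys such as `read`, `write`, `namedTable`, and `directReference`), and that lineage is reported with Atlas's generic `Process` and `DataSet` types. The helper names (`walk`, `extract_lineage`, `to_atlas_process`), the sample plan, and the process name are all hypothetical. A real tool would probably define engine-specific Atlas types and track each relation's output schema in order to map field indexes back to column names, which this sketch does not do.

```python
def walk(node):
    """Yield every dict nested anywhere inside a JSON-like structure."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def extract_lineage(plan):
    """Collect input tables, output tables, and referenced field indexes."""
    inputs, outputs, field_refs = [], [], []
    for node in walk(plan):
        read = node.get("read")
        if isinstance(read, dict) and "namedTable" in read:
            inputs.append(".".join(read["namedTable"]["names"]))
        write = node.get("write")
        if isinstance(write, dict) and "namedTable" in write:
            outputs.append(".".join(write["namedTable"]["names"]))
        ref = node.get("directReference")
        if isinstance(ref, dict) and "structField" in ref:
            # Protobuf JSON omits default values, so field 0 may be absent.
            field_refs.append(ref["structField"].get("field", 0))
    return inputs, outputs, field_refs


def dataset_ref(qualified_name):
    """Reference an existing Atlas DataSet by its unique qualifiedName."""
    return {"typeName": "DataSet",
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def to_atlas_process(name, inputs, outputs):
    """Build an Atlas-style Process entity linking inputs to outputs."""
    return {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": name,
            "name": name,
            "inputs": [dataset_ref(t) for t in inputs],
            "outputs": [dataset_ref(t) for t in outputs],
        },
    }


# Hypothetical plan: read sales.orders, project one column, write the result.
plan = {"relations": [{"root": {"input": {"write": {
    "namedTable": {"names": ["sales", "daily_totals"]},
    "input": {"project": {
        "input": {"read": {
            "namedTable": {"names": ["sales", "orders"]},
            "baseSchema": {"names": ["order_id", "amount"]},
        }},
        "expressions": [
            {"selection": {"directReference": {"structField": {"field": 1}}}}
        ],
    }},
}}}}]}

inputs, outputs, field_refs = extract_lineage(plan)
entity = to_atlas_process("query-0001", inputs, outputs)
```

For this sample plan, `inputs` holds `sales.orders`, `outputs` holds `sales.daily_totals`, and `field_refs` holds the index of the projected column. The resulting `entity` dict is the kind of payload that would be sent to Atlas, wrapped in whatever request envelope the chosen Atlas client expects.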
I am still new to the project, so let me know if there's a preference on where the contribution should be committed.