kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Using kedro for a data dictionary #1234

Closed: eepgwde closed this issue 2 years ago

eepgwde commented 2 years ago

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

Most data warehouses have a painstakingly generated and curated data dictionary. Every field of every table is described, its usages are located, and it is possible to edit descriptions and add caveats.

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Kedro has a very good catalog for tables and pipelines, but nothing for columns, schemas, or metadata.

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Would it be possible to add a "metadata" pipeline that goes through all the tables that currently exist and catalogs their columns?
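As a rough illustration of the idea, here is a minimal sketch of what the core of such a "metadata" step might do. It is not an existing Kedro feature: the function names `describe_columns` and `build_data_dictionary` are hypothetical, and the tables are assumed to be already loaded as plain lists of row dictionaries (in a real project they would presumably come from the catalog, which is not shown here).

```python
from typing import Any, Mapping, Sequence


def describe_columns(rows: Sequence[Mapping[str, Any]]) -> dict:
    """Collect each column name and the Python type names observed in it."""
    columns: dict = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, set()).add(type(value).__name__)
    # Sort the type names so the output is deterministic.
    return {name: sorted(types) for name, types in columns.items()}


def build_data_dictionary(
    tables: Mapping[str, Sequence[Mapping[str, Any]]],
) -> list:
    """Hypothetical helper: one data-dictionary entry per (table, column).

    The "description" field is left empty so that a person can fill in
    descriptions and caveats later.
    """
    entries = []
    for table_name, rows in tables.items():
        for column, types in describe_columns(rows).items():
            entries.append(
                {
                    "table": table_name,
                    "column": column,
                    "types": types,
                    "description": "",
                }
            )
    return entries


if __name__ == "__main__":
    # Toy in-memory tables standing in for datasets loaded elsewhere.
    tables = {
        "customers": [{"id": 1, "name": "Ada"}, {"id": 2, "name": None}],
        "orders": [{"order_id": 10, "amount": 9.5}],
    }
    for entry in build_data_dictionary(tables):
        print(entry)
```

The resulting list of entries could then be saved as its own dataset, so that the descriptions are editable and versioned alongside the rest of the project.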

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

Altova and other companies offer software solutions for databases and spreadsheets. These work by extracting an XML schema and building a nested DOM.

datajoely commented 2 years ago

Hi @eepgwde, I would absolutely love this. It has come up before, and I would love to introduce it eventually, or perhaps see an open-source kedro-data-docs plugin emerge, like those we've seen for MLflow, Dolt and Neptune :)

Would you mind adding your thoughts to https://github.com/kedro-org/kedro/issues/1076, since it discusses this very problem? 🚀

merelcht commented 2 years ago

Closing this ticket now in favour of continuing the discussion in https://github.com/kedro-org/kedro/issues/1076