Put kedro catalog on-line

kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

https://kedro.org

Apache License 2.0


Put kedro catalog on-line #1239

Closed eepgwde closed 10 months ago

eepgwde commented 2 years ago

Description

The Kedro catalog is useful enough that non-Kedro users could use it as a data dictionary.

Context

The output of Kedro data processing is usually made available by users on cloud storage or cloud services. It would be useful to see a table's load path, so that an end user could take the S3 path and use it with Spark, Athena or even PowerBI.

It would also help to see the chain of load paths for a pipeline, and a set of URIs for the tables.
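To make the request concrete, here is a minimal sketch of how load paths could be pulled out of a catalog and listed as a data dictionary. It assumes the catalog has already been parsed into a plain dictionary shaped like a catalog.yml file. The dataset names, types and bucket paths are invented for illustration, and load_paths and format_table are hypothetical helpers, not part of Kedro.

```python
from typing import Any, Dict, List

# Hypothetical catalog content, shaped like a parsed catalog.yml.
# Names, dataset types and bucket paths are illustrative only.
CATALOG: Dict[str, Dict[str, Any]] = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://example-bucket/01_raw/companies.csv",
    },
    "model_input_table": {
        "type": "spark.SparkDataset",
        "filepath": "s3a://example-bucket/03_primary/model_input_table.parquet",
        "file_format": "parquet",
    },
    "metrics": {"type": "MemoryDataset"},  # no storage location
}


def load_paths(catalog: Dict[str, Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return one row per dataset: its name, type and storage location."""
    rows = []
    for name, entry in sorted(catalog.items()):
        if name.startswith("_"):
            # Underscore-prefixed keys are conventionally YAML anchors, not datasets.
            continue
        location = entry.get("filepath") or entry.get("path") or ""
        rows.append(
            {"dataset": name, "type": str(entry.get("type", "")), "location": location}
        )
    return rows


def format_table(rows: List[Dict[str, str]]) -> str:
    """Render the rows as a plain-text table that could be pasted into a wiki."""
    lines = ["dataset | type | location"]
    for row in rows:
        lines.append(f"{row['dataset']} | {row['type']} | {row['location']}")
    return "\n".join(lines)


print(format_table(load_paths(CATALOG)))
```

An end user could copy the location column straight into Spark or Athena. Datasets without a filepath or path entry get an empty location rather than being dropped, so the dictionary still lists every table.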

Other useful things would follow: recording notes about tables, writing up constraints, and running a multi-user wiki on a Kedro project.

Possible Implementation

I think it would be a Node.js server. Most of the JavaScript could be server-side.

Possible Alternatives

I have used similar data dictionaries from Altova. You have to do a lot of the coding yourself.

There are academics working on "Wikify your Metadata!"

datajoely commented 2 years ago

I think this one is also related to https://github.com/kedro-org/kedro/issues/1076. If you have the metadata to build docs, where you host them becomes an implementation detail.

astrojuanlu commented 1 year ago

It is not entirely clear to me whether this issue is about putting the catalog.yml (and companion files, for example a directory with different catalog* patterns) in a remote location (say, an object store like S3) and accessing it from Python, or rather about creating a web application that serves the catalog through an API and deploying that app in a cloud service.

@eepgwde I know it's been a long time but by any chance would you like to provide a bit more context?
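For reference, the second reading (a web application that serves the catalog through an API) could look roughly like the sketch below. It uses Python's standard-library http.server rather than the Node.js server suggested in the issue. ENTRIES, the /catalog endpoint, catalog_payload, CatalogHandler and serve are all hypothetical names chosen for illustration, and nothing here is an existing Kedro feature.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict

# Hypothetical catalog entries, as they might look after parsing catalog.yml.
ENTRIES: Dict[str, Dict[str, Any]] = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://example-bucket/01_raw/companies.csv",
    },
}


def catalog_payload(entries: Dict[str, Dict[str, Any]]) -> bytes:
    """Serialise the catalog entries as UTF-8 encoded JSON."""
    return json.dumps(entries, indent=2, sort_keys=True).encode("utf-8")


class CatalogHandler(BaseHTTPRequestHandler):
    """Answer GET /catalog with the JSON catalog; everything else is a 404."""

    def do_GET(self) -> None:
        if self.path != "/catalog":
            self.send_error(404, "unknown endpoint")
            return
        body = catalog_payload(ENTRIES)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve the catalog on localhost until the process is interrupted."""
    HTTPServer(("127.0.0.1", port), CatalogHandler).serve_forever()
```

Calling serve() and then requesting http://127.0.0.1:8000/catalog would return the entries as JSON. A real deployment would also need authentication, since load paths can reveal bucket names.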

datajoely commented 1 year ago

Slightly tangential, but I think it would be interesting for kedro run --conf-source=<path-to-new-conf-directory> to support fsspec. That would also allow multiple projects to share a catalog.

datajoely commented 1 year ago

We actually have this CLI-level logic available for micropackaging.

merelcht commented 10 months ago

This issue hasn't had any recent activity, so I'm closing it.