kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[DataCatalog]: Hard to manage large catalogs #3938

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

Users find large catalogs difficult to manage. Because the configuration is kept separate from the code, working with it requires excessive navigation back and forth, and the YAML-based data catalog itself is cumbersome to manage and navigate.

We propose to:

  1. Explore the opportunity to offer an alternative to YAML-based catalogs that can be integrated with the current configuration approach.
  2. Explore how the existing VS Code plugin simplifies working with large catalogs, and extend it with features for easy navigation.

Context

"I believe there's room for innovation in how the data catalog is structured in relation to the code. Currently, the configuration is organized differently and separately from the code, which requires a lot of navigation back and forth. Maybe an alternative where the catalog lives closer to where it's used could potentially reduce this overhead and improve productivity."

merelcht commented 3 weeks ago

I'd like to understand this pain point a bit more. Is it just about navigating between the catalog and pipeline/nodes? The topic of large catalogs has come up in the past and we've built several solutions for it, e.g. environments and dataset factories. But it sounds like this user is struggling with something slightly different.
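
For readers unfamiliar with dataset factories: the idea is that a single pattern entry stands in for many near-identical catalog entries, which is why they help with catalog size but not necessarily with navigation. The sketch below illustrates that idea in plain Python. It is a simplified, standalone approximation and not Kedro's implementation; `CATALOG_PATTERNS`, `resolve`, and the file paths are hypothetical names chosen for the example.

```python
import re

# Hypothetical catalog holding one factory-style pattern instead of
# one entry per CSV dataset.
CATALOG_PATTERNS = {
    "{name}_csv": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{name}.csv",
    },
}


def _pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '{name}_csv' into a regex with one named group per placeholder."""
    # re.split with a capturing group alternates literal text (even indices)
    # and placeholder names (odd indices).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>.+)"
    return re.compile(f"^{regex}$")


def resolve(dataset_name: str, patterns: dict = CATALOG_PATTERNS):
    """Return a concrete config for dataset_name, or None if nothing matches."""
    for pattern, config in patterns.items():
        match = _pattern_to_regex(pattern).match(dataset_name)
        if match:
            # Fill the captured placeholder values into every string field.
            return {
                key: value.format(**match.groupdict())
                if isinstance(value, str)
                else value
                for key, value in config.items()
            }
    return None


print(resolve("companies_csv"))
print(resolve("reviews"))
```

Here `companies_csv`, `shuttles_csv`, and any other name ending in `_csv` resolve through the same entry, while a name that matches no pattern returns `None` and would still need an explicit entry.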