kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[DataCatalog]: Add functionality to search datasets in the catalog #3917

Open ElenaKhaustova opened 1 month ago

ElenaKhaustova commented 1 month ago

Description

Users struggle to find datasets within the catalog, particularly when dealing with a large number of datasets. They express the need for search features to facilitate dataset discovery.

Context

"As a user in my list object, I can filter by name but I can't filter by what. So it would be good to be able to say give me all the sql datasets and then the names of the tables that are attached."

Comment from @astrojuanlu: Kedro-Viz has a roadmap item to include a table view of all the metadata, which could help with this.

Possible Implementation

Integrate search functionality into the catalog, enabling users to search for datasets based on keywords, patterns and by kind. Include support for regex search to accommodate users with advanced search requirements.
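To make the proposal concrete, here is a minimal sketch of what such a search could look like. It is not Kedro's API: `FakeDataset`, `search_datasets`, and the `kind` and `location` fields are hypothetical names invented for illustration, and the catalog is modelled as a plain dict so the example runs without Kedro installed.

```python
import re
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a dataset object (not a Kedro class)."""
    kind: str      # e.g. "pandas.ParquetDataset"
    location: str  # file path or table name


def search_datasets(datasets, name_pattern=None, kind=None):
    """Return sorted names matching an optional regex and/or a kind substring."""
    name_re = re.compile(name_pattern, re.IGNORECASE) if name_pattern else None
    hits = []
    for name, dataset in datasets.items():
        # Skip entries whose name does not match the regex (if one was given).
        if name_re and not name_re.search(name):
            continue
        # Skip entries whose kind does not contain the requested substring.
        if kind and kind.lower() not in dataset.kind.lower():
            continue
        hits.append(name)
    return sorted(hits)


# Hypothetical catalog contents, keyed by dataset name.
catalog = {
    "companies": FakeDataset("pandas.CSVDataset", "data/01_raw/companies.csv"),
    "processed_companies": FakeDataset(
        "pandas.ParquetDataset", "data/02_intermediate/companies.parquet"
    ),
    "reviews": FakeDataset("pandas.SQLTableDataset", "reviews"),
}

by_name = search_datasets(catalog, name_pattern="compan")
by_kind = search_datasets(catalog, kind="sql")
both = search_datasets(catalog, name_pattern="^comp", kind="csv")
```

Combining both filters in one call covers the "give me all the SQL datasets" request from the Context section while keeping plain name search as the default.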

datajoely commented 1 month ago

Also search by kind - if I wanted to find all Parquet files today I'd have to get very creative. Retrieving paths associated with those would be super complicated.
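One workaround that is possible today is to filter the catalog configuration rather than the catalog object. The sketch below assumes the `catalog.yml` content has already been parsed into a dict whose entries carry `type` and `filepath` keys; `paths_by_kind` and the sample entries are hypothetical and only illustrate the idea.

```python
def paths_by_kind(catalog_config, kind_suffix):
    """Map dataset name -> filepath for entries whose `type` ends with kind_suffix."""
    return {
        name: entry.get("filepath")
        for name, entry in catalog_config.items()
        # Ignore non-dict entries (for example YAML anchors or plain values).
        if isinstance(entry, dict)
        and str(entry.get("type", "")).endswith(kind_suffix)
    }


# Hypothetical parsed catalog configuration.
catalog_config = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
    },
    "preprocessed_companies": {
        "type": "pandas.ParquetDataset",
        "filepath": "data/02_intermediate/preprocessed_companies.parquet",
    },
    "model_input_table": {
        "type": "pandas.ParquetDataset",
        "filepath": "data/03_primary/model_input_table.parquet",
    },
}

parquet_paths = paths_by_kind(catalog_config, "ParquetDataset")
```

This only sees what is written in the configuration, so datasets created in code or resolved from patterns at runtime would be missed, which is part of why built-in support would help.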

astrojuanlu commented 1 month ago

@stephkaiser do you remember if we already opened an issue or discussion about the "metadata table view"?

(cc @rashidakanchwala for when you're back)

yury-fedotov commented 1 month ago

... enabling users to search for datasets based on keywords or patterns...

Isn't that what catalog.list() already does?

IIRC, if you do e.g. catalog.load("compani"), it raises an error along the lines of: did you mean one of ["companies", "processed_companies"]?
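For reference, this kind of "did you mean" suggestion can be approximated with the standard library's `difflib`. This is a sketch under the assumption that fuzzy name matching is what produces the hint; Kedro's actual matching logic and error wording may differ, and `suggest_names` is a hypothetical helper.

```python
import difflib


def suggest_names(requested, known_names, cutoff=0.6):
    """Return up to three known names that look similar to the requested one."""
    return difflib.get_close_matches(requested, known_names, n=3, cutoff=cutoff)


names = ["companies", "processed_companies", "reviews"]
suggestions = suggest_names("compani", names)
no_match = suggest_names("zzz", names)
```

Which names come back depends on the similarity cutoff, so a fuzzy hint like this helps with typos but is not a substitute for an explicit search by pattern or kind.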

astrojuanlu commented 1 month ago

Related #3312

stephkaiser commented 1 month ago

@stephkaiser do you remember if we already opened an issue or discussion about the "metadata table view"?

(cc @rashidakanchwala for when you're back)

@astrojuanlu we currently don't have an issue for this. I believe the idea came up while we were discussing https://github.com/kedro-org/kedro-viz/issues/1635

astrojuanlu commented 3 weeks ago

Note that catalog.list() supports regex; see https://github.com/kedro-org/kedro/pull/3924

astrojuanlu commented 3 weeks ago

But

Integrate search functionality into the catalog, enabling users to search for datasets based on keywords, patterns and by kind. Include support for regex search to accommodate users with advanced search requirements.

this is a bit more advanced, I'd say

merelcht commented 3 weeks ago

When you say "search datasets in the catalog", what workflow are you talking about? Is this inside a notebook, on the CLI, directly in the IDE or on Kedro-Viz? Each of these user flows might have a different preferred solution.