kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[DataCatalog]: Error message is confusing when kedro-datasets is not installed #3911

Closed by ElenaKhaustova 2 weeks ago

ElenaKhaustova commented 1 month ago

Description

When kedro-datasets is not installed, the resulting error message is not informative.

We propose enhancing the error message so that it clearly states the root cause of the failure when dataset dependencies are missing.

Relates to https://github.com/kedro-org/kedro/issues/2401

Context

Currently, users are required to install the dependencies for every dataset in the catalog, even for datasets they do not use (for example, when running only part of a pipeline, or when using the catalog standalone and never loading some of its datasets). The error message generated when the dependencies of some datasets are not installed is unclear, making it difficult for users to understand why the pipeline fails.

Example of the current error message:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'pandas.CSVDataset' not found, is this a typo?

This error is raised while the dataset configuration is parsed. It does not suggest the straightforward fix of installing the necessary package, so users may not immediately realize that the problem is missing software rather than a typo in their configuration. This can lead to confusion and delays.
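To illustrate the distinction this issue asks for, here is a minimal sketch of a resolver that reports a missing module separately from a misspelled class name. It is not Kedro's implementation: `resolve_dataset_class`, `SEARCH_PREFIXES`, and the message wording are hypothetical, and the lookup order Kedro actually uses may differ.

```python
import importlib

# Hypothetical search prefixes; the lookup order Kedro really uses may differ.
SEARCH_PREFIXES = ("kedro_datasets.", "")


def resolve_dataset_class(type_name, prefixes=SEARCH_PREFIXES):
    """Return the class named by ``type_name`` or raise an error naming the likely cause."""
    module_part, _, class_name = type_name.rpartition(".")
    if not module_part:
        raise ValueError("expected a dotted path such as 'pandas.CSVDataset'")

    missing = []  # names of modules that could not be imported
    for prefix in prefixes:
        module_path = f"{prefix}{module_part}"
        try:
            module = importlib.import_module(module_path)
        except ModuleNotFoundError as exc:
            # exc.name is the module that was actually missing, which may be
            # a dependency of module_path rather than module_path itself.
            missing.append(exc.name or module_path)
            continue
        if hasattr(module, class_name):
            return getattr(module, class_name)

    if missing:
        hint = ", ".join(sorted(set(missing)))
        raise ImportError(
            f"Class '{type_name}' could not be loaded because these modules "
            f"are not importable: {hint}. A dataset dependency may be missing, "
            f"or the type may be misspelled."
        )
    # Every candidate module imported, so the class name itself is the suspect.
    raise AttributeError(f"Class '{type_name}' not found, is this a typo?")
```

With this split, the "is this a typo?" wording is only used when every candidate module imported successfully. The missing-module message deliberately names the modules that failed to import instead of recommending one specific package to install.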

astrojuanlu commented 3 weeks ago

Related to #2943

astrojuanlu commented 3 weeks ago

Also, as much as I'd like to see a more explicit call for users to pip install kedro-datasets[whatever], it's also true that kedro-datasets is not, and should not be, the only package providing the datasets the user is looking for...

The phrasing of the error message & call to action here is important.