kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks source link

Update error message when `kedro-datasets` is not installed or `DataSet` spelling is used #3952

Closed ankatiyar closed 1 week ago

ankatiyar commented 2 weeks ago

Description

Resolve #3911, resolve #3909

Development notes

The error message below shows up in a number of cases:

Class '{dataset_name}' not found, is this a typo?

When a user is trying to use a Kedro (not custom) dataset and -

Since not having kedro-datasets is not the only way this error is triggered, I thought I'd update the error message with a "Hint" like in #3944 which suggests it could be a potential solution without assuming it is the ONLY solution. The hint is different (related to #3909) when "DataSet" spelling is used. I didn't change the "is this a typo?" part because that very well might be a case, specially if it's a custom dataset which still should work with the "DataSet" spelling unless the user has changed it.

I'd like to see what the team thinks about the messaging of the error! There's still some ongoing discussion on #3909 so I can revert this part for now.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

merelcht commented 2 weeks ago

I'm happy with this approach 👍

datajoely commented 2 weeks ago

In the past we try to improve (3) but (1)/(2) remains confusing. I wonder if it is simpler to list out all 3 cases explicitly and avoid trying to be too smart?

Explicit is better than implicit

ankatiyar commented 2 weeks ago

@noklam It seems like if an optional requirement is missing, this error is not triggered. It is when trying to run the project, the error that shows up is when the dataset is being loaded(or saved) -

kedro.io.core.DatasetError: Failed while loading data from data set ExcelDataset(filepath=/Users/ankita_katiyar/kedro_projects/demo/data/01_raw/shuttles.xlsx, load_args={'engine': openpyxl}, protocol=file, save_args={'index': False}, writer_args={'engine': openpyxl}).
Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
datajoely commented 1 week ago

This is great - in the past this was an issue because sometimes it was stupid to install all of spark just to just run the Pandas parts.

I wonder if it's finally time for us to may something like kedro run --dry a 1st class command for validating if a session is 'runnable'. It would be useful for our own testing and for users to build an integration test.

noklam commented 1 week ago

@ankatiyar yes, the error is triggered in different place. Since this issue only focus on the typo / kedro-datasets missing case, I have approved this now.