kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars, 875 forks

[DataCatalog]: Error message is confusing when using `DataSet` instead of `Dataset` #3909

Closed. ElenaKhaustova closed this 2 weeks ago

ElenaKhaustova commented 1 month ago

Description

There is confusion between the DataSet and Dataset spellings, and the error message is not informative when the old naming is used. The dataset classes were renamed in 0.19, but people miss that fact when switching to the new version.

Relates to https://github.com/kedro-org/kedro/issues/2401

Context

Example of the current error message:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'pandas.CSVDataSet' not found, is this a typo?

datajoely commented 1 month ago

I wonder if we could auto-fix this with a big warning
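
To make the auto-fix idea concrete, here is a minimal sketch in plain Python. It is not Kedro's implementation: `autofix_dataset_type` is a hypothetical helper, and the choice of `FutureWarning` (shown by default, unlike `DeprecationWarning`) is an assumption.

```python
import warnings


def autofix_dataset_type(type_name: str) -> str:
    """Hypothetical helper: rewrite a trailing 'DataSet' to 'Dataset' and warn.

    Not part of Kedro; it only illustrates "auto-fix with a warning".
    """
    if type_name.endswith("DataSet"):
        fixed = type_name[: -len("DataSet")] + "Dataset"
        warnings.warn(
            f"'{type_name}' uses the old 'DataSet' spelling; "
            f"using '{fixed}' instead. Please update your catalog.",
            FutureWarning,
            stacklevel=2,
        )
        return fixed
    # Names that already use the new spelling pass through untouched.
    return type_name


print(autofix_dataset_type("pandas.CSVDataSet"))  # pandas.CSVDataset
```

A blind rename like this would also rewrite a custom dataset that is genuinely spelled with a capital S, so a real version would need to confirm that the original name cannot be loaded before touching it.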

astrojuanlu commented 3 weeks ago

I have a couple questions on this one:

ElenaKhaustova commented 3 weeks ago

> I have a couple questions on this one:
>
>   • As far as I understand, projects created with our starters have a kedro_init_version that limits what version of Kedro can be used. If, say, someone created a project with Kedro 0.18 (uppercase S datasets from kedro.extras) and then tried to use Kedro 0.19 (no kedro.extras at all, need to install kedro-datasets), they would get an error, right?
>
>     • I also reckon though that the versioning of kedro-datasets is not limited by kedro_init_version
>     • In other words, how does this problem manifest itself nowadays? What sequence of steps gets us to here?
>   • It is well known that upgrading a Kedro version is hard in general (but I could not locate an issue for it). By looking at this problem from that angle, and considering that clearly it arises from people not reading our existing migration guides, can we provide linters ("kedro-lint") or semi-automatic migration utils ("kedro-modernize") to help with this task, rather than limiting ourselves to improving the traceback?
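
For illustration, the linter idea from the last question could start as small as the sketch below. Everything in it is an assumption: `lint_catalog` and `OLD_TYPE` are hypothetical names, and it scans the raw catalog text line by line for `type:` entries instead of parsing YAML, so it only catches the common one-entry-per-line layout.

```python
import re

# Matches catalog lines such as "  type: pandas.CSVDataSet" (old spelling only).
OLD_TYPE = re.compile(r"^(\s*type:\s*)(\S*?)DataSet\s*$")


def lint_catalog(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, old name, suggested name) for each old-style type."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = OLD_TYPE.match(line)
        if match:
            prefix = match.group(2)
            findings.append((lineno, prefix + "DataSet", prefix + "Dataset"))
    return findings


catalog_text = """\
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
"""

for lineno, old, new in lint_catalog(catalog_text):
    print(f"line {lineno}: '{old}' should probably be '{new}'")
```

The tool only reports and never rewrites, which keeps it safe to run on projects that define their own datasets with the capital-S spelling.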

So far, we know that this still happens when users have an existing Kedro project created for an older version and then upgrade Kedro to a newer one. Another reason mentioned by interviewees is that our old blog posts contain examples with the old naming, which is fair, because it was correct at the time. But some users still follow those examples and get confused.

I've also requested some extra details from the user side to better answer your questions.

ElenaKhaustova commented 3 weeks ago

@astrojuanlu the blog post mentioned above: https://kedro.org/blog/add-kedro-to-your-data-science-notebook

astrojuanlu commented 3 weeks ago

Very good point about old training material using the old names, didn't think about that... This might be a problem that will need some time to go away then, and we might indeed need to take some action on our side.

merelcht commented 3 weeks ago

Looking at the error:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'pandas.CSVDataSet' not found, is this a typo?

I would still argue that the error isn't confusing: it states exactly what the problem is, namely spelling DataSet with a capital S instead of a lowercase s, which is indeed a typo. Now the question is whether we can add some additional clarification so that people check the lower/upper-case spelling. At the same time, it will be tricky to do specific matching for DataSet endings, because the user could have custom datasets that use that spelling and work fine.
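
One way to add the clarification without touching working custom datasets is to build the hint only after the class lookup has already failed, and only when the lowercase-s variant is known to exist. A rough sketch, where `missing_class_hint` and the `available` set are hypothetical stand-ins for whatever lookup the catalog really performs:

```python
def missing_class_hint(type_name: str, available: set[str]) -> str:
    """Build the 'not found' message, adding a spelling hint when it is safe.

    `available` stands in for the set of dataset class names that can be
    loaded; it is an assumption of this sketch, not a Kedro API.
    """
    message = f"Class '{type_name}' not found, is this a typo?"
    if type_name.endswith("DataSet"):
        candidate = type_name[: -len("DataSet")] + "Dataset"
        # Suggest the rename only if the lowercase-s class really exists.
        if candidate in available:
            message += (
                f" Did you mean '{candidate}'?"
                " Dataset classes now end in 'Dataset' (lowercase s)."
            )
    return message


known = {"pandas.CSVDataset", "my_project.datasets.LegacyDataSet"}
print(missing_class_hint("pandas.CSVDataSet", known))
print(missing_class_hint("pandas.CVSDataSet", known))
```

A custom dataset such as the made-up `my_project.datasets.LegacyDataSet` loads successfully, so it never reaches this function, and a name with no matching lowercase-s class gets the unchanged message.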