kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars 897 forks source link

Make import failures in `kedro-datasets` clearer #2943

Closed astrojuanlu closed 2 months ago

astrojuanlu commented 1 year ago

Description

[!NOTE]
See https://github.com/kedro-org/kedro/issues/2943#issuecomment-1944348860 for current status

Disambiguate the "module does not exist" error from the ImportError in messages like:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

This task will require investigation into how the error messages are swallowed, which will inform the proper implementation. Probably change: https://github.com/kedro-org/kedro/blob/4194fbd16a992af0320a395c0060aaaea356efb2/kedro/io/core.py#L433

Context

This week I've been battling some puzzling documentation errors and after a while I noticed that, if the dependencies of a particular dataset are not present, the ImportError is swallowed silently. Examples:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/biosequence/__init__.py#L7-L8

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/networkx/__init__.py#L8-L15

This was done in https://github.com/quantumblacklabs/private-kedro/pull/575 to solve https://github.com/quantumblacklabs/private-kedro/issues/563 at the same time dependencies were moved to extras_require.

I see how not suppressing these errors could be extremely annoying back then, because kedro.io used to re-export all the datasets in its __init__.py:

https://github.com/quantumblacklabs/private-kedro/blob/f7dd2478aec4de1b46afbaded9bce3c69bff6304/kedro/io/__init__.py#L29-L47

# kedro/io/__init__.py

"""``kedro.io`` provides functionality to read and write to a
number of data sets. At core of the library is ``AbstractDataSet``
which allows implementation of various ``AbstractDataSet``s.
"""

from .cached_dataset import CachedDataSet  # NOQA
from .core import AbstractDataSet  # NOQA
from .core import AbstractVersionedDataSet  # NOQA
from .core import DataSetAlreadyExistsError  # NOQA
from .core import DataSetError  # NOQA
from .core import DataSetNotFoundError  # NOQA
from .core import Version  # NOQA
from .data_catalog import DataCatalog  # NOQA
from .data_catalog_with_default import DataCatalogWithDefault  # NOQA
from .lambda_data_set import LambdaDataSet  # NOQA
from .memory_data_set import MemoryDataSet  # NOQA
from .partitioned_data_set import IncrementalDataSet  # NOQA
from .partitioned_data_set import PartitionedDataSet  # NOQA
from .transformers import AbstractTransformer  # NOQA

However, now our __init__.py is empty and datasets are meant to be imported separately:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/__init__.py#L1-L3

So I think it would be much better if we did not silence those import errors.

More context

If one dependency is missing, the user would get an unhelpful "module X has no attribute Y" when trying to import a dataset rather than an actual error:

> pip uninstall biopython                          (kedro-dev) 
Found existing installation: biopython 1.81
Uninstalling biopython-1.81:
  Would remove:
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/Bio/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/BioSQL/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/biopython-1.81.dist-info/*
Proceed (Y/n)? y
  Successfully uninstalled biopython-1.81
> python                                           (kedro-dev) 
Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from kedro_datasets.biosequence import BioSequenceDataSet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BioSequenceDataSet' from 'kedro_datasets.biosequence' (/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/kedro_datasets/biosequence/__init__.py)
noklam commented 1 year ago

Last time I saw this I was thinking the same, it turns out to be not trivial because of this. Not sure if things are still the same but worth mentioning.

https://github.com/kedro-org/kedro/blob/38dfe6f5e9d2e8758a732044267769aafb1ff465/kedro/io/core.py#L381-L389

astrojuanlu commented 1 year ago

Thanks @noklam - I see that an error saying "some of the dependencies have not been installed" is raised if the desired class is the list of_DEFAULT_PACKAGES (which includes "", hence allowing the user to type the full import path) is exhausted:

https://github.com/kedro-org/kedro/blob/38dfe6f5e9d2e8758a732044267769aafb1ff465/kedro/io/core.py#L392-L396

I see how we would not like to bubble all ImportErrors to the user because in most cases, only 1 of those prefixes will have the desired class. Maybe we'd want to bubble up ModuleNotFoundError errors, but definitely I'd have to think this through.

noklam commented 1 year ago

@astrojuanlu I think I come to similar conclusion, actually I am not sure why the issue disappear I thought I created it but I couldn't find it...

noklam commented 1 year ago

I would push for this if it's possible. Definitely one of the most annoying thing and as a beginner you will hit this issue for sure (almost as annoying as the DataSet vs Dataset typo issue)

astrojuanlu commented 1 year ago

Updated the title of this issue and adding more evidence:

https://linen-slack.kedro.org/t/14153932/hey-team-i-m-getting-the-following-exception-an-exception-oc#8b6debb1-ed6f-4a16-9440-5a03d0370b07

Hey team, I'm getting the following exception:

An exception occurred when parsing config for dataset 'raw_web@delta':
Object 'DeltaTableDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.DeltaTableDataSet:
<https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>

If I go to the link, it doesn't tell me what I should be installing. I already have the following in the requirements.txt: kedro-datasets[spark,delta]

Indeed, https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html contains a lot and is difficult to parse.

I believe the same thing happened just this morning to @NeroOkwa.

Maybe it would be better if the underlying ImportError could be shown.

In any case, this is something that should be changed on Kedro, so I'm moving the issue.

noklam commented 1 year ago

Remove all the with suppress(ImportError):.

Without reading the full context, are you aware of this is merged already? I think the error is surfaced now already (with kedro-datasets). I haven't fully tested it tho.

astrojuanlu commented 1 year ago

This is what I see now:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

doesn't tell me whether I made a typo, or what dependency I need to install. Am I missing something? This is with kedro-datasets 1.5.2

noklam commented 1 year ago

Will revisit that when we fixed the dataset installation issue. If this is not the case yet, I think we are in a good position to solve this finally.

noklam commented 11 months ago

I want to have more clarification of this ticket. Is this a kedro or kedro-datasets ticket? The description and discussion is slightly outdated because https://github.com/kedro-org/kedro-plugins/pull/277 change how we import library.

The original error mentioned in the description is something in kedro rather than kedro-datasets

DatasetError: An exception occurred when parsing config for dataset 'companies': Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

from kedro_datasets.biosequence import BioSequenceDataSet

ModuleNotFoundError: No module named 'Bio'

@astrojuanlu Would you be able to give me an fail cases so I can investigate this?

astrojuanlu commented 11 months ago

As far as I understand, the logic to import datasets still lives in Kedro:

https://github.com/kedro-org/kedro/blob/93dc1a91e4bb476287040ea3db4a610696cacb0c/kedro/io/core.py#L389-L395

So in principle I think we can keep the ticket here, but up to you.

Let me give a reproducer.

astrojuanlu commented 11 months ago
❯ kedro info                                                                                                                                                                                    (kedro310) 
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: N
You have opted out of product usage analytics, so none will be collected.

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.18.14

Kedro is a Python framework for
creating reproducible, maintainable
and modular data science code.

Installed plugins:
kedro_telemetry: 0.2.4 (entry points:cli_hooks,hooks)
kedro_viz: 6.3.1 (entry points:global,line_magic)
❯ pip list | grep datasets                                                                                                                                                                      (kedro310) 
kedro-datasets                1.8.0
  1. kedro new --starter=pandas-iris
  2. kedro run (everything works fine)
  3. Change type: biosequence.BiosequenceDataSet in catalog.yml (deliberate typo)
  4. kedro run:
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  1. (confusion)
  2. pip install kedro-datasets[biosequence]
  3. kedro run
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  1. (confusion)
  2. change to type: biosequence.BioSequenceDataSet
  3. kedro run works ✔️

(notice that soon this will happen with DataSet vs Dataset

noklam commented 11 months ago

😅 from now on I may just repharse partially fix to part of #xxx.

astrojuanlu commented 7 months ago

Today we found another example of this giving an unhelpful error message:

pandas was present but kedro-datasets was not, so we got the "is this a typo?" error instead of the obvious "install kedro-datasets".

This was on top of https://github.com/kedro-org/kedro-plugins/issues/402 (since that fix hasn't been released yet) and https://github.com/kedro-org/kedro-plugins/issues/313.

@noklam should I open a new issue?

noklam commented 7 months ago

Do you have an example of that? Just checking how does this happen.

astrojuanlu commented 7 months ago

@iamelijahko could you provide a series of steps to arrive to the error you saw initially?

iamelijahko commented 7 months ago

Screenshot 2024-02-14 at 17 39 56

The screenshot depicts a sequence where the installation of a Kedro-related package with pandas support is attempted but fails due to missing system-level dependencies (HDF5). Following this, an attempt to start a Kedro IPython session also fails due to an issue with loading dataset definitions, which may or may not be related to the prior installation issue.

  1. I tried to install kedro-datasets[pandas] with pip for Kedro and pandas use.
  2. Installation needed more packages. One, probably the tables package, required HDF5 and failed to build.
  3. The failure was due to missing HDF5 headers; the error mentioned 'H5public.h' not found.
  4. After this, I tried to start a Kedro IPython session, which didn't work.
  5. Kedro couldn't load the project because of a DatasetError.
astrojuanlu commented 7 months ago

The installation issue was tracked in https://github.com/kedro-org/kedro-plugins/issues/402

For clarity, the problem here is that the "is this a typo?" error message seems to hint that the dataset name is mistyped, when in fact the dependencies are missing.

I'm just reopening.

noklam commented 7 months ago

For the record, the problem that we used to have is fixed, this is failing for a separate reason. This happened because the installation fails, so NOTHING is installed, not just hdf5. It is prompting for typo because kedro-datasets doesn't exist, it's not because of individual dataset dependency.

You will see same error if you put type: RANDOM.padnas.CSVDataset. We have two options here:

  1. Either treat kedro-datasets as a special case, and we check separately it kedro-datasets is importable.
  2. Remove the warning for typo, this is still not good because it will not warn "kedro-datasets" is not installed either.

I tend to think this is a niche case and the priority is lower. I remove the estimate and move it back to inbox

astrojuanlu commented 3 months ago

Addressed in #3952 @ankatiyar ?

astrojuanlu commented 2 months ago

I'm guessing so. Closing, but feel free to reopen if there's anything left.