Closed astrojuanlu closed 2 months ago
Last time I saw this I was thinking the same, it turns out to be not trivial because of this. Not sure if things are still the same but worth mentioning.
Thanks @noklam - I see that an error saying "some of the dependencies have not been installed" is raised if the desired class is the list of_DEFAULT_PACKAGES
(which includes ""
, hence allowing the user to type the full import path) is exhausted:
I see how we would not like to bubble all ImportError
s to the user because in most cases, only 1 of those prefixes will have the desired class. Maybe we'd want to bubble up ModuleNotFoundError
errors, but definitely I'd have to think this through.
@astrojuanlu I think I come to similar conclusion, actually I am not sure why the issue disappear I thought I created it but I couldn't find it...
I would push for this if it's possible. Definitely one of the most annoying thing and as a beginner you will hit this issue for sure (almost as annoying as the DataSet vs Dataset typo issue)
Updated the title of this issue and adding more evidence:
Hey team, I'm getting the following exception:
An exception occurred when parsing config for dataset 'raw_web@delta': Object 'DeltaTableDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.DeltaTableDataSet: <https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>
If I go to the link, it doesn't tell me what I should be installing. I already have the following in the requirements.txt: kedro-datasets[spark,delta]
Indeed, https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html
contains a lot and is difficult to parse.
I believe the same thing happened just this morning to @NeroOkwa.
Maybe it would be better if the underlying ImportError
could be shown.
In any case, this is something that should be changed on Kedro, so I'm moving the issue.
Remove all the with suppress(ImportError):.
Without reading the full context, are you aware of this is merged already? I think the error is surfaced now already (with kedro-datasets). I haven't fully tested it tho.
This is what I see now:
DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.
doesn't tell me whether I made a typo, or what dependency I need to install. Am I missing something? This is with kedro-datasets 1.5.2
Will revisit that when we fixed the dataset installation issue. If this is not the case yet, I think we are in a good position to solve this finally.
I want to have more clarification of this ticket. Is this a kedro
or kedro-datasets
ticket? The description and discussion is slightly outdated because https://github.com/kedro-org/kedro-plugins/pull/277 change how we import library.
The original error mentioned in the description is something in kedro
rather than kedro-datasets
DatasetError: An exception occurred when parsing config for dataset 'companies': Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.
from kedro_datasets.biosequence import BioSequenceDataSet
ModuleNotFoundError: No module named 'Bio'
@astrojuanlu Would you be able to give me an fail cases so I can investigate this?
As far as I understand, the logic to import datasets still lives in Kedro:
So in principle I think we can keep the ticket here, but up to you.
Let me give a reproducer.
❯ kedro info (kedro310)
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics? [y/N]: N
You have opted out of product usage analytics, so none will be collected.
_ _
| | _____ __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
| < __/ (_| | | | (_) |
|_|\_\___|\__,_|_| \___/
v0.18.14
Kedro is a Python framework for
creating reproducible, maintainable
and modular data science code.
Installed plugins:
kedro_telemetry: 0.2.4 (entry points:cli_hooks,hooks)
kedro_viz: 6.3.1 (entry points:global,line_magic)
❯ pip list | grep datasets (kedro310)
kedro-datasets 1.8.0
kedro new --starter=pandas-iris
kedro run
(everything works fine)type: biosequence.BiosequenceDataSet
in catalog.yml
(deliberate typo)kedro run
:DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
pip install kedro-datasets[biosequence]
kedro run
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
type: biosequence.BioSequenceDataSet
kedro run
works ✔️ (notice that soon this will happen with DataSet
vs Dataset
😅 from now on I may just repharse partially fix
to part of #xxx
.
Today we found another example of this giving an unhelpful error message:
pandas
was present but kedro-datasets
was not, so we got the "is this a typo?" error instead of the obvious "install kedro-datasets".
This was on top of https://github.com/kedro-org/kedro-plugins/issues/402 (since that fix hasn't been released yet) and https://github.com/kedro-org/kedro-plugins/issues/313.
@noklam should I open a new issue?
Do you have an example of that? Just checking how does this happen.
@iamelijahko could you provide a series of steps to arrive to the error you saw initially?
The screenshot depicts a sequence where the installation of a Kedro-related package with pandas support is attempted but fails due to missing system-level dependencies (HDF5). Following this, an attempt to start a Kedro IPython session also fails due to an issue with loading dataset definitions, which may or may not be related to the prior installation issue.
kedro-datasets[pandas]
with pip for Kedro and pandas use.'H5public.h'
not found.Kedro IPython
session, which didn't work.The installation issue was tracked in https://github.com/kedro-org/kedro-plugins/issues/402
For clarity, the problem here is that the "is this a typo?" error message seems to hint that the dataset name is mistyped, when in fact the dependencies are missing.
I'm just reopening.
For the record, the problem that we used to have is fixed, this is failing for a separate reason. This happened because the installation fails, so NOTHING is installed, not just hdf5. It is prompting for typo
because kedro-datasets
doesn't exist, it's not because of individual dataset dependency.
You will see same error if you put type: RANDOM.padnas.CSVDataset
. We have two options here:
kedro-datasets
as a special case, and we check separately it kedro-datasets
is importable. I tend to think this is a niche case and the priority is lower. I remove the estimate and move it back to inbox
Addressed in #3952 @ankatiyar ?
I'm guessing so. Closing, but feel free to reopen if there's anything left.
Description
Disambiguate the "module does not exist" error from the
ImportError
in messages like:This task will require investigation into how the error messages are swallowed, which will inform the proper implementation. Probably change: https://github.com/kedro-org/kedro/blob/4194fbd16a992af0320a395c0060aaaea356efb2/kedro/io/core.py#L433
Context
This week I've been battling some puzzling documentation errors and after a while I noticed that, if the dependencies of a particular dataset are not present, the
ImportError
is swallowed silently. Examples:https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/biosequence/__init__.py#L7-L8
https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/networkx/__init__.py#L8-L15
This was done in https://github.com/quantumblacklabs/private-kedro/pull/575 to solve https://github.com/quantumblacklabs/private-kedro/issues/563 at the same time dependencies were moved to
extras_require
.I see how not suppressing these errors could be extremely annoying back then, because
kedro.io
used to re-export all the datasets in its__init__.py
:https://github.com/quantumblacklabs/private-kedro/blob/f7dd2478aec4de1b46afbaded9bce3c69bff6304/kedro/io/__init__.py#L29-L47
However, now our
__init__.py
is empty and datasets are meant to be imported separately:https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/__init__.py#L1-L3
So I think it would be much better if we did not silence those import errors.
More context
If one dependency is missing, the user would get an unhelpful "module X has no attribute Y" when trying to import a dataset rather than an actual error: