kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Clearer underlying dataset issues #3971

Open datajoely opened 5 months ago

datajoely commented 5 months ago

Description

A user reported that Kedro was unable to read a CSV file. They got the following logs in AWS: (image)

The "No columns to parse from file" error is thrown by the underlying pandas implementation in this file.

It would be helpful if Kedro could bubble up that the error is thrown in pandas.io.parsers.python_parser, so that it is clear where the issue lies. The error above mentions kedro.io.core.DatasetError; is it not possible to do the same?

astrojuanlu commented 4 months ago

It is unclear why those logs don't show tracebacks.

Anyway, the current implementation of AbstractDataset is responsible for that DatasetError:

https://github.com/kedro-org/kedro/blob/adfc593bcd2f1b74676e7ab7c1a3b9c168b7257f/kedro/io/core.py#L192-L202
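To make the wrapping pattern concrete, here is a minimal sketch of how a generic "catch everything and re-raise" load method behaves. It is not copied from Kedro: `DatasetError`, `_load` and `load` below are stand-ins defined for illustration, and the message format is an assumption. The point is only that `raise ... from exc` keeps the original exception reachable on `__cause__`, even when the final message mentions nothing but the wrapper.

```python
class DatasetError(Exception):
    """Stand-in for kedro.io.core.DatasetError (defined here for illustration)."""


def _load():
    # Stand-in for the underlying reader failing, e.g. a CSV parser.
    raise ValueError("No columns to parse from file")


def load():
    try:
        return _load()
    except DatasetError:
        # Already the wrapper type: let it through unchanged.
        raise
    except Exception as exc:
        # Assumed message format; only the text of the original error is copied.
        message = f"Failed while loading data from dataset.\n{exc}"
        # `from exc` stores the original exception on __cause__.
        raise DatasetError(message) from exc


try:
    load()
except DatasetError as err:
    wrapped = err

print(type(wrapped.__cause__).__name__)  # ValueError
```

With this pattern the type and origin of the underlying error survive only in the chained traceback, so anything that logs just `str(error)` loses them.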

datajoely commented 4 months ago

They must be in the exc object somewhere; I refuse to believe otherwise.
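As a rough sketch of what "somewhere in the exc object" could mean in practice: standard Python exception chaining exposes the original error via `__cause__` (or `__context__`), and its `__traceback__` records the frames it passed through. The helper `describe_root_cause` below is hypothetical, not a Kedro function, and `DatasetError` is again a local stand-in.

```python
class DatasetError(Exception):
    """Stand-in for kedro.io.core.DatasetError (defined here for illustration)."""


def describe_root_cause(exc: BaseException) -> str:
    """Follow the exception chain and report where the first error was raised."""
    root = exc
    seen = {id(root)}
    while True:
        nxt = root.__cause__ if root.__cause__ is not None else root.__context__
        if nxt is None or id(nxt) in seen:
            break
        seen.add(id(nxt))
        root = nxt

    # Walk to the innermost traceback frame: that is where `root` was raised.
    module = "<unknown>"
    tb = root.__traceback__
    while tb is not None:
        module = tb.tb_frame.f_globals.get("__name__", "<unknown>")
        tb = tb.tb_next
    return f"{type(root).__name__} raised in {module}: {root}"


def _read():
    # Stand-in for the library call that actually fails.
    raise ValueError("No columns to parse from file")


try:
    try:
        _read()
    except Exception as exc:
        raise DatasetError("Failed while loading data from dataset.") from exc
except DatasetError as err:
    summary = describe_root_cause(err)

print(summary)
```

If the failing frame lived inside pandas, the reported module would be the pandas module that raised, which is the information the issue asks Kedro to surface.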

ElenaKhaustova commented 4 months ago

Thank you @datajoely! Could you please provide some more context on which AWS service was used to run the Kedro pipeline? We would like to check whether the service is filtering the error messages, since as far as we can tell Kedro always shows the entire error log.

datajoely commented 4 months ago

I've asked the user to comment here to double check, but I think it was:

Docker image running on AWS ECS

astrojuanlu commented 3 weeks ago

First, I amend my comment above: the traceback is there (File /usr/local/...).

The problem of AbstractDataset hiding the real error has been mentioned in other places (https://github.com/kedro-org/kedro/issues/1936#issuecomment-1727172650, https://github.com/kedro-org/kedro/issues/2199#issuecomment-2101008300) although I don't think we have an issue for it (@ElenaKhaustova?). If that's the case, maybe we can keep this issue open?

astrojuanlu commented 1 week ago

In #2943 we partly addressed the issue of unclear errors with datasets. Yet we have a bit more evidence about this still being a problem.

For example: https://kedro.hall.community/running-kedroviz-on-docker-without-installing-the-library-H0d61LTldx29#bae33c48-aa82-447b-82e7-80486a95ecef

The user was getting

Class 'projx.models.audio.io.LargeModel' not found, is this a typo?

but the actual underlying error was:

>>> from projx.models.audio.io import LargeModel
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/src/projx/models/audio/__init__.py", line 1, in <module>
    from .base import LAM
  File "/app/src/projx/models/audio/base.py", line 1, in <module>
    from elevenlabs.client import ElevenLabs
ModuleNotFoundError: No module named 'elevenlabs'

Another internal user reported this today.
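One possible way to tell these two situations apart, sketched under assumptions: `ModuleNotFoundError` carries a `name` attribute with the module that could not be found, so a loader can check whether the missing module is the requested one (or one of its parent packages) or an unrelated dependency. `load_class` below is a hypothetical helper, not Kedro's implementation, and the error messages are illustrative.

```python
import importlib


def load_class(dotted_path: str):
    """Import `pkg.module.ClassName`, distinguishing typos from broken dependencies."""
    module_path, _, class_name = dotted_path.rpartition(".")
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # The requested module, or one of its parent packages, does not exist.
        if module_path == missing or module_path.startswith(missing + "."):
            raise LookupError(
                f"Class '{dotted_path}' not found, is this a typo?"
            ) from exc
        # The module exists, but something it imports is not installed.
        raise ImportError(
            f"Could not import '{module_path}' because its dependency "
            f"'{missing}' is not installed."
        ) from exc
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise LookupError(
            f"Class '{dotted_path}' not found, is this a typo?"
        ) from exc


print(load_class("collections.OrderedDict").__name__)  # OrderedDict
```

Under this sketch, the case above would report the missing `elevenlabs` dependency instead of suggesting a typo, while a genuinely misspelled path would still get the original hint.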