
How to get the actual method selected to read a Generic Dataset? #3636

Open gpierard opened 7 months ago

gpierard commented 7 months ago

Description

I am using a GenericDataSet to read txt.gz files, but I can't seem to print the actual method selected by kedro / pandas to read them.

The GenericDataSet doc specifies that:

"pandas.GenericDataSet loads/saves data from/to a data file using an underlying filesystem (e.g.: local, S3, GCS). It uses pandas to dynamically select the appropriate type of read/write target on a best effort basis."

and, for the parameters:

"file_format (str) – String which is used to match the appropriate load/save method on a best effort basis. For example if ‘csv’ is passed, the pandas.read_csv and pandas.DataFrame.to_csv will be identified. An error will be raised unless at least one matching read_{file_format} or to_{file_format} method is identified."

But how can I see at runtime which method is selected? In my case, this is what is defined in the catalog:

raw_data_soap:
  type: PartitionedDataSet
  path: mypath/
  dataset:
    type: kedro.extras.datasets.pandas.GenericDataSet
    file_format: fwf
    load_args:
      compression: gzip
      encoding: unicode_escape
      widths: [15, 60]
      names: ['TimeStamp', 'Value']
  filename_suffix: .txt.gz

From the arguments I suppose that read_fwf from pandas is used (pandas.read_fwf.html).

How can I make sure? Running inspect.getsource(data_loader) at runtime only shows the generic kedro methods, not the one actually selected by pandas to read my dataset.

def load(self) -> Any:
    self.resolve_load_version()  # Make sure last load version is set
    # gp: patching for debug
    # print(inspect.getsource(super().load))
    # print(inspect.getfile(super().load))
    return super().load()

data_loader() returns the already-read data.
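
To make the question concrete: my reading of the docstring is that the rule amounts to resolving read_{file_format} on the pandas module, something like the sketch below (my guess at the rule, not kedro's actual code):

import inspect

import pandas as pd

file_format = "fwf"  # the value from the catalog entry above

# Resolve "read_<file_format>" on the pandas module, as the docstring describes.
reader = getattr(pd, f"read_{file_format}", None)
print(reader)                     # e.g. <function read_fwf at 0x...>
print(inspect.getsource(reader))  # source of the selected pandas reader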

datajoely commented 6 months ago

The GenericDataSet is something that has to be maintained on a 'best effort' basis: pandas is such an old and complicated library that the assumptions we need to make sometimes don't hold. A good example is that the first kwarg of pd.read_csv is "filepath_or_buffer", whereas for pd.read_excel it is io ...

At a high level I think pd.read_fwf should work, since it requires no additional dependencies and meets the requirement of pointing to a file path (which pd.read_clipboard, for example, doesn't).

There is a chance that our assumptions don't hold here, so it would be super helpful if you could let us know why so we can make the change. Additionally, would you find it helpful or annoying to have a log message for every read saying which loader has been retrieved?

So there are a few ways - I think the easiest way to prove this is to copy the implementation to a locally accessible file in your repo and change the class path in your catalog to:

raw_data_soap:
  type: PartitionedDataSet
  path: mypath/
  dataset:
    type: my_project.path.to.pandas.GenericDataSet
  ...

You can then use a debugger or add logging calls to check that the right loader is being used.
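
As a rough illustration (untested, and _file_format is an internal attribute that may differ between versions, so check it against the source you copy), the local copy could log the reader it resolves before delegating to the parent class:

import logging
from typing import Any

import pandas as pd
from kedro.extras.datasets.pandas import GenericDataSet

logger = logging.getLogger(__name__)


class LoggingGenericDataSet(GenericDataSet):
    """GenericDataSet that logs which pandas reader it is about to use."""

    def _load(self) -> Any:
        # Same lookup the docstring describes: "fwf" -> pd.read_fwf.
        reader = getattr(pd, f"read_{self._file_format}", None)
        logger.info("GenericDataSet resolved reader: %s",
                    getattr(reader, "__name__", "no matching pandas reader"))
        return super()._load()

The catalog entry would then point at it, e.g. type: my_project.datasets.LoggingGenericDataSet (a hypothetical module path).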

gpierard commented 6 months ago

Personally I'd find it quite helpful to have the loader specified, especially since it is not self-evident which load method is being used (or perhaps this could be added in the docs). Thanks!