kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Can't reproduce the AbstractVersionedDataSet example #842

Closed naarkhoo closed 3 years ago

naarkhoo commented 3 years ago

Description

In short: I can't reproduce the AbstractVersionedDataSet example provided at https://kedro.readthedocs.io/en/latest/kedro.io.AbstractVersionedDataSet.html

Context

I was trying to read files as a datatable. I have also been getting other errors related to the other example, https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html#implement-the-load-method-with-fsspec, saying the object doesn't have a .load attribute.

Steps to Reproduce

  1. Make a file structure similar to what is described (src/package_name/extras/datasets)

  2. Make a YAML catalog entry like:

    test_dt:
      type: v2_kedro.extras.datasets.data_table.MyOwnDataSet
      filepath: data/01_raw/myfile.csv.gz
      version: '1'
  3. [And so on...]
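
For context, here is a minimal sketch of the kind of dataset class that catalog entry points at, following the fsspec pattern from the custom-datasets documentation page linked above. The class and module names are illustrative rather than the reporter's actual code, and pandas stands in here for the datatable reading logic (gzip compression is omitted from _save for brevity):

    import fsspec
    import pandas as pd  # used purely for illustration; datatable could be swapped in
    from pathlib import PurePosixPath

    from kedro.io import AbstractVersionedDataSet
    from kedro.io.core import Version, get_filepath_str, get_protocol_and_path


    class MyOwnDataSet(AbstractVersionedDataSet):
        def __init__(self, filepath: str, version: Version = None):
            # Note the default of None: the catalog only supplies a version
            # when `versioned: true` is set for the entry.
            protocol, path = get_protocol_and_path(filepath)
            self._protocol = protocol
            self._fs = fsspec.filesystem(self._protocol)
            super().__init__(
                filepath=PurePosixPath(path),
                version=version,
                exists_function=self._fs.exists,
                glob_function=self._fs.glob,
            )

        def _load(self) -> pd.DataFrame:
            # Resolve the (possibly versioned) load path and read the gzipped CSV.
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
            with self._fs.open(load_path, mode="rb") as f:
                return pd.read_csv(f, compression="gzip")

        def _save(self, data: pd.DataFrame) -> None:
            # Resolve the (possibly versioned) save path and write plain CSV.
            save_path = get_filepath_str(self._get_save_path(), self._protocol)
            with self._fs.open(save_path, mode="w") as f:
                data.to_csv(f, index=False)

        def _describe(self):
            return dict(filepath=self._filepath, version=self._version)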

Expected Result

The CSV file is read and its contents printed.

Actual Result

An error is raised:

kedro.io.core.DataSetError: 
__init__() missing 1 required positional argument: 'version'.
DataSet 'test_dt' must only contain arguments valid for the constructor of `v2_kedro.extras.datasets.data_table.MyOwnDataSet`.
[TerminalIPythonApp] WARNING | Unknown error in handling startup files:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/core.py in from_config(cls, name, config, load_version, save_version)
    176         try:
--> 177             data_set = class_obj(**config)  # type: ignore
    178         except TypeError as err:

TypeError: __init__() missing 1 required positional argument: 'version'

The above exception was the direct cause of the following exception:

DataSetError                              Traceback (most recent call last)
~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/shellapp.py in _exec_file(self, fname, shell_futures)
    335                     else:
    336                         # default to python, even without extension
--> 337                         self.shell.safe_execfile(full_filename,
    338                                                  self.shell.user_ns,
    339                                                  shell_futures=shell_futures,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/interactiveshell.py in safe_execfile(self, fname, exit_ignore, raise_exceptions, shell_futures, *where)
   2708             try:
   2709                 glob, loc = (where + (None, ))[:2]
-> 2710                 py3compat.execfile(
   2711                     fname, glob, loc,
   2712                     self.compile if shell_futures else None)

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/utils/py3compat.py in execfile(fname, glob, loc, compiler)
    186     with open(fname, 'rb') as f:
    187         compiler = compiler or compile
--> 188         exec(compiler(f.read(), fname, 'exec'), glob, loc)
    189 
    190 # Refactor print statements in doctests.

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/00-kedro-init.py in <module>
     79 
     80 
---> 81 reload_kedro(project_path)

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    218     # but it's overkill for just that one bit of state.
    219     def magic_deco(arg):
--> 220         call = lambda f, *a, **k: f(*a, **k)
    221 
    222         # Find get_ipython() in the caller's namespace

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/00-kedro-init.py in reload_kedro(path, line, env, extra_params)
     76             "Kedro's ipython session startup script failed:\n%s", str(err)
     77         )
---> 78         raise err
     79 
     80 

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/00-kedro-init.py in reload_kedro(path, line, env, extra_params)
     63         logging.debug("Loading the context from %s", str(path))
     64         context = session.load_context()
---> 65         catalog = context.catalog
     66 
     67         logging.info("** Kedro project %s", str(metadata.project_name))

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py in catalog(self)
    327 
    328         """
--> 329         return self._get_catalog()
    330 
    331     @property

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/context.py in _get_catalog(self, save_version, journal, load_versions)
    372 
    373         hook_manager = get_hook_manager()
--> 374         catalog = hook_manager.hook.register_catalog(  # pylint: disable=no-member
    375             catalog=conf_catalog,
    376             credentials=conf_creds,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/hooks.py in __call__(self, *args, **kwargs)
    284                     stacklevel=2,
    285                 )
--> 286         return self._hookexec(self, self.get_hookimpls(), kwargs)
    287 
    288     def call_historic(self, result_callback=None, kwargs=None, proc=None):

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/manager.py in _hookexec(self, hook, methods, kwargs)
     91         # called from all hookcaller instances.
     92         # enable_tracing will set its own wrapping function at self._inner_hookexec
---> 93         return self._inner_hookexec(hook, methods, kwargs)
     94 
     95     def register(self, plugin, name=None):

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/manager.py in <lambda>(hook, methods, kwargs)
     82             )
     83         self._implprefix = implprefix
---> 84         self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
     85             methods,
     86             kwargs,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/callers.py in _multicall(hook_impls, caller_kwargs, firstresult)
    206                 pass
    207 
--> 208         return outcome.get_result()

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/callers.py in get_result(self)
     78             ex = self._excinfo
     79             if _py3:
---> 80                 raise ex[1].with_traceback(ex[2])
     81             _reraise(*ex)  # noqa
     82 

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/callers.py in _multicall(hook_impls, caller_kwargs, firstresult)
    185                         _raise_wrapfail(gen, "did not yield")
    186                 else:
--> 187                     res = hook_impl.function(*args)
    188                     if res is not None:
    189                         results.append(res)

~/Devel/engr-datascience/rate_prediction/v2_kedro/src/v2_kedro/hooks.py in register_catalog(self, catalog, credentials, load_versions, save_version, journal)
     52         journal: Journal,
     53     ) -> DataCatalog:
---> 54         return DataCatalog.from_config(
     55             catalog, credentials, load_versions, save_version, journal
     56         )

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/data_catalog.py in from_config(cls, catalog, credentials, load_versions, save_version, journal)
    326 
    327             ds_config = _resolve_credentials(ds_config, credentials)
--> 328             data_sets[ds_name] = AbstractDataSet.from_config(
    329                 ds_name, ds_config, load_versions.get(ds_name), save_version
    330             )

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/core.py in from_config(cls, name, config, load_version, save_version)
    177             data_set = class_obj(**config)  # type: ignore
    178         except TypeError as err:
--> 179             raise DataSetError(
    180                 f"\n{err}.\nDataSet '{name}' must only contain arguments valid for the "
    181                 f"constructor of `{class_obj.__module__}.{class_obj.__qualname__}`."

DataSetError: 
__init__() missing 1 required positional argument: 'version'.
DataSet 'test_dt' must only contain arguments valid for the constructor of `v2_kedro.extras.datasets.data_table.MyOwnDataSet`.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

datajoely commented 3 years ago

Hi @naarkhoo, would you be able to post your MyOwnDataSet implementation? I have a sneaking suspicion you haven't implemented all of the required methods defined in the interface.

Are you using an IDE? I personally use PyCharm as it gives you a nudge to implement the bits that are missing:

[screenshot: PyCharm prompting to implement the missing abstract methods]

When I add the super class call, it generates the following constructor signature:

[screenshot: the generated constructor signature, with the two version arguments highlighted]

I think the error you are getting refers to one of the two version arguments I've highlighted below.

__init__() missing 1 required positional argument: 'version'.
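
To make that concrete, the constructor from the docs example looks roughly like the sketch below (illustrative, not the reporter's code). The two occurrences of version marked in the comments are the ones referred to above:

    def __init__(self, filepath: str, version: Version = None):
        # 1) the dataset's own `version` parameter; giving it a default of
        #    None is what prevents "missing 1 required positional argument"
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol)
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,  # 2) the `version` forwarded to the base class
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )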

datajoely commented 3 years ago

Also, off topic: if you are looking to read *.csv.gz files, you can do this natively with the existing pandas.CSVDataSet dataset.
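
For example, a catalog entry along these lines should work out of the box, since pandas infers the gzip compression from the file extension (filepath reused from the original report):

    test_dt:
      type: pandas.CSVDataSet
      filepath: data/01_raw/myfile.csv.gz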

naarkhoo commented 3 years ago

Thanks Joel,

I was just copy/pasting the example at https://kedro.readthedocs.io/en/latest/kedro.io.AbstractVersionedDataSet.html

The problem was that I was using version instead of versioned in my YAML file; I thought the YAML keyword was the same as the constructor argument in the class. Now it is fixed, and I am going ahead with the changes to make it read the data as a datatable.
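
For reference, the corrected catalog entry ends up along these lines (same paths as in the original report):

    test_dt:
      type: v2_kedro.extras.datasets.data_table.MyOwnDataSet
      filepath: data/01_raw/myfile.csv.gz
      versioned: true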

Also, it seems the parameter versioned cannot be set to false; only true appears to be accepted. It does not work even if I remove it from the YAML file (it complains). Perhaps this behaviour deserves another ticket.
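
A likely explanation for that last point (an assumption, not verified against the reporter's implementation): the catalog only passes a version argument to the dataset constructor when versioned: true is set, so a constructor that declares version without a default raises the same TypeError whenever versioning is switched off or the key is omitted. Declaring it with a default avoids that:

    def __init__(self, filepath: str, version: Version = None):
        # With a default of None the dataset also loads unversioned,
        # i.e. when `versioned` is false or absent from the catalog entry.
        ...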