kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Apache License 2.0

Can't reproduce the AbstractVersionedDataSet example #842

Closed naarkhoo closed 3 years ago

naarkhoo commented 3 years ago


Short: I can't reproduce the AbstractVersionedDataSet example provided.

I was trying to read files as datatable. I have also been getting another error related to the other example, saying the object doesn't have a .load attribute.

Steps to Reproduce

  1. Make a file structure similar to what is described (src/package_name/extras/datasets).

  2. Make a YAML file like:

    type: v2_kedro.extras.datasets.data_table.MyOwnDataSet
    filepath: data/01_raw/myfile.csv.gz
    version: '1'
  3. [And so on...]

Expected Result

The CSV file is read and printed.

Actual Result

__init__() missing 1 required positional argument: 'version'.
DataSet 'test_dt' must only contain arguments valid for the constructor of `v2_kedro.extras.datasets.data_table.MyOwnDataSet`.
[TerminalIPythonApp] WARNING | Unknown error in handling startup files:
TypeError                                 Traceback (most recent call last)
~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/ in from_config(cls, name, config, load_version, save_version)
    176         try:
--> 177             data_set = class_obj(**config)  # type: ignore
    178         except TypeError as err:

TypeError: __init__() missing 1 required positional argument: 'version'

The above exception was the direct cause of the following exception:

DataSetError                              Traceback (most recent call last)
~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/ in _exec_file(self, fname, shell_futures)
    335                     else:
    336                         # default to python, even without extension
--> 337               ,
    338                                        ,
    339                                                  shell_futures=shell_futures,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/ in safe_execfile(self, fname, exit_ignore, raise_exceptions, shell_futures, *where)
   2708             try:
   2709                 glob, loc = (where + (None, ))[:2]
-> 2710                 py3compat.execfile(
   2711                     fname, glob, loc,
   2712                     self.compile if shell_futures else None)

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/utils/ in execfile(fname, glob, loc, compiler)
    186     with open(fname, 'rb') as f:
    187         compiler = compiler or compile
--> 188         exec(compiler(, fname, 'exec'), glob, loc)
    190 # Refactor print statements in doctests.

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/ in <module>
---> 81 reload_kedro(project_path)

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/ in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/IPython/core/ in <lambda>(f, *a, **k)
    218     # but it's overkill for just that one bit of state.
    219     def magic_deco(arg):
--> 220         call = lambda f, *a, **k: f(*a, **k)
    222         # Find get_ipython() in the caller's namespace

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/ in reload_kedro(path, line, env, extra_params)
     76             "Kedro's ipython session startup script failed:\n%s", str(err)
     77         )
---> 78         raise err

~/Devel/engr-datascience/rate_prediction/v2_kedro/.ipython/profile_default/startup/ in reload_kedro(path, line, env, extra_params)
     63         logging.debug("Loading the context from %s", str(path))
     64         context = session.load_context()
---> 65         catalog = context.catalog
     67"** Kedro project %s", str(metadata.project_name))

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/ in catalog(self)
    328         """
--> 329         return self._get_catalog()
    331     @property

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/framework/context/ in _get_catalog(self, save_version, journal, load_versions)
    373         hook_manager = get_hook_manager()
--> 374         catalog = hook_manager.hook.register_catalog(  # pylint: disable=no-member
    375             catalog=conf_catalog,
    376             credentials=conf_creds,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in __call__(self, *args, **kwargs)
    284                     stacklevel=2,
    285                 )
--> 286         return self._hookexec(self, self.get_hookimpls(), kwargs)
    288     def call_historic(self, result_callback=None, kwargs=None, proc=None):

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in _hookexec(self, hook, methods, kwargs)
     91         # called from all hookcaller instances.
     92         # enable_tracing will set its own wrapping function at self._inner_hookexec
---> 93         return self._inner_hookexec(hook, methods, kwargs)
     95     def register(self, plugin, name=None):

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in <lambda>(hook, methods, kwargs)
     82             )
     83         self._implprefix = implprefix
---> 84         self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
     85             methods,
     86             kwargs,

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in _multicall(hook_impls, caller_kwargs, firstresult)
    206                 pass
--> 208         return outcome.get_result()

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in get_result(self)
     78             ex = self._excinfo
     79             if _py3:
---> 80                 raise ex[1].with_traceback(ex[2])
     81             _reraise(*ex)  # noqa

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/pluggy/ in _multicall(hook_impls, caller_kwargs, firstresult)
    185                         _raise_wrapfail(gen, "did not yield")
    186                 else:
--> 187                     res = hook_impl.function(*args)
    188                     if res is not None:
    189                         results.append(res)

~/Devel/engr-datascience/rate_prediction/v2_kedro/src/v2_kedro/ in register_catalog(self, catalog, credentials, load_versions, save_version, journal)
     52         journal: Journal,
     53     ) -> DataCatalog:
---> 54         return DataCatalog.from_config(
     55             catalog, credentials, load_versions, save_version, journal
     56         )

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/ in from_config(cls, catalog, credentials, load_versions, save_version, journal)
    327             ds_config = _resolve_credentials(ds_config, credentials)
--> 328             data_sets[ds_name] = AbstractDataSet.from_config(
    329                 ds_name, ds_config, load_versions.get(ds_name), save_version
    330             )

~/opt/miniconda3/envs/kedro/lib/python3.8/site-packages/kedro/io/ in from_config(cls, name, config, load_version, save_version)
    177             data_set = class_obj(**config)  # type: ignore
    178         except TypeError as err:
--> 179             raise DataSetError(
    180                 f"\n{err}.\nDataSet '{name}' must only contain arguments valid for the "
    181                 f"constructor of `{class_obj.__module__}.{class_obj.__qualname__}`."

__init__() missing 1 required positional argument: 'version'.
DataSet 'test_dt' must only contain arguments valid for the constructor of `v2_kedro.extras.datasets.data_table.MyOwnDataSet`.

datajoely commented 3 years ago

Hi @naarkhoo would you be able to post your MyOwnDataSet implementation? I have a sneaking suspicion you haven't implemented all of the required methods defined in the interface.
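To illustrate the suspicion above: as far as I know, Kedro's dataset base class expects subclasses to provide `_load`, `_save` and `_describe`. The sketch below uses a hypothetical stand-in base class built on Python's `abc` module (it is not Kedro's code, and `DataSetInterface`, `IncompleteDataSet` and `CompleteDataSet` are made-up names) to show how a missing method surfaces when the class is instantiated.

```python
from abc import ABC, abstractmethod


class DataSetInterface(ABC):
    """Hypothetical stand-in for a dataset base class; not Kedro's code."""

    @abstractmethod
    def _load(self):
        ...

    @abstractmethod
    def _save(self, data):
        ...

    @abstractmethod
    def _describe(self):
        ...


class IncompleteDataSet(DataSetInterface):
    """Implements _load only; _save and _describe are missing."""

    def _load(self):
        return "data"


class CompleteDataSet(DataSetInterface):
    """Implements every required method, so it can be instantiated."""

    def __init__(self):
        self._data = None

    def _load(self):
        return self._data

    def _save(self, data):
        self._data = data

    def _describe(self):
        return {"kind": "in-memory stand-in"}


try:
    IncompleteDataSet()
except TypeError as err:
    # Python refuses to build a class with unimplemented abstract methods.
    print(f"Cannot instantiate: {err}")

complete = CompleteDataSet()
complete._save([1, 2, 3])
print(complete._load())
```

An IDE that understands abstract base classes can flag the missing methods before the code is ever run, which is the nudge mentioned below.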

Are you using an IDE? I personally use PyCharm as it gives you a nudge to implement the bits that are missing:


When I add the super class call, it generates the following constructor signature:


I think the error you are getting refers to one of the two version arguments I've highlighted below.

__init__() missing 1 required positional argument: 'version'.

datajoely commented 3 years ago

Also, off topic: if you are looking to read *.csv.gz files, you can do this natively with the existing pandas.CSVDataSet dataset.
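A minimal sketch of what such a catalog entry might look like, written as a Python dict with the same keys that would go in the YAML catalog. The dataset name `test_dt` and the file path are taken from the report above. Setting `compression` in `load_args` is an assumption: it may be needed if the dataset hands pandas an open file handle instead of a path, and can be dropped if pandas infers it in your version.

```python
# Python-dict form of a catalog entry; the same keys would go in the YAML catalog.
catalog_config = {
    "test_dt": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/myfile.csv.gz",
        # Assumption: compression is stated explicitly rather than inferred.
        "load_args": {"compression": "gzip"},
    }
}

print(catalog_config["test_dt"]["type"])
```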

naarkhoo commented 3 years ago

Thanks Joel,

I was just copy/pasting the example.

The problem was that I was using version instead of versioned in my YAML file. I thought the YAML keyword was the same as the argument in the class. Now it is fixed, and I am going ahead with changes to make it read as datatable.
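The mix-up can be sketched with a simplified model of catalog parsing. This is an assumed approximation, not Kedro's actual implementation: `build_dataset` is a hypothetical helper, `Version` is a stand-in for `kedro.io.Version`, and the two rules marked as assumptions reflect my understanding that a literal `version` key is reserved and dropped, while `versioned: true` is what produces the `version` constructor argument.

```python
from collections import namedtuple

# Stand-in for kedro.io.Version (a pair of load and save version strings).
Version = namedtuple("Version", ["load", "save"])


class MyOwnDataSet:
    """Constructor shaped like the one in the report: `version` is required."""

    def __init__(self, filepath, version):
        self.filepath = filepath
        self.version = version


def build_dataset(class_obj, config, load_version=None, save_version=None):
    """Simplified, assumed model of catalog parsing; not Kedro's real code."""
    config = dict(config)
    # Assumption: a literal `version` key is reserved and silently dropped.
    config.pop("version", None)
    # Assumption: `versioned: true` is what becomes the `version` argument.
    if config.pop("versioned", False):
        config["version"] = Version(load_version, save_version)
    return class_obj(**config)


path = "data/01_raw/myfile.csv.gz"

try:
    build_dataset(MyOwnDataSet, {"filepath": path, "version": "1"})
except TypeError as err:
    # The constructor never receives `version`, so it complains.
    print(f"version: '1' fails: {err}")

ds = build_dataset(MyOwnDataSet, {"filepath": path, "versioned": True})
print(ds.version)
```

Under these assumptions, the YAML key and the constructor argument are deliberately different names, which matches the fix described above.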

It also seems the parameter versioned cannot be set to false; only true works. It does not work even if I remove it from the YAML file (it complains). Perhaps this behavior deserves another ticket.
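One possible explanation, not confirmed in this thread: if the constructor declares `version` as a required positional argument and the catalog only supplies it when `versioned` is true, then setting the flag to false or omitting it leaves the argument missing. A sketch of a workaround is to give `version` a default of `None`. The class below is a plain stand-in; a real dataset would subclass `kedro.io.AbstractVersionedDataSet` and forward `version` to `super().__init__()`.

```python
class MyOwnDataSet:
    """Stand-in for the custom dataset, with `version` made optional."""

    def __init__(self, filepath, version=None):
        self.filepath = filepath
        self.version = version  # None is treated here as "not versioned"


# Constructor arguments as they might arrive when `versioned` is false or absent.
config = {"filepath": "data/01_raw/myfile.csv.gz"}

unversioned = MyOwnDataSet(**config)
print(unversioned.version is None)
```

With the default in place, the same class can be constructed with or without a version.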