tamsanh / kedro-great

The easiest way to integrate Kedro and Great Expectations
MIT License

Support for in-memory datasets #1

Closed akruszewski closed 4 years ago

akruszewski commented 4 years ago

@tamsanh Do you have plans to support in-memory datasets?

Context

After setting up kedro-great for Kedro (kedro great init) and running the pipeline with kedro run, I get the following error:

 Traceback (most recent call last):
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/bin/kedro", line 8, in <module>
     sys.exit(main())
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 633, in main
     cli_collection()
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 829, in __call__
     return self.main(*args, **kwargs)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 782, in main
     rv = self.invoke(ctx)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
     return _process_result(sub_ctx.command.invoke(sub_ctx))
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
     return ctx.invoke(self.callback, **ctx.params)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 610, in invoke
     return callback(*args, **kwargs)
   File "/mnt/c/dev/kedro-and-kubeflow/kedro_cli.py", line 231, in run
     pipeline_name=pipeline,
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/context/context.py", line 699, in run
     raise error
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/context/context.py", line 691, in run
     run_result = runner.run(filtered_pipeline, catalog, run_id)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
     self._run(pipeline, catalog, run_id)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
     run_node(node, catalog, self._is_async, run_id)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 213, in run_node
     node = _run_node_sequential(node, catalog, run_id)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 245, in _run_node_sequential
     run_id=run_id,
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/hooks.py", line 286, in __call__
     return self._hookexec(self, self.get_hookimpls(), kwargs)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/manager.py", line 93, in _hookexec
     return self._inner_hookexec(hook, methods, kwargs)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/manager.py", line 87, in <lambda>
     firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 208, in _multicall
     return outcome.get_result()
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 80, in get_result
     raise ex[1].with_traceback(ex[2])
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall
     res = hook_impl.function(*args)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro_great/kedro_great.py", line 88, in after_node_run
     self._run_validation(catalog, outputs, run_id)
   File "/home/USER_NAME/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro_great/kedro_great.py", line 102, in _run_validation
     dataset_path = str(dataset._filepath)
 AttributeError: 'MemoryDataSet' object has no attribute '_filepath'

After a quick skim through the repo, I figured out (correct me if I'm wrong) that only datasets which have a _filepath attribute are supported.
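To illustrate the failure mode in the traceback above, here is a minimal sketch that does not import Kedro. `FileBackedDataSet`, `InMemoryDataSet`, and `dataset_path_or_none` are hypothetical names made up for this example; only the `_filepath` attribute comes from the traceback. It shows one way a lookup could tolerate datasets without a file path, and is not the plugin's actual code.

```python
class FileBackedDataSet:
    """Hypothetical stand-in for a file-based dataset; it stores a _filepath."""

    def __init__(self, filepath):
        self._filepath = filepath


class InMemoryDataSet:
    """Hypothetical stand-in for an in-memory dataset; it has no _filepath."""

    def __init__(self, data=None):
        self._data = data


def dataset_path_or_none(dataset):
    """Return the dataset's path as a string, or None if it has no _filepath.

    Accessing dataset._filepath directly raises AttributeError for the
    in-memory stand-in; getattr with a default avoids that.
    """
    filepath = getattr(dataset, "_filepath", None)
    return str(filepath) if filepath is not None else None
```

With this guard, a caller can decide what to do when the result is None instead of crashing on the attribute access.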

tamsanh commented 4 years ago

@akruszewski Actually, I think it is possible. Do you have an example pipeline/validation that you're using?

akruszewski commented 4 years ago

@tamsanh Unfortunately I can't share it, but the scenario for most nodes is:

Let me know if you need more info.

tamsanh commented 4 years ago

@akruszewski I just pushed a new version of the repo. Try doing a pip install -U kedro-great. You should get 0.2.2, which will support datasets that do not have a _filepath attribute.

akruszewski commented 4 years ago

@tamsanh With your change I'm still getting an error. This time:


Traceback (most recent call last):
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 633, in main
    cli_collection()
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/mnt/c/dev/kedro-and-kubeflow/kedro_cli.py", line 230, in run
    pipeline_name=pipeline,
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/context/context.py", line 699, in run
    raise error
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/framework/context/context.py", line 691, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
    self._run(pipeline, catalog, run_id)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, catalog, self._is_async, run_id)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 213, in run_node
    node = _run_node_sequential(node, catalog, run_id)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/runner/runner.py", line 245, in _run_node_sequential
    run_id=run_id,
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro_great/kedro_great.py", line 88, in after_node_run
    self._run_validation(catalog, outputs, run_id)
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro_great/kedro_great.py", line 103, in _run_validation
    df = dataset.load()
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/io/core.py", line 213, in load
    return self._load()
  File "/home/kruszewa/miniconda3/envs/kedro-and-kubeflow-env/lib/python3.7/site-packages/kedro/io/memory_data_set.py", line 81, in _load
    raise DataSetError("Data for MemoryDataSet has not been saved yet.")
kedro.io.core.DataSetError: Data for MemoryDataSet has not been saved yet.

My solution is to replace dataset.load() with dataset_value when the _filepath attribute is None. I'm still not very familiar with your plugin, so I'm not sure whether this would break anything. Anyway, here is the PR: https://github.com/tamsanh/kedro-great/pull/2
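A rough sketch of that idea, written without importing Kedro so it runs on its own. The classes and the function name `resolve_validation_data` are hypothetical stand-ins; only `dataset.load()`, `dataset_value`, and `_filepath` come from the discussion. It assumes, as the second traceback suggests, that the in-memory dataset has not been saved yet when the after_node_run hook fires, so the node's output value has to be used directly.

```python
class DataSetNotSavedError(Exception):
    """Hypothetical stand-in for the error raised when nothing was saved yet."""


class InMemoryDataSet:
    """Hypothetical in-memory dataset: no _filepath, load() fails before save()."""

    def __init__(self):
        self._saved = False
        self._data = None

    def save(self, data):
        self._data = data
        self._saved = True

    def load(self):
        if not self._saved:
            raise DataSetNotSavedError("Data has not been saved yet.")
        return self._data


class FileBackedDataSet:
    """Hypothetical file-based dataset; load() returns a canned value here."""

    def __init__(self, filepath, stored):
        self._filepath = filepath
        self._stored = stored

    def load(self):
        return self._stored


def resolve_validation_data(dataset, dataset_value):
    """Pick the data to validate for one node output.

    If the dataset has no _filepath (or it is None), use the value the node
    just produced; otherwise load it from the dataset as before.
    """
    if getattr(dataset, "_filepath", None) is None:
        return dataset_value
    return dataset.load()
```

For example, `resolve_validation_data(InMemoryDataSet(), [1, 2, 3])` returns the list that was passed in, without calling `load()` on the unsaved dataset.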