kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[KED-1367] Integration with great-expectations #207

Closed: EigenJT closed this issue 3 years ago

EigenJT commented 4 years ago

Description

I have been using kedro for a little while now for data engineering/cleaning. A standard step in these processes is testing the data at different steps of the pipeline. To do this, I've been using great-expectations to write expectations pipelines that are essentially slotted in between different cleaning/engineering steps. It would be great to have a way to point a kedro.io dataset type towards a suite of expectations, as defined in great_expectations.

Context

Testing data is a pretty essential step of data pipelines. great-expectations offers a really nice suite of tools for communicating and testing what is expected out of a dataset/pipeline.

Possible Implementation

(Optional) The first method that jumps to mind is extending dataset types in kedro.io to use expectation suites. In particular, this could be done by extending the _save() method to run a set of expectations on a dataset every time it is saved, as well as saving the results of the run for use in great_expectations' visualization features. The location of an expectation suite would be another attribute added when defining the dataset in the Data Catalog, in the same way as filepath: data/..., i.e. expectation_suites: - .../... (see the sketch below).
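A minimal sketch of that idea, assuming the kedro.io.CSVLocalDataSet API and a pre-0.11 great_expectations release whose PandasDataset.validate() accepts a suite loaded from JSON; the class name and the expectation_suite attribute are hypothetical:

import json
from typing import Any

import great_expectations as ge
import pandas as pd
from kedro.io import CSVLocalDataSet

class CSVDataSetWithExpectations(CSVLocalDataSet):  # hypothetical dataset type
    def __init__(self, filepath: str, expectation_suite: str, **kwargs: Any):
        super().__init__(filepath, **kwargs)
        self._expectation_suite = expectation_suite  # path to a suite JSON file

    def _save(self, data: pd.DataFrame) -> None:
        with open(self._expectation_suite) as f:
            suite = json.load(f)
        # Run the suite before persisting; the result could also be written
        # out for great_expectations' visualization features.
        result = ge.dataset.PandasDataset(data).validate(expectation_suite=suite)
        if not result["success"]:
            raise ValueError("Expectation suite failed for %s" % self._filepath)
        super()._save(data)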

Possible Alternatives

(Optional) No idea where to start here, but an alternative path may be a plugin.

yetudada commented 4 years ago

Hi @EigenJT! This is fantastic to see. We've been checking out Great Expectations for a while and the team is interested in working with us.

It actually sounds like a transformer might be best suited to modify the _save() method on your dataset: https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html#transforming-datasets

And these transformers could be added to Kedro's contrib library so that they're more accessible for you. But I'll chat to the Kedro and Great Expectations team about this.
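For illustration, such a transformer might look something like this (a minimal sketch, assuming the kedro.io.AbstractTransformer interface; the class name and the example expectation are made up):

from typing import Any, Callable

import great_expectations as ge
from kedro.io import AbstractTransformer

class ValidatingTransformer(AbstractTransformer):  # hypothetical name
    def save(self, data_set_name: str, save: Callable[[Any], None], data: Any) -> None:
        # Wrap the outgoing pandas DataFrame in a GE dataset and run an
        # expectation before delegating to the dataset's own save.
        result = ge.dataset.PandasDataset(data).expect_table_row_count_to_be_between(min_value=1)
        print(data_set_name, result)
        save(data)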

EigenJT commented 4 years ago

@yetudada That's great to hear! I didn't know about transformers, so I'll have a look and give it a go. Otherwise, if you guys need any help I'd love to contribute in any way I can!

yetudada commented 4 years ago

@EigenJT You can definitely take a stab at it in the meantime. We'd appreciate the help!

JasonLeungQB commented 4 years ago

I will take a stab at incorporating Great Expectations into Kedro with Deepyaman Datta.

JasonLeungQB commented 4 years ago

Deepyaman and I did some testing; a possible implementation would be to add a transformer in _create_catalog. In the transformer, we can import great_expectations and log the results.

In my development version, I'm using the SubdirReaderGenerator from great_expectations.yml and have to hard-code the data_asset_name in order for Great Expectations to find the dataset that I want it to validate. Do you think it makes sense to create another Great Expectations generator that reads the Kedro catalog and uses the dataset name?

Eventually, we would like to add this functionality to the Kedro codebase.
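A minimal sketch of how that registration could look, assuming a Kedro 0.15/0.16-era ProjectContext subclassing kedro.context.KedroContext and the hypothetical ValidatingTransformer sketched earlier in the thread:

from kedro.context import KedroContext

class ProjectContext(KedroContext):
    def _create_catalog(self, *args, **kwargs):
        catalog = super()._create_catalog(*args, **kwargs)
        # Every subsequent load/save on any dataset now passes through the
        # transformer, which runs great_expectations and logs the results.
        catalog.add_transformer(ValidatingTransformer())
        return catalog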

EigenJT commented 4 years ago

@JasonLeungQB I've got a minimally working transformer that uses the Kedro DataCatalog to inform great_expectations where datasets are by passing the filepath to the great_expectations DataContext.normalize_data_asset_name method. Works alright.

As for adding the actual transformer, I've modified the _create_catalog method in the Project Context. I'll commit some stuff soon...
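Roughly, the lookup amounts to something like this (a sketch; normalize_data_asset_name exists in pre-0.11 great_expectations, and the path here is illustrative):

import great_expectations as ge

context = ge.data_context.DataContext()
# The filepath from the Kedro catalog maps onto GE's
# datasource/generator/asset naming: data/01_raw/test_data_set_name
# resolves against the "data" datasource and the "01_raw" generator.
normalized = context.normalize_data_asset_name("data/01_raw/test_data_set_name")
print(normalized)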

deepyaman commented 4 years ago

@EigenJT Great to hear that you've got something working (even if minimally)! Look forward to seeing it.

Can you share what your catalog looks like, or at least what generator you're using? I'm curious what kinds of assumptions (if any) you're making in using normalize_data_asset_name, as that was kind of where we got to, too.

JasonLeungQB commented 4 years ago

@EigenJT Thank you for the update. Our first version tests the data as a pandas DataFrame; according to the documentation, we can pass an in-memory pandas DataFrame via the batch_kwargs without using a generator.
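For example (a sketch, assuming a pre-0.11 DataContext, a PandasDatasource named "data", and an existing suite named "iris"; all names are illustrative):

import great_expectations as ge
import pandas as pd

context = ge.data_context.DataContext()
df = pd.read_csv("data/01_raw/iris.csv")
# Pass the in-memory DataFrame directly via batch_kwargs; no generator is
# needed to locate the file on disk.
batch = context.get_batch(
    batch_kwargs={"dataset": df, "datasource": "data"},
    expectation_suite_name="iris",
)
print(batch.validate())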

EigenJT commented 4 years ago

@deepyaman Here's a snippet of the Kedro catalog.yml. In order to use great_expectations, I had the transformer essentially replicate some of the functionality from great_expectations init. This included creating a great_expectations.yml, with the appropriate datasources/generators set up. As long as data is in the Kedro data folder great_expectations should be able to access it.

catalog.yml

test_data_set_name:  # top-level dataset name, inferred from the filepath below
  type: CSVLocalDataSet
  filepath: "data/01_raw/test_data_set_name.csv"
  expectations_suite:
    filepath: "data/00_expectations_suites/test_expectations_suite_name.json"
    break_on_failure: False

great_expectations.yml

datasources:
  data:
    class_name: PandasDatasource
    data_asset_type:
      class_name: PandasDataset
    generators:
      01_raw:
        class_name: SubdirReaderGenerator
        base_directory: data/01_raw
        reader_options:
          sep:
          engine: python
      02_intermediate:
        class_name: SubdirReaderGenerator
        base_directory: data/02_intermediate
        reader_options:
          sep:
          engine: python

So with this catalog.yml and great_expectations.yml, any dataset can be found as long as its filepath looks like data/01_raw/.../filename.csv (or the same under the other directories in the data directory).

deepyaman commented 4 years ago

@EigenJT Thanks for sharing. I see that you're using SubdirReaderGenerator (with whatever reader_options are set) to generate batches. This works fine as long as each dataset can be read using reader_options and falls under the set of file types supported by Great Expectations. Since Kedro's data catalog defines more specific ways to read data, I was thinking that it might make sense to put a KedroDataCatalogGenerator on the roadmap that, given a catalog directory, generates batches configured based on each Kedro DataSet's load options. Given your experience, do you think that would make sense?

Alternatively--and what may be even simpler--one can configure a Great Expectations dataset directly in a transformer based on the loaded data type (Pandas DataFrame, Spark DataFrame, etc.). I think @JasonLeungQB got that working and can share a snippet shortly.

JasonLeungQB commented 4 years ago

@EigenJT We got it working on the in-memory pandas dataset in the transformer. In order to apply GE's built-in Expectations, we need to convert the data to a GE dataset.

Here is a snippet of the code with two test Expectations (in the transformer) using the Iris.csv dataset.

from typing import Any, Callable

import great_expectations as ge
from kedro.io import AbstractTransformer

class GreatExpectationsTransformer(AbstractTransformer):  # illustrative class name
    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        data = load()
        # Wrap the loaded pandas DataFrame in a GE dataset so the
        # built-in Expectations become available on it.
        df_ge = ge.dataset.PandasDataset(data)
        # Column 4 of the Iris data holds the species: it must only contain
        # the three known values, and sepal_length must not be null.
        permit_subset = ['setosa', 'virginica', 'versicolor']
        print(df_ge.expect_column_values_to_be_in_set(df_ge.get_table_columns()[4], permit_subset))
        print(df_ge.expect_column_values_to_not_be_null('sepal_length'))
        return data

EigenJT commented 4 years ago

@deepyaman Yeah, right now my implementation only works with datasets that can be read in using pandas. I think the KedroDataCatalogGenerator is a great idea. I've committed everything to my repo, so you can have a look. The README is a little sparse, but it all works: https://github.com/EigenJT/kedro/tree/develop/kedro/contrib/great_expectations_transformer It's definitely more on the simple side, and a little hacked together. I was considering writing some kind of mapping between great_expectations datasets and Kedro datasets, but I'm not sure where to start, exactly.

yetudada commented 4 years ago

@EigenJT @JasonLeungQB @deepyaman You'll be excited to hear that we have built a kedro-ge plugin internally, which we will be open-sourcing. We're actually talking to the Great Expectations team about this work, and it will be a collaborative effort when we release the plugin.

@ZainPatelQB @tsanikgr are behind this amazing work.

EigenJT commented 4 years ago

@yetudada That's great to hear, I can't wait to see it!

neomatrix369 commented 4 years ago

This is super cool, would love to see this feature integrated.

mzjp2 commented 4 years ago

Quick update on the work we're doing here. We did an internal release a week or two ago, with support for validating datasets as part of your pipeline and declaring the actions you want taken in a config file.

We're dogfooding this intensely at the moment, with lots of internal feedback and are planning on doing quite a lot more work, but inching closer to open sourcing!

GE is also releasing a pretty big breaking 0.11 change soon which we'll have to spend some time catching up with. Glad to see the enthusiasm in this thread!

turn1a commented 4 years ago

@yetudada @mzjp2 Any updates on open-sourcing the plugin? I've been considering writing a Kedro & Great Expectations integration, but I don't want to repeat the work if it's already done. I would happily become a beta tester if such a possibility exists.

fgsilva commented 4 years ago

@kaemo same here. Still, from what I've read, I am not sure if the plugin will support all types of datasets, including SparkDataSet. Perhaps @yetudada and @mzjp2 could clarify?

Minyus commented 4 years ago

@tamsanh has open-sourced Kedro hooks for using Great Expectations as kedro-great. I'm not sure about the difference between kedro-great and kedro-ge.

mzjp2 commented 4 years ago

Hey all, we're still testing things internally and have been seeing promising results. To clarify, we support all pandas.* datasets alongside SparkDataSet (@fgsilva)

Our general plan is to structure a sprint around productionising and getting the internal kedro-ge plugin ready for open-source, but I don't have anything more concrete to share right now 🙈

noklam commented 4 years ago

Checking in: will this plugin be released soon?

noklam commented 4 years ago

Is there more up-to-date information about this integration? I am trying @tamsanh's kedro-great at the moment for a PoC.

Are there any tricky parts of integrating with GE that I should pay attention to?

laisbsc commented 4 years ago

Hey there @noklam, thank you for reaching out!

We are currently assessing our roadmap and will share news on the integration release when we have it.

Regarding kedro-great, you might find this video helpful: it is a walkthrough of the first steps of using the plugin.

Let us know if you run into any issues.

Chat soon!

aprams commented 3 years ago

@laisbsc Yet another ping - sorry! I tried kedro-great, which is fine but still rather limited in functionality, considering the broad range of Kedro datasets. Do you have a rough timeline/gut feeling on the release roadmap?

Considering Kedro positions itself as "Software Engineering best practices meet data pipelines", a solid (data) testing framework would contribute greatly to that vision, from my point of view.

merelcht commented 3 years ago

We'll close this ticket for now, as this is not something we would be building in the foreseeable future. We'll re-open it if our roadmap changes. In the meantime, we encourage users to check out kedro-great or try other community-built plugins.