Closed · shakti-garg closed this issue 2 years ago
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any chance we could re-open this? Seems like a useful feature!
I raised this feature request last year because it was needed for a work-related project. We later implemented an in-house custom solution in Kafka Streams using the great_expectations core package, but unfortunately we were not able to open-source it.
@dingobar could you also share the functional use case you think this feature would fit?
@jcampbell @abegong Do you think this is still a useful and needed feature? If so, I can contribute to its development!
In our enterprise, Kafka is the immutable backbone of our data infrastructure. It would be great to automatically test this data without moving it out of Kafka. Does this fit GE's use case?
This is a common question from members of the Great Expectations community. We'd love to see more ideas on how this could be implemented. Currently, it's not at the top of the backlog for the core team, but community ideas and contributions would be welcome.
@bhcastleton In my opinion, before starting on an implementation, we should be on the same page about the outcome/use case we are trying to achieve. Could you help with a specific problem statement that other community members are looking to solve?
@dingobar On the same note, could you be more specific about the end-to-end use case? For instance, if we were able to test streaming data in real time, what business value are you aiming for? What would the feedback loop look like?
Let me share my experience. About a year ago we implemented this internally for a use case, using the GE library in a Kafka Streams app. It was very cool to publish quality metrics in real time, and it helped consumers assess the quality of a streaming dataset. Over time, though, one hard reality became evident: because the streaming data is immutable, it was hard for the producer or consumer to react to any quality issue or fix it. In short, it became just a monitoring tool that raises alerts when the quality of a dataset drops below a benchmark.
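The monitoring pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the expectation lambdas stand in for real great_expectations checks, the `batch` list stands in for a micro-batch polled from a Kafka consumer, and the `BENCHMARK` threshold and field names are made up for the example.

```python
# Sketch of the pattern: validate streaming records in micro-batches and
# publish a pass-rate metric (raising an alert below a benchmark) rather
# than trying to mutate the immutable stream itself.

BENCHMARK = 0.95  # hypothetical quality threshold for alerting

# Stand-ins for great_expectations-style checks (e.g. not-null, value range).
expectations = [
    lambda r: r.get("user_id") is not None,
    lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
]

def validate_batch(records):
    """Return the fraction of records satisfying every expectation."""
    if not records:
        return 1.0
    passed = sum(all(check(r) for check in expectations) for r in records)
    return passed / len(records)

# Stand-in for one micro-batch polled from a Kafka topic.
batch = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": None, "amount": 5.0},   # fails the not-null expectation
    {"user_id": 2, "amount": 7.5},
]

pass_rate = validate_batch(batch)
if pass_rate < BENCHMARK:
    print(f"ALERT: pass rate {pass_rate:.2%} below benchmark {BENCHMARK:.0%}")
```

In a real Kafka Streams deployment the metric would be emitted to a metrics topic or monitoring system on every window, which is exactly the "alerts only" limitation noted above.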
Hi @shakti-garg, thanks for sharing your experience.
Part of our stream-governance mandate is that we should be able to assess the quality of our data at each touchpoint of our data systems. I agree that since Kafka is an immutable distributed log store we can't change the original data, but on the flip side our compliance rules do mandate auditable records of stream data history, measurable metrics around data quality, and lineage gathering.
I must say that being able to gauge the quality of data in motion, and having ways to actually park undesired data into some set of buckets, would be desirable. GE may only cover a subset of these goals as a framework, but it would still be useful to have it implemented as an injectable library for Kafka.
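The "park undesired data into buckets" idea above is essentially a dead-letter routing pattern. Below is a small sketch under assumed names: the `checks` dict stands in for configured expectations, and the buckets are plain lists here, where a real deployment would publish to separate Kafka topics (e.g. a dead-letter topic) via a producer.

```python
# Sketch of dead-letter routing: records failing validation are parked in a
# "dead_letter" bucket together with the names of the failed checks, leaving
# the valid stream clean while keeping an auditable trail of bad data.

def route(record, checks, buckets):
    """Append the record to 'valid' or 'dead_letter' based on the checks."""
    failed = [name for name, check in checks.items() if not check(record)]
    if failed:
        buckets["dead_letter"].append({"record": record, "failed": failed})
    else:
        buckets["valid"].append(record)

# Hypothetical checks and records for illustration.
checks = {
    "event_type_present": lambda r: "event_type" in r,
    "ts_is_int": lambda r: isinstance(r.get("ts"), int),
}
buckets = {"valid": [], "dead_letter": []}

for rec in [{"event_type": "click", "ts": 1}, {"ts": "oops"}]:
    route(rec, checks, buckets)
```

Recording *which* checks failed alongside each parked record is what makes the bucket auditable, which fits the compliance requirement mentioned above.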
Is your feature request related to a problem? Please describe. Some teams are publishing their data via Kafka topics. We want to run data quality validations over them to compute quality metrics at run time, which can then be aggregated over time periods to provide a historical view.
Describe the solution you'd like Extend the data source abstraction with a Kafka-specific data source.
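One possible shape for such a data source, sketched with stdlib only: drain a bounded slice of a topic into an in-memory batch that the validation engine can treat like any other tabular asset. Everything here is hypothetical, including the `KafkaBatchSource` class name (it is not part of GE); the iterator stands in for a real consumer such as kafka-python's `KafkaConsumer`, and the resulting rows could be handed to great_expectations as a pandas batch.

```python
import json

class KafkaBatchSource:
    """Hypothetical sketch of a Kafka-backed batch source for validation."""

    def __init__(self, consumer, max_records=1000):
        self.consumer = consumer      # any iterable of raw message values
        self.max_records = max_records

    def get_batch(self):
        """Deserialize up to max_records JSON messages into a list of rows."""
        rows = []
        for raw in self.consumer:
            rows.append(json.loads(raw))
            if len(rows) >= self.max_records:
                break
        return rows

# Stand-in for messages polled from a topic.
fake_consumer = iter(['{"id": 1}', '{"id": 2}', '{"id": 3}'])
batch = KafkaBatchSource(fake_consumer, max_records=2).get_batch()
```

Bounding the batch (by record count, or in a real implementation by offset range or time window) is what turns an unbounded topic into something a batch-oriented validation engine can run expectations against.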