open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.27k stars 996 forks

Enum List Distribution Profiling #13494

Open yogocik opened 11 months ago

yogocik commented 11 months ago

Is your feature request related to a problem? Please describe.

I want to see the count or percentage of each distinct (enum-like) value in a column. For example, say I have a product_order table with a status column that contains COMPLETED, FAILED, and PENDING. I want to know what percentage of rows has each status.

Describe the solution you'd like

As a starting point, we could create or integrate a query that identifies all distinct values in a column and then counts the rows for each.
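To make the idea concrete, here is a minimal sketch of such a query, run against an in-memory SQLite database. It is not OpenMetadata profiler code: the function name value_distribution is hypothetical, and the sample rows are made up for illustration; only the product_order table and status column come from the example above.

```python
import sqlite3


def value_distribution(conn, table, column):
    """Return {value: (count, percent)} for each distinct value in a column.

    Hypothetical helper for illustration; not part of OpenMetadata.
    """
    # Identifiers cannot be bound as SQL parameters, so they are quoted inline.
    # A real implementation would validate them against the table schema.
    query = (
        f'SELECT "{column}" AS value, COUNT(*) AS n '
        f'FROM "{table}" GROUP BY "{column}" ORDER BY n DESC'
    )
    rows = conn.execute(query).fetchall()
    total = sum(n for _, n in rows)
    if total == 0:
        return {}
    return {value: (n, round(100.0 * n / total, 2)) for value, n in rows}


# Made-up sample data: 6 COMPLETED, 1 FAILED, 3 PENDING.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_order (id INTEGER PRIMARY KEY, status TEXT)")
statuses = ["COMPLETED"] * 6 + ["FAILED"] * 1 + ["PENDING"] * 3
conn.executemany(
    "INSERT INTO product_order (status) VALUES (?)", [(s,) for s in statuses]
)

distribution = value_distribution(conn, "product_order", "status")
for value, (count, pct) in distribution.items():
    print(f"{value}: {count} rows ({pct}%)")
```

A single GROUP BY pass returns both the distinct values and their counts, so the percentages can be derived without a second scan of the table.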

Describe alternatives you've considered

We could define some constraints, such as only allowing varchar/string columns and requiring the distinct count metric to be enabled. Alternatively, we could let users supply a custom query that returns the expected result. We could also add related features, such as a dictionary that serves as an enum store and integrates with other functionality, for example a TestCase that returns a sample of values (a distinct/unique list) to be registered in the dictionary.
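The type and distinct-count constraints could be expressed as a small eligibility check along these lines. This is only a sketch: is_enum_candidate, STRING_TYPES, and the max_distinct cap are assumptions introduced here, not existing OpenMetadata names or settings, and a missing distinct count is treated as "metric not enabled".

```python
# Hypothetical set of column types treated as string-like.
STRING_TYPES = {"VARCHAR", "CHAR", "STRING", "TEXT"}


def is_enum_candidate(data_type, distinct_count, max_distinct=20):
    """Decide whether a column should get a value-distribution profile.

    data_type: column type name, compared case-insensitively.
    distinct_count: result of the distinct count metric, or None if that
        metric was not computed for the column.
    max_distinct: assumed upper bound on distinct values (illustrative).
    """
    if data_type.upper() not in STRING_TYPES:
        return False
    if distinct_count is None:
        # Without the distinct count metric there is nothing to gate on.
        return False
    return 0 < distinct_count <= max_distinct


print(is_enum_candidate("varchar", 3))    # status-like column
print(is_enum_candidate("INT", 3))        # non-string type
print(is_enum_candidate("STRING", None))  # distinct count not available
```

Capping the number of distinct values keeps the profile small for free-text columns, where a per-value breakdown would be large and not very informative.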

Additional context

Slack thread: https://openmetadata.slack.com/archives/C02B6955S4S/p1696821110580209

Although this metric is not crucial, it would be helpful to have information about the distribution of distinct values.

joegoldbeck commented 11 months ago

This would be useful! For one of our applications, we want to access the distinct values for a column from the catalog, but the (approximate) % of each value would be even better.