open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

When profiling Snowflake, allow users to choose between system and bernoulli sampling. #8428

Open hendrix04 opened 1 year ago

hendrix04 commented 1 year ago

Is your feature request related to a problem? Please describe. Task #8336 changed the profiler to use BERNOULLI sampling so that views could be sampled. It turns out that the performance hit from this is much larger than one might have expected: one of my table scans went from 1 minute to 13 minutes because of the change.

Describe the solution you'd like When profiling a Snowflake table, allow users to choose which type of sampling they want to use (SYSTEM or BERNOULLI).
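To make the request concrete, here is a minimal sketch of what a user-selectable sampling method could look like when building the profiler's sample query. The `SampleMethod` enum and `build_sample_query` helper are hypothetical names used for illustration, not part of the OpenMetadata codebase; only the `SAMPLE BERNOULLI (p)` / `SAMPLE SYSTEM (p)` clause is Snowflake syntax.

```python
from enum import Enum


class SampleMethod(Enum):
    """Sampling methods a user could pick (hypothetical config values)."""

    BERNOULLI = "BERNOULLI"  # row-level: each row is kept with probability p/100
    SYSTEM = "SYSTEM"        # block-level: each block of rows is kept with probability p/100


def build_sample_query(table: str, percent: float, method: SampleMethod) -> str:
    """Return a SELECT that samples `table` using the chosen method.

    `table` is interpolated as-is, so it must come from a trusted source.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return f"SELECT * FROM {table} SAMPLE {method.value} ({percent})"


print(build_sample_query("db.schema.orders", 10, SampleMethod.SYSTEM))
print(build_sample_query("db.schema.orders", 10, SampleMethod.BERNOULLI))
```

Keeping BERNOULLI as the default would preserve the behavior introduced in #8336 (views can still be sampled), while SYSTEM would be an opt-in for large tables where scan time matters more.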

Describe alternatives you've considered Spending a lot of money on long running queries ;)

TeddyCr commented 1 year ago

@hendrix04 thanks for opening this. I think users should be able to choose (i.e. we should not force one or the other). While BERNOULLI takes longer to run, SYSTEM has one big drawback: not all rows have the same probability of being picked in the sample (which is not the case with BERNOULLI).

As both have tradeoffs, we should let the user decide. 🙂 I'll circle back with more details.
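The tradeoff can be illustrated with a toy model, assuming that SYSTEM keeps or drops whole blocks of rows while BERNOULLI decides row by row. The block size and helper names below are made up for the illustration; real block boundaries are decided by Snowflake.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Row-level sampling: keep each row independently with probability p."""
    return [r for r in rows if rng.random() < p]


def block_sample(rows, p, block_size, rng):
    """Block-level sampling: keep each contiguous block with probability p.

    A block is taken all-or-nothing, so neighbouring rows always end up
    in the sample together or not at all.
    """
    sample = []
    for start in range(0, len(rows), block_size):
        if rng.random() < p:
            sample.extend(rows[start:start + block_size])
    return sample


rng = random.Random(0)
rows = list(range(1000))  # stand-in for table rows, stored in order

by_row = bernoulli_sample(rows, 0.1, rng)
by_block = block_sample(rows, 0.1, block_size=100, rng=rng)

print(len(by_row), len(by_block))
```

In this model the block sample only has to touch the selected blocks, which is why it is cheaper, but its size and contents swing much more from run to run, and values that are clustered in storage are either over- or under-represented.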

hendrix04 commented 1 year ago

@TeddyCr, Any more details here?

TeddyCr commented 1 year ago

Hey @hendrix04, we'll need to do a few things:

  1. Update the profiler metadata pipeline JSON schema [here]. Here we'll need to add 2 things:
    • Add an Include views field. You can check what we do for databaseServiceMetadataPipeline.json. From there you can update filter_entities to exclude views if views should be excluded. The table entity has a table type field (see table.json).

https://github.com/open-metadata/OpenMetadata/blob/158bd4b9cd5a7fae71dd10d3de1dc7520fa659d3/ingestion/src/metadata/orm_profiler/api/workflow.py#L318-L353
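As a rough sketch of the filtering step described above (not the actual `filter_entities` implementation), excluding views could look like the following. The `Table` dataclass, the `include_views` flag, and the set of view type names are stand-ins assumed for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed names for view-like table types; check table.json for the real values.
VIEW_TYPES = {"View", "SecureView", "MaterializedView"}


@dataclass
class Table:
    """Minimal stand-in for the table entity."""

    name: str
    table_type: str


def filter_views(tables: Iterable[Table], include_views: bool) -> List[Table]:
    """Drop view-like tables unless `include_views` is set."""
    if include_views:
        return list(tables)
    return [t for t in tables if t.table_type not in VIEW_TYPES]


tables = [Table("orders", "Regular"), Table("orders_v", "View")]
print([t.name for t in filter_views(tables, include_views=False)])
```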

Let me know if you have other questions.

hendrix04 commented 1 year ago

@TeddyCr, do you think I should do #8429 in the same PR as this task?

TeddyCr commented 1 year ago

Let's tackle them separately. 🙂

nakaken-churadata commented 1 month ago

Can I try this issue for my next challenge?

ayush-shah commented 1 month ago

Sure @nakaken-churadata, go ahead. Let me know if you need any help or more context here.