opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0
22 stars 33 forks

Add `sample` parameter to `top` & `rare` command #879

Closed YANG-DB closed 1 week ago

YANG-DB commented 2 weeks ago

Description

Add a new `sample` parameter to the `top` and `rare` commands. It reduces the number of scanned data points and allows a `top` or `rare` statement to return an approximation when faster, sample-based results are preferred over exact, long-running ones.

source = testTable | rare address sample(50 percent)
source = testTable | top 5 address by country sample(25 percent)
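
For illustration only, the behavior these queries suggest can be sketched in plain Python. This is not the PR's implementation: the function names `sampled_top` and `sampled_rare`, the choice of independent per-row (Bernoulli) sampling, and the scaling of sampled counts back to full-table estimates are all assumptions made for the sketch.

```python
import random
from collections import Counter


def _sampled_counts(rows, field, percent, seed):
    """Count values of `field` over a Bernoulli sample of `rows` (hypothetical helper)."""
    rng = random.Random(seed)
    keep = percent / 100.0
    # Keep each row independently with probability `keep`.
    sample = [row for row in rows if rng.random() < keep]
    return Counter(row[field] for row in sample), keep


def sampled_top(rows, field, n, percent, seed=0):
    """Approximate the n most frequent values of `field`."""
    counts, keep = _sampled_counts(rows, field, percent, seed)
    # Scale sampled counts up to estimate the counts in the full table.
    return [(value, round(c / keep)) for value, c in counts.most_common(n)]


def sampled_rare(rows, field, n, percent, seed=0):
    """Approximate the n least frequent values of `field`."""
    counts, keep = _sampled_counts(rows, field, percent, seed)
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    return [(value, round(c / keep)) for value, c in ordered[:n]]


if __name__ == "__main__":
    table = (
        [{"address": "A"}] * 600
        + [{"address": "B"}] * 300
        + [{"address": "C"}] * 100
    )
    print(sampled_top(table, "address", 2, 50))
    print(sampled_rare(table, "address", 1, 50))
```

One consequence of this scheme is that a value absent from the sample cannot appear in the output at all, so `rare` is more sensitive to sampling than `top`: the least frequent values are exactly the ones most likely to be dropped.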

Issues Resolved

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

LantaoJin commented 2 weeks ago

One high-level question: how do we determine the relationship between sampling percentage and precision? In other words, how much precision is lost when the sampling rate drops from 100% to 80%, or from 80% to 50%?

I'm wondering what kind of scenario would need to run `top` on sampled data.
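
One rough way to frame the precision question above, assuming simple random sampling of rows without replacement (an assumption; the thread does not say how sampling would be done): if a value accounts for a true share `p` of `N` rows and a fraction `f` of rows is sampled, the estimated share has a standard error of approximately `sqrt((1 - f) * p * (1 - p) / (f * N))`. The helper name `share_standard_error` is hypothetical.

```python
import math


def share_standard_error(p, total_rows, percent):
    """Approximate standard error of a value's estimated share under sampling.

    Assumes simple random sampling without replacement; the (1 - f) factor is
    the finite population correction, so the error vanishes at 100 percent.
    """
    f = percent / 100.0
    sampled_rows = f * total_rows
    return math.sqrt((1.0 - f) * p * (1.0 - p) / sampled_rows)


if __name__ == "__main__":
    for percent in (100, 80, 50, 25):
        se = share_standard_error(p=0.1, total_rows=1_000_000, percent=percent)
        print(f"{percent:>3} percent -> standard error {se:.6f}")
```

Under these assumptions, a value holding 10% of a million rows has a standard error of about 0.00015 at 80 percent sampling and about 0.0003 at 50 percent. The ranking produced by `top` is only at risk when two values' true shares differ by an amount comparable to that error, and the error grows as the table gets smaller.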

YANG-DB commented 1 week ago

Closing, since this has not yet been shown to have a significant use case.