Add `sample` parameter to `top` & `rare` commands

YANG-DB commented 2 weeks ago

Description

Add a new `sample` parameter to the `top` and `rare` commands. It reduces the number of scanned data points and returns an approximate `top` or `rare` result, for cases where a faster, sample-based answer is preferred over an exact, long-running one.

source = testTable | rare address sample(50 percent)
source = testTable | top 5 address by country sample(25 percent)
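
To illustrate the idea behind the queries above, here is a minimal Python sketch of an approximate `top` over a row sample. It is not the code in this PR: the function name `approximate_top` and its parameters are hypothetical, and it assumes Bernoulli sampling (each row kept independently with probability `percent / 100`), with sampled counts scaled back up to estimate full-table counts.

```python
import random
from collections import Counter


def approximate_top(values, n, percent, seed=None):
    """Estimate the n most frequent values from a Bernoulli sample of the rows.

    Hypothetical helper for illustration only; not part of the PPL implementation.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    fraction = percent / 100.0
    rng = random.Random(seed)
    # Keep each row independently with probability `fraction`.
    sampled = [v for v in values if rng.random() < fraction]
    counts = Counter(sampled)
    # Scale sampled counts back up to estimate the full-table counts.
    return [(value, count / fraction) for value, count in counts.most_common(n)]


addresses = ["a"] * 5 + ["b"] * 3 + ["c"]
# With percent=100 every row is kept, so the result is exact.
exact = approximate_top(addresses, 2, 100)
# With a smaller percent the result depends on the random sample drawn.
approx = approximate_top(addresses, 2, 50, seed=7)
```

A `rare` variant would take the least frequent sampled values instead. Values that occur only a few times may be missing from the sample entirely, so sampling is likely to distort `rare` more than `top`.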

Issues Resolved

https://github.com/opensearch-project/opensearch-spark/issues/740

Check List

[x] Updated documentation (docs/ppl-lang/README.md)
[x] Implemented unit tests
[x] Implemented tests for combination with other commands
[x] New added source code should include a copyright header
[x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

LantaoJin commented 2 weeks ago

One high-level question: how do we determine the relationship between sampling percentage and precision? In other words, how much precision is lost when sampling is decreased from 100% to 80%, or from 80% to 50%?

I'm also wondering what kind of scenario needs to run `top` on sampled data.
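
One rough way to quantify the percentage and precision question, assuming Bernoulli sampling (an assumption, not something this PR states): if a value occurs `true_count` times and each row is kept with probability `f`, the scaled-up count estimate has relative standard error `sqrt((1 - f) / (f * true_count))`. The sketch below uses a hypothetical helper name and only evaluates that formula.

```python
import math


def relative_standard_error(true_count, percent):
    """Relative standard error of a scaled-up count under Bernoulli sampling.

    Hypothetical helper: assumes each row is kept independently with
    probability percent / 100 and the sampled count is divided by that
    probability to estimate the true count.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    f = percent / 100.0
    return math.sqrt((1.0 - f) / (f * true_count))


# Error for a value that occurs 100 times, at three sampling rates.
errors = {p: relative_standard_error(100, p) for p in (100, 80, 50)}
```

For a value with 100 occurrences this gives 0 at 100%, 0.05 at 80%, and 0.10 at 50%. The error grows as the sampling rate drops and is larger for less frequent values, which is why `rare` results would be less reliable than `top` results at the same rate.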

YANG-DB commented 1 week ago

Closing, since this has not yet been shown to have a significant use case.
