opensearch-project / opensearch-benchmark

OpenSearch Benchmark - a community-driven, open source project to run performance tests for OpenSearch
https://opensearch.org/docs/latest/benchmark/
Apache License 2.0

Make output metrics extendable #199

Open jmazanec15 opened 2 years ago

jmazanec15 commented 2 years ago

Is your feature request related to a problem? Please describe.

For the k-NN plugin, I am working on adding a custom runner that executes queries from a numeric data set and calculates the recall. The k-NN plugin offers an assortment of Approximate Nearest Neighbor algorithms. Generally, users need to make tradeoffs between the accuracy of the approximate results and latency/throughput, so they need to see both kinds of metrics when benchmarking.

In the custom query runner, I return the recall alongside the latency, but this only gets stored as request metadata, not as a reported result. A sketch of such a runner is shown below.
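For context, here is a minimal sketch of the kind of custom runner described above, assuming the usual OpenSearch Benchmark extension point (an async runner registered via `register_runner` in `workload.py`). The operation name, parameters such as `ground_truth_ids`, and the recall computation are illustrative placeholders, not the actual k-NN implementation.

```python
# workload.py -- illustrative sketch, not the actual k-NN plugin runner.

async def vector_search_with_recall(opensearch, params):
    body = params["body"]
    # Hypothetical parameter carrying the true nearest-neighbor IDs for this query.
    ground_truth = set(params["ground_truth_ids"])
    k = params.get("k", 10)

    response = await opensearch.search(index=params["index"], body=body, size=k)
    retrieved = {hit["_id"] for hit in response["hits"]["hits"]}

    recall = len(retrieved & ground_truth) / max(len(ground_truth), 1)

    # Extra keys returned here are currently stored only as request meta-data,
    # which is exactly the limitation this issue describes.
    return {"weight": 1, "unit": "ops", "recall": recall}


def register(registry):
    registry.register_runner("vector-search-with-recall", vector_search_with_recall, async_runner=True)
```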

Describe the solution you'd like

I would like the ability to specify that "recall" should be output as a metric in the results and to define its aggregation as the mean.

More generally, I would like to be able to define custom metrics for runners, specify how they are aggregated, and have them show up in the results.
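To make the request concrete, here is one purely hypothetical shape such an API could take. Nothing below exists in OpenSearch Benchmark today; `register_metric` and its arguments are invented solely to illustrate the desired capability.

```python
# HYPOTHETICAL -- illustration of the requested feature, not an existing API.

def register(registry):
    registry.register_runner("vector-search-with-recall", vector_search_with_recall, async_runner=True)
    # Invented call: declare "recall" (returned by the runner above) as a
    # first-class result metric, aggregated as the mean across all requests.
    registry.register_metric("recall", aggregation="mean")
```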

Describe alternatives you've considered

  1. Query the stored request metadata and aggregate the metric myself -- this requires a lot of manual effort (a sketch of this workaround follows below)
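A rough sketch of workaround 1: pull the per-request metadata out of the metrics store and aggregate it by hand. This assumes OSB is configured with an OpenSearch metrics store; the index pattern and the `meta.recall` / `test-execution-id` field names are assumptions based on how the runner sketched above attaches its extra keys.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

resp = client.search(
    index="benchmark-metrics-*",  # assumed metrics-store index pattern
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"test-execution-id": "<your-test-execution-id>"}},
                    {"exists": {"field": "meta.recall"}},
                ]
            }
        },
        "aggs": {"mean_recall": {"avg": {"field": "meta.recall"}}},
    },
)
print("mean recall:", resp["aggregations"]["mean_recall"]["value"])
```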

Additional context

  1. https://github.com/opensearch-project/k-NN/pull/409
  2. https://github.com/opensearch-project/opensearch-benchmark/issues/103
gkamat commented 2 years ago

This is going to require flexibility in how the results metrics are defined, computed, processed and reported. It will take some consideration.

jmazanec15 commented 2 years ago

Right, I guess there are a few other applications I can think of that may require similar functionality: Anomaly Detection, Learning to Rank. For these, recall/accuracy are KPIs.

amitgalitz commented 2 years ago

+1 on this. Extendable metrics would help Anomaly Detection as well. We are starting to define how we benchmark AD in various ways, such as our own execution time to produce an anomaly result, recall/precision, and other KPIs, both on our own specific workloads and while a detector is running. I also want to add that this would greatly benefit ML-Commons as well.

cgchinmay commented 8 months ago

@IanHoang as discussed offline, taking a look at this issue

peteralfonsi commented 6 months ago

Added a new issue, https://github.com/opensearch-project/opensearch-benchmark/issues/435, which would allow the user to specify the percentiles they want to see; that is a subset of this issue.
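For reference, the user-specified percentiles described in #435 amount to post-processing along these lines, sketched here with numpy over a placeholder list of latency samples:

```python
import numpy as np

# Placeholder: per-request latency samples in milliseconds.
latency_ms = [12.1, 13.4, 11.8, 25.0, 14.2, 13.9, 40.3, 12.7]

# Percentiles the user asks to see (the subset of this issue that #435 covers).
requested = [50, 90, 99, 99.9]
for p in requested:
    print(f"p{p}: {np.percentile(latency_ms, p):.2f} ms")
```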