Open navneet1v opened 1 year ago
I'm wondering if, as part of this, we should add search relevance metrics/workloads to OSB. For instance, for the text-based queries, one key question this would answer is when to use what and what the tradeoffs are. We could have a generic OSB run where the input/output stays constant (e.g., datasets from BEIR) and we just change the internal implementation. When a new method comes in (e.g., a reranker, or different combination logic such as RRF), we can plug it into the OSB configuration, run the test, and see where it stacks up.
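To make that concrete, here is a minimal sketch of the "hold the dataset constant, swap only the query construction" idea, assuming a Python client (`opensearch-py`), a BEIR-style queries/qrels dump already loaded into dicts, and placeholder index, field, and model IDs (none of these names come from this issue):

```python
# Hypothetical sketch: same queries and qrels, different query implementations,
# compared on a relevance metric (NDCG@10). Index name, field name, and model_id
# are placeholders, not values from this issue.
import math
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def bm25_query(text, k=10):
    return {"size": k, "query": {"match": {"text": text}}}

def neural_query(text, k=10):
    # neural-search query clause; "passage_embedding" and model_id are placeholders
    return {"size": k, "query": {"neural": {"passage_embedding": {
        "query_text": text, "model_id": "<model_id>", "k": 100}}}}

def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """qrels_for_query: dict of doc_id -> graded relevance for one query."""
    dcg = sum(qrels_for_query.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_doc_ids[:k]))
    ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(build_query, queries, qrels, index="beir-corpus", k=10):
    """queries: query_id -> text; qrels: query_id -> {doc_id: relevance}."""
    scores = []
    for qid, text in queries.items():
        hits = client.search(index=index, body=build_query(text, k))["hits"]["hits"]
        ranked = [hit["_id"] for hit in hits]
        scores.append(ndcg_at_k(ranked, qrels.get(qid, {}), k))
    return sum(scores) / len(scores) if scores else 0.0

# Same inputs, different implementations:
# evaluate(bm25_query, queries, qrels) vs. evaluate(neural_query, queries, qrels)
```

A new method (reranker, RRF, hybrid scoring) would only need another `build_query`-style function or pipeline configuration; the dataset and the metric stay fixed.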
@jmazanec15 the idea of this issue is to serve as a high-level tracker for adding the benchmarks. What should be used to run them, OSB or something else, is not yet decided, and I left it open. If we start using OSB then yes, we need to get search relevance metrics into OSB, but we should work with the OSB team to provide a capability to collect these custom metrics.
+1. I think the first priority is to come up with benchmarks that provide a baseline for the quality of search. Regarding OSB as an implementation platform, I'm not so sure. It is implemented in Ruby and focuses on stress testing, while we are more trying to define metrics of quality. For that, even small data sets can do just fine, and we could run them as part of IT tests; something like the embedded JMH framework would seem a more native solution to the task.
> I think the first priority is to come up with benchmarks that provide a baseline for the quality of search.
Yes, definitely agree with this.
> It is implemented in Ruby and focuses on stress testing, while we are more trying to define metrics of quality.
OSB is actually written in Python, so it should be more friendly with existing data sets.
> For that, even small data sets can do just fine, and we could run them as part of IT tests; something like the embedded JMH framework would seem a more native solution to the task.
That's interesting. I'm not super familiar with it, but it could make sense; it'd be nice to have as an integ test. I guess I like OSB because it would (1) be easier to integrate into automated performance testing infrastructure and metric publishing, and (2) let users test relevance on their own clusters more easily (i.e., just point the OSB workload, or a custom workload, at their cluster and let it run). But maybe it makes sense to do both.
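As a rough idea of what the "custom workload" route could look like, here is a hedged sketch of a custom runner in a workload's `workload.py`, using OSB's runner-registration mechanism; the parameter names (`index`, `body`, `relevant_ids`, `k`) are placeholders, and whether the extra returned key surfaces in published results depends on how the metrics store is configured:

```python
# Hypothetical sketch of a custom OSB runner that reports a relevance metric
# (recall@k against known relevant doc ids) alongside the usual latency numbers.
# Parameter names below are placeholders, not from an existing workload.
async def query_with_recall(opensearch, params):
    k = params.get("k", 10)
    body = dict(params["body"], size=k)
    response = await opensearch.search(index=params["index"], body=body)
    retrieved = {hit["_id"] for hit in response["hits"]["hits"]}
    relevant = set(params.get("relevant_ids", []))
    recall = len(retrieved & relevant) / len(relevant) if relevant else 0.0
    # Extra keys returned here are intended to be picked up as custom metrics.
    return {"weight": 1, "unit": "ops", "recall_at_k": recall}

def register(registry):
    registry.register_runner("query-with-recall", query_with_recall, async_runner=True)
```

A user could then point the workload at their own cluster and get both performance and relevance numbers from the same run.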
There has been a PR added for doing text_embeddings benchmarks: https://github.com/opensearch-project/opensearch-benchmark-workloads/pull/232/files
Description
The aim of this issue is to write performance and accuracy benchmarks for the different features of the Neural Search plugin.
Tasks