[FEATURE] A new workload for vector embedding and search

opensearch-project / opensearch-benchmark-workloads

Official workloads used by OpenSearch Benchmark (OSB)

https://opensearch.org/docs/latest/benchmark/

[FEATURE] A new workload for vector embedding and search #198

Open vpehkone opened 4 months ago

vpehkone commented 4 months ago

Is your feature request related to a problem?

There is no workload that tests both vector embedding and vector search, for example one that runs a benchmark similar to the neural search tutorial (https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/). The current vector search workload does not generate embeddings, and it requires manually downloading the dataset and converting it to the right format.

What solution would you like?

Create a new workload that uses a pretrained model for vector embedding and executes vector search. This does not require any change to OpenSearch Benchmark or to the official OpenSearch Docker image, as the image already includes the ml-commons, neural-search, and k-NN plugins.
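To make the proposal concrete, here is a minimal sketch of the three request bodies such a workload would likely need, modeled on the neural search tutorial linked above: an ingest pipeline with a `text_embedding` processor, a k-NN index that uses the pipeline by default, and a `neural` query. The field names (`text`, `passage_embedding`), the pipeline name, and the model ID and dimension in the usage example are placeholders chosen for illustration, not part of any existing workload. The sketch only builds the JSON bodies and does not talk to a cluster.

```python
import json
from typing import Any, Dict


def embedding_pipeline(model_id: str, text_field: str = "text",
                       vector_field: str = "passage_embedding") -> Dict[str, Any]:
    """Body for PUT /_ingest/pipeline/<name>: embed text_field at ingest time."""
    return {
        "description": "Generate embeddings at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    # Maps the source text field to the target vector field.
                    "field_map": {text_field: vector_field},
                }
            }
        ],
    }


def index_body(pipeline_name: str, dimension: int, text_field: str = "text",
               vector_field: str = "passage_embedding") -> Dict[str, Any]:
    """Body for PUT /<index>: a k-NN index that runs the pipeline by default."""
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                # The dimension must match the output size of the chosen model.
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def neural_query(model_id: str, query_text: str, k: int = 10,
                 vector_field: str = "passage_embedding") -> Dict[str, Any]:
    """Body for GET /<index>/_search: embed query_text and run a k-NN search."""
    return {
        "size": k,
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        },
    }


if __name__ == "__main__":
    # "my-model-id" and 768 are placeholders; use the ID and output size
    # of the model actually deployed through ml-commons.
    print(json.dumps(embedding_pipeline("my-model-id"), indent=2))
    print(json.dumps(index_body("nlp-ingest-pipeline", 768), indent=2))
    print(json.dumps(neural_query("my-model-id", "what is a benchmark", k=5), indent=2))
```

In a workload, the model ID would have to be obtained at run time after registering and deploying the model, so it is kept as a parameter here rather than hard-coded.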

Good dataset for this workload: https://microsoft.github.io/msmarco/

Documents: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz

Query texts: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
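As a rough sketch of the data preparation this would involve, the snippet below converts tab-separated lines into one JSON document per line, which is the shape OpenSearch Benchmark corpora files normally take. It assumes the extracted archive contains a TSV file with an ID in the first column and the passage text in the second; the file name in the usage note and the output field names `passage_id` and `text` are illustrative assumptions.

```python
import json
from typing import Iterable, Iterator


def tsv_to_json_lines(lines: Iterable[str], id_field: str = "passage_id",
                      text_field: str = "text") -> Iterator[str]:
    """Yield one JSON document per non-empty '<id>\\t<text>' input line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        doc_id, _, text = line.partition("\t")
        yield json.dumps({id_field: doc_id, text_field: text}, ensure_ascii=False)


def convert_file(src_path: str, dst_path: str) -> int:
    """Stream src_path (TSV) into dst_path (JSON lines); return the document count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for doc in tsv_to_json_lines(src):
            dst.write(doc + "\n")
            count += 1
    return count
```

For example, `convert_file("collection.tsv", "documents.json")` would stream the passages without loading the whole file into memory. The same function could be reused for the query texts by passing different field names.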

I can implement this and open a PR.

VijayanB commented 4 months ago

@vpehkone We recently created a new workload for vector search https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in hdf5 format and use it in this workload. However, we don't support any dataset out of the box like nyc taxi at this moment. Please let us know if you see any gap in using this workload for your use case.

ylwu-amzn commented 4 months ago

@vpehkone Thanks. The ml-commons plugin maintainer team is busy with other tasks right now, but they can come back to this later. If you have bandwidth, feel free to contribute a benchmark for vector embedding generation.

akashsha1 commented 4 months ago

Hi @VijayanB - Vesa and I are on the same team at Intel. We plan to run the vectorsearch workload you pointed to as well. We're also running the neural search embedding benchmark and would like to add that to the benchmark repo. Our goal is to analyze the pipeline and identify optimization opportunities where we can add value to OpenSearch.

Having a vector search pipeline both with and without embedding generation will let us dive deep into two key benchmarks relevant to OpenSearch. Let us know if there are any other scenarios that would be useful to analyze and optimize.

VijayanB commented 4 months ago

@akashsha1 Thanks for the clarification. Having a new workload for neural search is definitely a good idea. As @ylwu-amzn mentioned, feel free to send a PR. Thanks.