Open vpehkone opened 4 months ago
@vpehkone We recently created a new workload for vector search https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in hdf5 format and use it in this workload. However, we don't support any dataset out of the box like nyc taxi at this moment. Please let us know if you see any gap in using this workload for your use case.
@vpehkone Thanks, for ml-commons plugin, the maintainer team busy with some other tasks now. They can come back on this task later. Or if you have bandwidth, feel free to contribute for vector embedding generation benchmarking.
@vpehkone We recently created a new workload for vector search https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in hdf5 format and use it in this workload. However, we don't support any dataset out of the box like nyc taxi at this moment. Please let us know if you see any gap in using this workload for your use case.
Hi @VijayanB - Vesa and I are in the same team at Intel. We plan to run the vectorsearch workload you pointed to as well. We're also running the neural search embedding benchmark, and would like to add that to the benchmark repo. Our goal is to analyze the pipeline, and identify optimization opportunities where we can add value to OpenSearch.
Having a vector search pipeline - with, and without embedding generation will allow us to dive deep into two key benchmarks relevant to OpenSearch. Let us know if there's any other scenarios which would be useful to analyze, and optimize.
@akashsha1 Thanks for clarification. Having new workload for neural search is definitely a good idea. Like @ylwu-amzn mentioned, feel free to send out PR. Thanks
Is your feature request related to a problem?
There is not any workload that would test vector search and vector embedding. E.g. run a similar benchmark test as this neural search tutorial (https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/) does. The current vector search workload does not do vector embedding and requires manually downloading the dataset and converting it to the right format.
What solution would you like?
Create a new workload that uses a pretrained model for vector embedding and executes vector search. This does not require any change to the OpenSearch-Benchmark either the official OpenSearch docker image as there are already ml-common, neural-search and KNN-plugins.
Good dataset for this workload: https://microsoft.github.io/msmarco/ Documents: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz Query texts: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
I can implement this and do PR.