opensearch-project / opensearch-benchmark

OpenSearch Benchmark - a community driven, open source project to run performance tests for OpenSearch
https://opensearch.org/docs/latest/benchmark/
Apache License 2.0

Support for multiple files for bulk operation #473

Open himanshu-amazon opened 5 months ago

himanshu-amazon commented 5 months ago

Is your feature request related to a problem?

We want to use multiple files in the source-file parameter of the corpora of a custom workload I am using with OSB. Currently, OSB does not provide this feature.

Describe the solution you'd like

We want the ability to pass multiple file names or a regex (e.g. "*" meaning all files at a path) in the source-file parameter of the corpora. For example, in the corpora below we pass the base-url of an S3 bucket, and OSB should pick up all files at that location when we pass "source-file": "*". We should also not need to pass document-count, since counting the total number of documents across all files is a pain.

"corpora": [ { "name": "movies", "documents": [ { "source-file": "*", "base-url": "s3:////", "document-count": 1 } ] } ]

Describe alternatives you've considered

I added a separate entry to the corpora for each file, as shown below; however, that becomes difficult when we have hundreds of files.

"corpora": [ { "name": "movies", "documents": [ { "source-file": "test.json.zip", "base-url": "s3:////", "document-count": 1 } ] }, { "name": "movies1", "documents": [ { "source-file": "test.1.json.zip", "base-url": "s3:////", "document-count": 1 } ] } ],

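Until something like this is supported, the per-file entries could be generated rather than written by hand. The following is only a rough sketch: `build_corpora` and its parameters are hypothetical helpers, not part of OSB, and it assumes the document count for each file is already known. It mirrors the one-corpus-per-file layout shown above and passes the base-url through unchanged.

```python
import json


def build_corpora(file_counts, base_url, name_prefix="movies"):
    """Return one corpus entry per source file, mirroring the layout above.

    file_counts maps each source file name to its document count
    (hypothetical input; OSB does not produce this mapping for you).
    """
    corpora = []
    for i, (source_file, doc_count) in enumerate(file_counts.items()):
        corpora.append({
            # The first corpus keeps the bare prefix; later ones get a numeric suffix.
            "name": name_prefix if i == 0 else f"{name_prefix}{i}",
            "documents": [{
                "source-file": source_file,
                "base-url": base_url,
                "document-count": doc_count,
            }],
        })
    return corpora


if __name__ == "__main__":
    counts = {"test.json.zip": 1, "test.1.json.zip": 1}
    # The base-url is copied as it appears in this thread (bucket and path elided).
    print(json.dumps({"corpora": build_corpora(counts, "s3:////")}, indent=2))
```

The printed JSON can be pasted into the workload file in place of the hand-written entries. It does not remove the need to know each file's document count.
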
IanHoang commented 5 months ago

Thanks @himanshu-amazon for suggesting this!

While I understand why it might be more convenient to pull in files from a path and forgo fields such as document-count, I'm not certain that this would benefit most users and benchmarking scenarios.

OSB was designed as a benchmarking tool that focuses on using structured and reusable workloads to perform tests. Even if OSB were to pull in hundreds of files from an S3 path, it would still need to know which documents need to go to which indices, which differs between use cases and would need to be specified for each file it pulls in.

It's worth noting that OSB doesn't preprocess any of the data being ingested; instead, it requires the corpora files to be written in a specific structure. Fields such as document-count, compressed-bytes, and uncompressed-bytes act as basic guard rails that let OSB validate the corpora before the test begins and track ingestion progress during the test. Additionally, many users treat the values in workload.json as a source of truth: if errors occur during the test and the number of documents ingested differs from what workload.json specifies, those values show what was expected.
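For anyone who finds these fields tedious to fill in by hand, here is a minimal sketch of how they might be computed for a file on local disk. It assumes the uncompressed corpus holds one JSON document per line, so document-count is the number of non-empty lines. `describe_corpus_file` is a hypothetical helper, not an OSB function.

```python
import os
import tempfile


def describe_corpus_file(uncompressed_path, compressed_path=None):
    """Compute the metadata fields for one corpus file on local disk."""
    # Count documents as non-empty lines (assumes one JSON document per line).
    with open(uncompressed_path, "rb") as f:
        doc_count = sum(1 for line in f if line.strip())

    entry = {
        "source-file": os.path.basename(compressed_path or uncompressed_path),
        "document-count": doc_count,
        "uncompressed-bytes": os.path.getsize(uncompressed_path),
    }
    # compressed-bytes is only filled in when a compressed archive is supplied.
    if compressed_path is not None:
        entry["compressed-bytes"] = os.path.getsize(compressed_path)
    return entry


if __name__ == "__main__":
    # Small demo with a throwaway two-document file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "documents.json")
        with open(path, "wb") as f:
            f.write(b'{"title": "a"}\n{"title": "b"}\n')
        print(describe_corpus_file(path))
```

Running this once per file, before uploading, produces the values that OSB would later check against.
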