Added a 1 TB corpus for the big5 workload

opensearch-project / opensearch-benchmark-workloads

Official workloads used by OpenSearch Benchmark (OSB)

https://opensearch.org/docs/latest/benchmark/


Added a 1 TB corpus for the big5 workload #278

Closed gkamat closed 2 months ago

gkamat commented 2 months ago

Description

Added a 1 TB data corpus for the big5 workload.

Testing

Integ tests and a workload run.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rishabh6788 commented 2 months ago

From your testing, can you please add details on approximately how long it takes to download, unzip, and process the corpus for indexing? Also, how long did it take you to index this data?

IanHoang commented 2 months ago

> From your testing, can you please add details on approximately how long it takes to download, unzip, and process the corpus for indexing? Also, how long did it take you to index this data?

Seconding this. It'll probably vary with the user's setup, but a ballpark figure for each step would be good, along with what the corpus was downloaded on (e.g., an EC2 c5.2xlarge instance). This information would be useful for larger workloads that are part of the official workloads repository.

akashsha1 commented 2 months ago

Would it be possible to include when to consider running such a workload?

This would greatly help users determine whether this workload is beneficial to their use case.

gkamat commented 2 months ago

Some details pertaining to running the workload have been added to the README. Note that this information is very preliminary and will be updated in due course.

@akashsha1 Larger data corpora reflect real-life use cases and are more representative of customer workloads. They are also essential for scale testing. However, these scenarios come with additional considerations, some of which will be added to the documentation once this corpus has been appropriately tested.