Closed gkamat closed 2 months ago
From your testing can you please add details on approximately how much time it takes to download, unzip and process the corpus for indexing? Also, how much time it took for you to index this data?
Seconding this. The numbers will vary with the user's environment, but a ball-park figure for each step would be helpful, along with the hardware it was measured on (e.g., an EC2 c5.2xlarge instance). This information would be useful for the larger workloads that are part of the official workloads repository.
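For anyone who wants to produce their own ball-park numbers, a rough sketch of timing the decompress step is below. The file here is synthetic (the real corpus URL is not assumed); substitute the actual corpus archive and scale the measured rate to the full corpus size.

```shell
#!/bin/sh
set -e

# Synthetic stand-in for a corpus archive (~10 MB of random data).
# Replace with the real downloaded corpus file when measuring for real.
head -c 10000000 /dev/urandom > sample-corpus.json
gzip -k -f sample-corpus.json

# Time the decompression; divide file size by elapsed time to get a
# throughput figure you can extrapolate to the 1 TB corpus.
start=$(date +%s)
gzip -d -k -f sample-corpus.json.gz
end=$(date +%s)
echo "decompress took $((end - start)) seconds"
```

The same `date +%s` bracketing works for the download (`curl -O ...`) and indexing phases; recording the instance type alongside the numbers makes them comparable across runs.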
Would it be possible to document when to consider running such a workload?
That would greatly help users determine whether this workload is a good fit for their use case.
Some details pertaining to running the workload have been added to the README. Note that this information is very preliminary and will be updated in due course.
@akashsha1 larger data corpora reflect real-life, customer-representative use cases. They are also essential for scale testing. However, these scenarios come with additional considerations, and some of those will be added to the documentation once this corpus has been appropriately tested.
Description
Added a 1 TB data corpus for the big5 workload.
Testing
Integ tests and a workload run.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.