Open IanHoang opened 3 months ago
Received feedback to add support for pbzip2 compression now that OSB supports it. Will create a separate PR for it.
@IanHoang, it may be helpful to add some child tasks to this issue, since there are multiple items here.
Overview
This issue is based on one of the proposed priorities in this RFC: https://github.com/opensearch-project/opensearch-benchmark/issues/395
Background
As of now, OSB's create-workload is a monolith that uses two modules of functions to create a custom workload. It was designed to be a quick and easy way to build custom workloads from small corpora. While this approach has worked in the past, there is increasing demand for building custom workloads based on complex workloads, and more users are relying on this feature to do so.
Users of this feature have mentioned that the create-workload code is currently difficult to extend and maintain and, for newcomers to OSB, difficult to follow and interpret.
We should rearchitect the code to be more organized and scalable, which in turn will make it easier to extend and maintain. This work will also serve as the foundation for future development, such as extracting a random sampling of the documents and repairing incomplete workloads.
Proposed Design
While the existing approach is considered modular, create-workload in its current state is unwieldy. We have gathered feedback from users who have extended the feature and used it to build custom workloads based on complex production workloads of up to 10 TB. Based on that feedback, we should rearchitect create-workload to have the following components:
Proposed priority
It also makes it difficult for newcomers to understand the code. This approach would promote encapsulation and abstraction, making create-workload more organized and scalable as well as easier to extend and maintain.