[RFC] Create-Workload: Enhance Experience Building Workloads from Existing Cluster Data

IanHoang commented 1 year ago

Synopsis

OSB’s create-workload feature is a useful capability that enables users to distill their workloads into a format that OSB can use. However, based on the experience gained from using it in various scenarios, it is clear that there are nuances that can restrict users from using it effectively. There is potential for the user experience to be improved substantially as well.

This RFC highlights areas in the create-workload feature that require improvements and suggests solutions to address them.

Scope

This RFC solely focuses on improving user and developer experience of building workloads from existing cluster data. A separate RFC should be made for users who want to improve the experience of building workloads from scratch.

Motivation

Many users are interested in creating representative workloads out of existing cluster data and are looking towards OSB’s create-workload feature to accomplish this. As documented in the Creating Custom Workloads user guide in the official documentation, users can either build workloads manually or use OSB to extract data corpora from a cluster and automatically build a workload. Since the create-workload feature cannot deduce the queries that comprise the target workload, it generates a single match-all query as a sample query to collect documents from the cluster. Users are encouraged to supply additional queries in the form of a JSON file.

This is a typical workflow of how OSB generates a workload:

Figure 1: Create-Workload Workflow

In the figure above, a user invokes OSB’s create-workload feature and provides a few parameters detailing the workload specifications, such as the workload name, the cluster to retrieve data corpora from, the indices to extract from the cluster, queries, etc. These are passed to workload_generator.py, which validates any provided queries for proper JSON format, creates a single instance of the opensearch-py client, sequentially extracts the index mapping and data corpora from each specified index, and then compiles all details into a workload.json. Finally, the workload path is printed, and users can use that path to run the workload. If users want to add more queries or additional testing scenarios, they can add them in the directory at the workload path. For more information on the basic anatomy of an OSB workload, see the Workload Anatomy Recap section in the appendix.

Currently, this feature is still evolving and could benefit from several improvements. Additionally, it potentially reinforces the common assumption that the data corpus is all that’s needed to create a representative workload, which is far from true. By nature, workloads are complex because they are influenced by several factors (such as shard size, search and ingest traffic, replica count, cluster configuration, and much more) that exist outside of data corpora. If the community is to rely on OSB as a method to create custom representative workloads, the following areas of improvement must be addressed.

Stakeholders for this RFC

OpenSearch users and developers, the primary consumers of any new and enhanced workloads
OpenSearch developers implementing new features, who may be interested in performance testing them
Commercial managed services that track OpenSearch performance for the releases they offer
Corporate benchmarking teams, who may be interested in benchmarking their use-cases with OpenSearch against other options
OSB developers, responsible for implementing the workload enhancements

Areas of Improvement

To improve the create-workload experience, there are three major areas this RFC covers — Building Workloads, Create-Workload Development Process, and User Documentation.

Building Workloads

The following issues greatly slow down the workload building process.

1. Slow and Inefficient single-threaded extraction

For example, if a user wants to extract three indices called index A, index B, and index C, OSB will extract index A first, then index B, and finally index C. This is time-consuming for users who are extracting large indices or many indices. Since the process is slow, users have resorted to running the extraction in multiple terminal sessions. This workaround comes with its own complications, such as needing to consolidate all index data corpora into a single directory and rewrite the workload.json, which hurts the user experience.
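As a rough illustration of avoiding the multi-terminal workaround, the sketch below extracts several indices concurrently from a single invocation and collects the results before anything like workload.json would be written. It is a minimal sketch, not OSB's implementation: extract_index and extract_all are hypothetical names, the extraction itself is stubbed out, and threads are used instead of processes because extraction is mostly network-bound (ProcessPoolExecutor offers the same interface).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_index(index_name):
    """Hypothetical stand-in for extracting one index's mapping and corpus.

    A real implementation would stream documents from the cluster to disk;
    this stub only returns placeholder metadata.
    """
    return {"index": index_name, "documents_file": f"{index_name}-documents.json"}


def extract_all(index_names, max_workers=None):
    """Extract indices concurrently and return (results, failures) by index name."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_index, name): name for name in index_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure and let the other indices keep going.
                failures[name] = str(exc)
    return results, failures


if __name__ == "__main__":
    done, failed = extract_all(["index-a", "index-b", "index-c"])
    # A single workload.json would be written here, after every worker finished.
    print(sorted(done), failed)
```

Keeping failures separate from results means one failed index does not discard the work done for the others.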

2. Frustrating read timeouts

The current method of extraction is arduous as it runs a match-all query on each index the user has specified. For indices with large amounts of data, users occasionally encounter read timeouts. This has occurred with indices containing more than 100 GB of data, even when there’s no other traffic on the cluster. Some users extracting multiple indices in multiple terminal sessions have come back to find that extraction failed at one of the indices, requiring them to start over. (Note: this is a separate and independent issue from point 1 above. Solving point 1 will not resolve the occasional read timeouts.)
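One way to soften occasional read timeouts is to retry the failing read with exponential backoff at the OSB level. The following is a minimal sketch under stated assumptions: with_retries is a hypothetical helper, fetch stands for whatever call reads the next batch of documents, and TimeoutError is a placeholder for the client's actual timeout exception type.

```python
import time


def with_retries(fetch, max_attempts=5, base_delay=1.0,
                 retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds between attempts. retry_on is an
    assumption: a real implementation would list the client's timeout
    exception type here.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter exists only so the backoff can be exercised in tests without waiting. Retrying is most useful when fetch resumes from the last successfully read position rather than restarting the whole index.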

3. Fixing incomplete workloads is painful and a waste of developer time

Occasionally, OSB will encounter an issue, such as a read timeout or a not-enough-disk-space error, and will leave users with an incomplete workload (i.e., a workload that cannot be run because it is missing files). The workload directory will be missing a workload.json file and potentially data corpora for some indices. For users not familiar with OSB, recovering is painful, as it requires building their own workload.json file, which means entering the number of docs and the uncompressed and compressed bytes, and formatting their queries. This is time-consuming and even more tedious for larger workloads. Additionally, extraction might have been interrupted while writing a document, leaving the most recently extracted document partially written. Users then have to find and remove this partial document, which is slow on the command line for larger data corpora.

To illustrate, a user was extracting an index of approximately 300 GB. OSB was only able to extract 92% of the index, leaving the last line of the corpus file (the last document) partially complete. The user had to run a Linux command to remove the last line, which took about 30 minutes (it was essentially a copy and paste, but required doubling the EC2 instance the user was working on), and then had to build the workload.json from scratch. Building representative workloads is an iterative process, meaning that the user might need to run this feature several times to get the right workload, and it can ruin the user experience if this happens constantly.
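Removing a partial last document does not have to involve copying the whole corpus. The sketch below truncates the file in place by scanning backwards from the end for the last newline, so its cost depends on the length of the partial line rather than on the corpus size. It assumes the corpus is newline-delimited JSON with every complete document terminated by a newline; drop_partial_last_line is a hypothetical helper, not an existing OSB function.

```python
import os


def drop_partial_last_line(path, chunk_size=64 * 1024):
    """Truncate the file at path so that it ends at its last newline.

    Returns the number of bytes removed (0 if the file already ends with a
    newline). Reads backwards from the end of the file in chunks.
    """
    with open(path, "rb+") as f:
        size = f.seek(0, os.SEEK_END)
        pos = size
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            newline = chunk.rfind(b"\n")
            if newline != -1:
                keep = start + newline + 1  # keep everything up to the newline
                f.truncate(keep)
                return size - keep
            pos = start
        # No newline anywhere: the whole file is a single partial line.
        f.truncate(0)
        return size
```

Note that a complete final document without a trailing newline would also be removed, so this is only safe when the writer always terminates documents with a newline.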

4. Simplified workload structure prevents customizability

When a workload is produced, it usually contains the data corpus, index settings, and a workload.json file. Although this is a simplified workload, it actually makes it more difficult for users to extend the workload. The supplied OSB workloads have a similar structure but instead of having the test procedures and operations both condensed in a single file (workload.json), they have them in separate directories. This component based approach improves customizability. Since users are more familiar with the official workloads, create-workload should produce a workload with identical structure to the official workloads. Additionally, if a workload is ever added to the official workloads, the transition will be much smoother.

5. Restrictive progress bar

OSB tightly couples a progress bar to the function that extracts data corpora. If OSB ever needs to enhance the way it extracts data corpora, the progress bar will need to be reworked as well. For example, a user was testing a new sub-feature within the create-workload feature, which affected the progress bar. Fixing the progress bar took the user more time than implementing the actual feature.

6. Error Prone Approach to Add Custom Queries

OSB allows users to provide custom queries to their workloads by passing in a JSON file containing the queries via the --custom-queries parameter of the create-workload feature. However, this approach is prone to user-formatting errors since users are required to add an operation name, type, and body. This is more work for the user as they will have to focus on proper formatting.
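To make the formatting burden concrete, the sketch below builds operation entries from a directory of query files, along the lines of the directory-based option proposed later in this RFC: the file name supplies the operation name and the file content supplies the body, so users only write the query itself. The function name and the "name", "operation-type", and "body" keys are assumptions modeled on the fields mentioned above, not a confirmed OSB schema.

```python
import json
from pathlib import Path


def operations_from_directory(directory, operation_type="search"):
    """Build one operation per *.json file found in directory.

    The file stem becomes the operation name and the parsed file content
    becomes the body. Raises ValueError naming the file if it is not valid JSON.
    """
    operations = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            body = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path.name}: invalid JSON ({exc})") from exc
        operations.append(
            {"name": path.stem, "operation-type": operation_type, "body": body}
        )
    return operations
```

Reporting the offending file name in the error keeps formatting mistakes easy to locate, which is the main pain point with a single hand-written queries file.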

7. Cannot Create Workloads from AOSS Collections (Serverless Amazon OpenSearch)

OSB users are unable to create workloads from AOSS Collections. This is because the create-workload feature is still coupled to OpenSearch-Py calls and APIs that are not compatible with AOSS Collections.

Create-Workload Development Process

1. Not easily extendable

The code for create workload is inefficient and restrictive. As of now, it’s only a set of function calls that are tightly coupled.

Documentation

1. No proper guidance on how to build representative workloads

There’s currently only a reference on how to use OSB to create a workload but no guidance on how to make it a representative workload. Some users might assume that the workload produced is identical in every way to their cluster’s workload. On the contrary, users must be aware of the characteristics of the workload (the sharding, query and ingestion patterns, and more) and aim to replicate those in order to make it representative.

Proposed Priorities and Solutions

These are solutions that will greatly improve the create-workload experience. These are in no particular order and each can be opened in an independent issue within the OpenSearch Benchmark repository.

1. Redesign the Create Workload feature: The current design is a naive approach that feels restrictive and isn’t easily extensible. By redesigning this with a composition-based approach, adding new features and testing will be easier. Also, to improve overall performance, the feature can use ZSTD compression when compressing data during the extraction process. There's already an issue (#385) to support ZSTD decompression when OSB decompresses data for a test.

2. Add the ability to parallelize the extraction of data: Currently, users will specify the indices to extract data from and OSB will extract them in sequential order. Instead of having users run multiple sessions to extract multiple indices, OSB should support extracting indices in parallel. To do this, OSB should provision a separate process for each index and once all processes have finished, it will write a workload.json. This would remove the need for users to provision different terminal sessions, consolidate the data corpora into a single directory, and rewrite the workload.json.

3. Add read retries and produce a runnable workload for partial data corpora: Users who experience read timeouts or other interruptions during the extraction process are often left with incomplete workloads. To ease this experience, OSB should attempt retries itself and not rely only on the opensearch-py client to do so. If retries do not work, OSB should remove the last incomplete document in the data corpora, regenerate the compressed data corpora, and produce a workload.json. Although users will be left with a workload built from partial data, they can still run it and will not need to go through the painful process of editing and regenerating files manually and creating a workload.json. Another option is to record the indices that failed in the workload.json or in a separate file. With this, OSB will know which indices failed and can retry the extraction of failed indices or potentially resume extracting data from where it last failed.

4. Write test procedures and operations to their respective directories: Instead of writing all workload attributes into workload.json, create-workload should organize test procedures and operations into their respective directories. Users are more familiar with this structure as it is identical to the official workloads’ directory structure and makes it easier to extend test procedures and operations.

5. Add User Guide Documentation: Add documentation that educates users on what can be done to make the workload produced more representative of their reference cluster’s workload.

6. Replace progress bar with a progress bar Python library: The progress bar should be decoupled from the extraction function. This will make it easier to improve the way OSB extracts the data corpora without interfering with the progress bar. It will also make it easier to maintain and update the progress bar.

7. Add capability to dynamically populate queries with a directory of queries in addition to a file of queries: Currently, OSB allows users to pass in a JSON file containing queries via the --custom-queries parameter for the create-workload feature. However, this is prone to user-formatting errors. Another good option would be to allow users to specify a directory of JSON files, with a parameter such as --custom-queries-directory, where each file is named after the query. Each file would just need to contain the body of the query. This would remove the need for the user to construct a JSON file with an operation name, type, and body. We would also have to rename the --custom-queries parameter to --custom-queries-file to be more specific now that there are two approaches to add queries. While this removes the potential for user-formatting errors, it might restrict users to providing only search queries in the directory. We'll need to add support for users who want to apply other types of operations.

8. Add support for creating workloads from AOSS collections: Users are unable to create workloads from AOSS collections. This causes them to either mirror data to an AOS / OpenSearch cluster or run a script to fetch all the documents and build the workload from scratch. In order to address this, a component-based architecture would be beneficial.

9. Add support for random sampling: Users might be interested in creating a smaller model of their workload that uses a fraction of the documents. However, to keep it representative of the original indices, users might be interested in creating a workload with every Nth document in their indices. For more details, see this issue.

10. Add Default Index Settings to Index.json produced: Users should be able to have default index settings in the index.json files produced. For more details, see this issue.
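The "every Nth document" idea from priority 9 can be sketched with the standard library alone. Strictly speaking this is systematic sampling rather than random sampling, and sample_every_nth is a hypothetical helper, not part of OSB. It works on any iterable of documents, including an open newline-delimited corpus file.

```python
from itertools import islice


def sample_every_nth(documents, n, offset=0):
    """Return an iterator over every n-th item of documents.

    Sampling starts at position offset (0-based) and is lazy, so a large
    corpus is never held in memory.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return islice(documents, offset, None, n)
```

For example, passing an open corpus file and n=10 yields one line out of every ten, which can then be written to a smaller corpus file with writelines.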

Use Cases

As an OpenSearch Benchmark developer, I want to be able to easily add new features to the create-workload feature.
As an OpenSearch Benchmark user, I would like to be able to produce custom workloads that are representative and have specific characteristics from my cluster’s workload.

How Can You Help?

Any general comments about the overall direction are welcome.
Indicating whether the areas identified above for workload enhancement include your scenarios and use-cases will be helpful in prioritizing them.
Provide early feedback by testing the new workload features as they become available.
Help out on the implementation! Check out the issues page for work that is ready to be picked up.

Appendix

Workload Anatomy Recap
Data corpora → documents to ingest and run operations on
Index.json → contains index mappings. One index.json per index in workload
Workload.json → starting point and references information of every other file
Operations → operations run during the test, typically ingest operations and search operations
Test Procedures → order to run operations
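Because workload.json references the data corpora, each corpus entry carries a few numbers that are tedious to compute by hand when a workload has to be rebuilt manually. The sketch below derives them for one newline-delimited corpus file. It is illustrative only: corpus_metadata is a hypothetical helper, bz2 is assumed as the compression format, and the key names are assumptions about how a corpus entry records the document count and byte sizes.

```python
import bz2
import os


def corpus_metadata(documents_path):
    """Compress a newline-delimited corpus file and describe it.

    Writes a .bz2 copy next to the source file and returns the document
    count together with the compressed and uncompressed sizes in bytes.
    """
    compressed_path = documents_path + ".bz2"
    document_count = 0
    with open(documents_path, "rb") as src, bz2.open(compressed_path, "wb") as dst:
        for line in src:  # one document per line
            document_count += 1
            dst.write(line)
    return {
        "source-file": os.path.basename(compressed_path),
        "document-count": document_count,
        "compressed-bytes": os.path.getsize(compressed_path),
        "uncompressed-bytes": os.path.getsize(documents_path),
    }
```

Streaming line by line keeps memory use flat regardless of corpus size.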

gkamat commented 1 year ago

The proposed changes will improve the create-workload feature immensely. It would be useful to address them in a phased manner. Prioritizing them in an appropriate order with intermediate deliverables will enable users to try them out and provide suggestions to enhance the feature even further.

Adding additional queries such as term queries inferred from the corpus, and perhaps sort/range/aggregation queries as well will improve the workload generated too.

IanHoang commented 9 months ago

Based on @gkamat's PR in OSB's workload repository, I have added an additional priority to support a better approach for adding custom queries.

qiaoxux commented 8 months ago

Thanks @IanHoang. Here are a few asks from the Serverless team, in case they have not been covered in the current improvement plan:

Support create-workload for Serverless collections
Support large workload (Up to 100 TB) creation for both log analytics and search text use cases.
Support event data workload where ingestion can run endlessly. In this case, total collection size can be controlled by retention policy.

IanHoang commented 3 months ago

1. Redesign the Create Workload feature has been addressed in #609.

IanHoang commented 3 months ago

Created META issue to track priorities: https://github.com/opensearch-project/opensearch-benchmark/issues/616
