Open chelma opened 1 year ago
Does this issue belong in https://github.com/opensearch-project/opensearch-devops? Or somewhere else? This repo is really for producing the distribution of OpenSearch.
Maybe. Probably? Looking for some guidance here as this is my first time posting a proposal to the project. Where do you think it should live and get visibility?
@dblock @chelma I think we should move this to the main repo for feedback/discussion.
@chelma I suggest bringing some of this discussion in some presentation form to the community meeting, too!
@dblock Great idea, will do!
I like the use of Docker, but it will slightly narrow the scope of the kinds of tests you'd like to run. I'm thinking of performance tests: confirming that there are no regressions in performance for a given workload across a similarly situated cluster. Another would be testing on Windows clusters, once OpenSearch is available for more than one release.
Those can be future concerns, but it would be nice to not over constrain the design now.
Great proposal thanks for putting this out there!
I don't think this tool should live inside of OpenSearch - I think it should be part of its own repository. The way upgrade testing is conducted shouldn't be tightly coupled with the version of OpenSearch.
Include a mock of the input(s) to the tool; it would help clarify the range of supported scenarios.
We've added a GitHub Action [1] that orchestrates the BWC framework for ad-hoc version to version tests, with the following input. If we could cleanly use this tool to replace the under-the-covers components of this GitHub Action our team would gladly adopt.
```yaml
jobs:
  last-supported-major-to-current:
    ...
    - uses: ./.github/actions/run-bwc-suite
      with:
        plugin-previous-branch: "1.3"
        plugin-next-branch: "2.x"
        report-artifact-name: BWC-Last-Supported-Major
  current-to-next-unreleased-major:
    ...
    - uses: ./.github/actions/run-bwc-suite
      with:
        plugin-previous-branch: "2.x"
        plugin-next-branch: "main"
        report-artifact-name: BWC-Next-Major
```
[1] https://github.com/opensearch-project/security/pull/2253
Cluster Admin use cases are framed around a knowledgeable admin. Depending on how we want this tool to be used, it might be worthwhile to invest in lowering the barrier to extracting value from it. Automatically discovering the current cluster configuration is a great way to add instant value. Helping the cluster admin know what is tested, or what could be tested, might also be of value.
This tool is framed around relatively local testing (containers); supporting remote/cloud-managed clusters as sources or destinations would be useful for many more migration scenarios.
Might want to describe the capabilities of the tests that are executed; it seems like the tool will need some level of test harness/tracking.
Check out https://github.com/opensearch-project/opensearch-benchmark. While it was built for performance benchmarking on a single cluster (with cluster standup/interaction), maybe it can be extended or aspects of it reused.
Orthogonally to the functionality proposed - I'd recommend the tool be written in a strongly typed language.
Per comments and discussion, changed name to "Upgrade Testing Framework".
@dblock @peternied I wrote up a new doc [1] exploring the user experience for the framework that I think addresses most of your comments/suggestions. Would love to get your eyes on it if you have a few spare cycles.
[1] https://github.com/opensearch-project/opensearch-migrations/issues/29
Summary Of Work Being Proposed
It is proposed that the OpenSearch Project create a framework that makes it easy to test the results of performing a cluster version upgrade on Elasticsearch/OpenSearch clusters. This framework will accelerate development of improvements to the user story for upgrades. Additionally, it will enable cluster administrators to create simulacra of their real-world clusters and attempt an upgrade in a safe environment to determine the impact on data, metadata, plugins, etc. Finally, it provides a place for the wider community to centralize its knowledge of how to perform an upgrade, edge cases associated with different versions of the software/plugins, and documentation of incompatibilities and their resolution, instead of spreading that knowledge across blog posts, private wikis, and tribal knowledge.
Terminology
Assumptions
Why Is The Work Needed?
The existing backwards compatibility (BWC) tests in the OpenSearch Project repos currently capture component-level, happy-path expectations. However:
What Use-Cases Will The Work Resolve?
Project Tenets
Proposed Design
It is proposed to make a command-line tool that can be executed to test upgrades between arbitrary cluster configurations. The tool will be composed of multiple abstraction layers to separate responsibilities and enhance extensibility. Docker will be used to set up the test cluster on the user's machine.
Python-Based Orchestration Layer
A Python orchestration layer will serve as the user's portal into the framework via a command-line interface. It will accept an incoming test request, set up the required cluster, execute the requested upgrade, initiate the analysis/testing steps at the appropriate times, provide terminal output indicating progress, and provide a reference to the detailed final results. Python is a flexible, practical language for interacting with operating systems and is widely used in industry. Additionally, new features and bug fixes can be tested via direct modification, without recompiling the code or using a specific integrated development environment, decreasing the level of effort to contribute to the framework.
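To make the orchestration layer's surface concrete, here is a minimal sketch of what the command-line interface might look like. The flag names, choices, and defaults below are illustrative assumptions, not a committed design:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI surface for the orchestration layer."""
    parser = argparse.ArgumentParser(prog="upgrade-testing-framework")
    parser.add_argument("--source-version", required=True,
                        help="Version of the cluster to start from, e.g. an ES 7.x release")
    parser.add_argument("--target-version", required=True,
                        help="Version to upgrade to, e.g. an OS 2.x release")
    parser.add_argument("--upgrade-style", choices=["rolling", "snapshot-restore"],
                        default="snapshot-restore",
                        help="How the upgrade is performed on the test cluster")
    parser.add_argument("--docker-image", default=None,
                        help="Optional user-supplied image for the source cluster nodes")
    return parser
```

A user would then invoke something like `upgrade-testing-framework --source-version ES_7_10_2 --target-version OS_2_5_0`, and the orchestration layer would drive the rest of the run from the parsed arguments.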
Docker-Based Configuration Management
Docker will be used to set up and configure the test cluster. This ensures the portability, isolation, and repeatability of the framework. Docker is widely used in industry for this purpose, lessening the knowledge burden required to use the framework. Users will be able to test clusters backed by arbitrary distros on any device (including laptops) rather than needing dedicated hosts. Node/cluster setup will be repeatable, and test teardown will be automatic. Users can easily bring their own setup by swapping out which Docker image(s) the framework uses for the simulated migration. Users can build an image on-device from a supplied Dockerfile, or improve the setup time by pulling pre-built images from a user-selected repo.
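As a sketch of how node configuration might map onto container settings, the function below assembles (but does not execute) a `docker run` invocation for a single test node. A real implementation might use the Docker SDK for Python instead; the image, port, and environment-variable choices here are illustrative assumptions:

```python
def docker_run_command(image: str, node_name: str, cluster_name: str,
                       http_port: int = 9200) -> list:
    """Build the argv for launching one test-cluster node as a container.

    The command is returned rather than executed, so the mapping from
    node settings to container flags can be inspected and tested.
    """
    return [
        "docker", "run", "--detach",
        "--name", node_name,
        # Expose the node's HTTP API on a host port for the analysis layer.
        "--publish", f"{http_port}:9200",
        # Cluster membership is driven through environment variables.
        "--env", f"cluster.name={cluster_name}",
        "--env", f"node.name={node_name}",
        image,
    ]
```

Swapping the `image` argument is all a user would need to do to bring their own Dockerfile-built or pre-pulled image into a run.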
Python-Based Analysis Layer
A Python analysis layer will interrogate the test cluster at each point of its upgrade to determine whether it is proceeding according to expectations, producing partial results that are later combined into a final report. For example, one expectation might be that the same number of documents exists in the test cluster before and after the upgrade. Another might be that the representation format of a given field changes over the course of the upgrade. Yet another might be that the upgrade fails due to an incompatibility between two plugin versions.
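The first example above (document counts preserved across the upgrade) could be expressed as a check like the following. This is a minimal sketch, assuming the analysis layer has already collected per-index counts (e.g. from the `_count` API) before and after the upgrade; the partial-result dictionary shape is a hypothetical convention:

```python
def check_doc_count_preserved(before: dict, after: dict) -> dict:
    """Compare per-index document counts captured before and after an upgrade.

    `before` and `after` map index name -> document count. Returns a
    partial result suitable for later aggregation into a final report.
    """
    mismatches = {
        index: (before[index], after.get(index))
        for index in before
        if after.get(index) != before[index]
    }
    return {
        "expectation": "document counts preserved across upgrade",
        "passed": not mismatches,
        "details": mismatches,
    }
```

Other expectations (field-format changes, plugin incompatibilities) would follow the same pattern: a small function that inspects cluster state at a checkpoint and emits a partial result.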
Report-Generation Layer
The report-generation layer will assemble the partial results created by the orchestration layer’s periodic invocation of the analysis layer into a final report for the user.
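Folding the partial results into a final report could be as simple as the sketch below, assuming each partial result carries a `passed` flag as in the hypothetical convention above:

```python
def generate_report(partial_results: list) -> dict:
    """Aggregate partial results from the analysis layer into a final report."""
    failures = [result for result in partial_results if not result.get("passed")]
    return {
        "total_checks": len(partial_results),
        "failed_checks": len(failures),
        # The upgrade is considered successful only if every check passed.
        "upgrade_ok": not failures,
        "failures": failures,
    }
```

A real report would likely include richer detail (timings, cluster logs, per-checkpoint snapshots), but the aggregation step itself stays this simple.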
Open Questions
Alternatives to Docker
Docker is the industry-standard tool for containerization. However, it is not a free-and-open-source tool. Per Docker, there are carveouts for individual developers, small companies, and open source development, but otherwise a license is required (see here). It's questionable whether this framework would qualify for their open source carveout (see here). In the event that the framework does not qualify, a license would likely be required for at least some users to leverage/contribute to the framework. License fees are small ($9/user/month, see here) and would only be needed by the specific Cluster Admins and OpenSearch Developers using the tool.
Therefore, Docker seems like a reasonable choice to build the framework around. However, if Docker is deemed not viable, for whatever reason, then a possible alternative is Podman.
About Podman
Disclaimer: the author has minimal experience w/ Podman outside of reading docs/blog posts.
Podman is a Linux-native, free, open source containerization program (see here) that is compatible with the same Open Containers Initiative mechanisms and formats that Docker relies on. This means that it supposedly behaves quite similarly to Docker and can use most Docker images in public repos without issue.
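Because Podman's CLI is largely command-compatible with Docker's, the framework could keep the runtime pluggable with a small selection policy like the sketch below. The preference order and error message are assumptions; `which` is injected (e.g. `shutil.which`) so the policy can be tested without either runtime installed:

```python
def pick_container_runtime(which) -> str:
    """Return the first available container runtime binary name.

    `which` is a lookup function like shutil.which: it returns a path
    string when the binary exists on PATH, or None otherwise.
    """
    for candidate in ("docker", "podman"):
        if which(candidate):
            return candidate
    raise RuntimeError("Neither docker nor podman was found on PATH")
```

The rest of the framework would then build commands against the returned binary name, keeping the Docker-vs-Podman decision in one place.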
The biggest differences between Docker and Podman for our use-cases appear to be: