opensearch-project / opensearch-migrations

All things migrations and upgrades for OpenSearch
Apache License 2.0

[Proposal] Create Upgrade Testing Framework #30

Open chelma opened 1 year ago

chelma commented 1 year ago

Summary Of Work Being Proposed

It is proposed that the OpenSearch Project create a framework that makes it easy to test the results of performing a cluster version upgrade on Elasticsearch/OpenSearch clusters. This framework will accelerate development of improvements to the user story for upgrades. Additionally, it will enable cluster administrators to create simulacra of their real-world clusters and attempt an upgrade in a safe environment to determine the impact on data, metadata, plugins, etc. Finally, it provides a place for the wider community to centralize its knowledge of how to perform an upgrade, the edge cases associated with different versions of the software/plugins, and incompatibilities and their resolutions, instead of spreading that knowledge across blog posts, private wikis, and tribal knowledge.

Terminology

Assumptions

Why Is The Work Needed?

The existing backwards compatibility (BWC) tests in the OpenSearch Project repos currently capture component-level, happy-path expectations. However:

What Use-Cases Will The Work Resolve?

Project Tenets

Proposed Design

It is proposed to build a command-line tool that can be executed to test upgrades between arbitrary cluster configurations. The tool will be composed of multiple abstraction layers to separate responsibilities and enhance extensibility. Docker will be used to set up the test cluster on the user's machine.

Python-Based Orchestration Layer

A Python orchestration layer will serve as the user's portal into the framework via a command-line interface. It will accept an incoming test request, set up the required cluster, execute the requested upgrade, initiate the analysis/testing steps at appropriate times, provide terminal output indicating progress, and provide a reference to the detailed final results. Python is a flexible, practical language for interacting with operating systems and is widely used in industry. Additionally, the language allows new features and bug fixes to be easily tested via direct modification, without needing to re-compile the code or use specific integrated development environments, which decreases the level of effort required to contribute to the framework.
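
As a rough illustration, a minimal sketch of what the command-line entry point could look like is shown below. The argument names (--source-version, --target-version, --test-config) and the overall flow are assumptions for discussion, not a settled interface.

# upgrade_testing_framework/cli.py -- illustrative sketch only; all names here are assumptions
import argparse
import sys

def parse_args(argv):
    parser = argparse.ArgumentParser(
        description="Test an upgrade between two cluster configurations")
    parser.add_argument("--source-version", required=True,
                        help="Version the cluster starts on, e.g. an Elasticsearch 7.10.2 distro")
    parser.add_argument("--target-version", required=True,
                        help="Version to upgrade to, e.g. an OpenSearch 2.x distro")
    parser.add_argument("--test-config", default="test_config.json",
                        help="File describing the data, plugins, and expectations to test")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv if argv is not None else sys.argv[1:])
    # 1. Stand up the source cluster (Docker-based configuration management)
    # 2. Load the test data/plugins described in the test config
    # 3. Invoke the analysis layer to capture pre-upgrade expectations
    # 4. Perform the upgrade, invoking the analysis layer at each step
    # 5. Hand the partial results to the report-generation layer
    print(f"Testing upgrade: {args.source_version} -> {args.target_version}")

if __name__ == "__main__":
    main()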

Docker-Based Configuration Management

Docker will be used to set up and configure the test cluster. This ensures the portability, isolation, and repeatability of the framework. Docker is widely used in industry for this purpose and will lessen the knowledge burden required to use the framework. Users will be able to test clusters backed by arbitrary distros on any device (including laptops) rather than needing dedicated hosts. Node/cluster setup will be repeatable, and test teardown will be automatic. Users can easily bring their own setup by swapping out which Docker image(s) the framework uses for the simulated migration. Users can build an image on-device from a supplied Dockerfile or improve setup time by pulling pre-built images from a user-selected repo.
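
To make this concrete, here is a minimal sketch of how the framework might stand up and tear down a single-node test cluster by shelling out to the Docker CLI. The image tag, container name, port mapping, and security setting are example values, not a committed configuration.

# Illustrative sketch only: image tag, container name, and settings are example values.
import subprocess

def start_source_node(image="opensearchproject/opensearch:1.3.7", name="utf-source-node"):
    # Launch a single-node cluster; disabling the security plugin simplifies test access over HTTP
    subprocess.run(
        ["docker", "run", "-d", "--name", name,
         "-p", "9200:9200",
         "-e", "discovery.type=single-node",
         "-e", "plugins.security.disabled=true",
         image],
        check=True)

def teardown_node(name="utf-source-node"):
    # Automatic teardown keeps test runs repeatable
    subprocess.run(["docker", "rm", "-f", name], check=True)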

Python-Based Analysis Layer

A Python analysis layer will interrogate the test cluster at each point of its upgrade to determine whether or not it is proceeding according to expectations, and will produce partial results that are later combined into a final report. For example, one expectation might be that the same number of documents exist in the test cluster before and after the upgrade. Another might be that the representation format of a given field changes over the course of the upgrade. Another might be that the upgrade fails due to an incompatibility between two plugin versions.
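
A hedged sketch of the document-count expectation described above is shown below. The GET /<index>/_count endpoint is a real Elasticsearch/OpenSearch API; the class and method names are hypothetical.

# Illustrative sketch: class/method names are hypothetical; _count is a real ES/OpenSearch API.
import requests

def count_documents(endpoint, index):
    response = requests.get(f"{endpoint}/{index}/_count")
    response.raise_for_status()
    return response.json()["count"]

class DocCountExpectation:
    """Partial result: the document count should be unchanged by the upgrade."""
    def __init__(self, endpoint, index):
        self.endpoint = endpoint
        self.index = index
        self.before = None

    def capture_pre_upgrade(self):
        self.before = count_documents(self.endpoint, self.index)

    def check_post_upgrade(self):
        after = count_documents(self.endpoint, self.index)
        return {"expectation": "doc_count_unchanged",
                "passed": after == self.before,
                "before": self.before,
                "after": after}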

Report-Generation Layer

The report-generation layer will assemble the partial results created by the orchestration layer’s periodic invocation of the analysis layer into a final report for the user.
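
As one possible shape for that final report, a minimal sketch is shown below; the JSON structure and file name are assumptions.

# Illustrative sketch: the report shape and file name are assumptions.
import json
import time

def generate_report(partial_results, path="upgrade_test_report.json"):
    report = {
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "passed": all(result.get("passed", False) for result in partial_results),
        "results": partial_results,
    }
    with open(path, "w") as report_file:
        json.dump(report, report_file, indent=2)
    return path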

Open Questions

Alternatives to Docker

Docker is the industry-standard tool for containerization. However, it is not a free-and-open-source tool. Per Docker, there are carveouts for individual developers, small companies, and open source development, but otherwise a license is required (see here). It's questionable whether this framework would qualify for their open source carveout (see here). In the event that the framework does not qualify, a license would likely be required for at least some users to leverage/contribute to the framework. License fees are small ($9/user/month, see here) and would only be needed by the specific Cluster Admins and OpenSearch Developers using the tool.

Therefore, Docker seems like a reasonable choice to build the framework around. However, if Docker is deemed not viable for any reason, a possible alternative is Podman.

About Podman

Disclaimer: the author has minimal experience w/ Podman outside of reading docs/blog posts.

Podman is a Linux-native, free and open source containerization program (see here) that is compatible with the same Open Container Initiative mechanisms and formats that Docker relies on. This means it reportedly behaves quite similarly to Docker and can use most Docker images in public repos without issue.

The biggest differences between Docker and Podman for our use-cases appear to be:

dblock commented 1 year ago

Does this issue belong in https://github.com/opensearch-project/opensearch-devops? Or somewhere else? This repo is really for producing the distribution of OpenSearch.

chelma commented 1 year ago

Maybe. Probably? Looking for some guidance here as this is my first time posting a proposal to the project. Where do you think it should live and get visibility?

mch2 commented 1 year ago

@dblock @chelma I think we should move this to the main repo for feedback/discussion.

dblock commented 1 year ago

@chelma I suggest bringing some of this discussion in some presentation form to the community meeting, too!

chelma commented 1 year ago

@dblock Great idea, will do!

gregschohn commented 1 year ago

I like the usage of Docker, but it will slightly narrow the scope of the kinds of tests you'd like to run. I'm thinking of performance tests - confirming that there are no regressions in performance for a given workload across a similarly situated cluster. Another would be testing on Windows clusters, once OpenSearch is available for > 1 release.

Those can be future concerns, but it would be nice not to over-constrain the design now.

peternied commented 1 year ago

Great proposal thanks for putting this out there!

We've added a GitHub Action [1] that orchestrates the BWC framework for ad-hoc version-to-version tests, with the following input. If we could cleanly use this tool to replace the under-the-covers components of this GitHub Action, our team would gladly adopt it.

jobs:
  last-supported-major-to-current:
    ...
    - uses: ./.github/actions/run-bwc-suite
      with:
        plugin-previous-branch: "1.3"
        plugin-next-branch: "2.x"
        report-artifact-name: BWC-Last-Supported-Major

  current-to-next-unreleased-major:
    ...
    - uses: ./.github/actions/run-bwc-suite
      with:
        plugin-previous-branch: "2.x"
        plugin-next-branch: "main"
        report-artifact-name: BWC-Next-Major

[1] https://github.com/opensearch-project/security/pull/2253

chelma commented 1 year ago

Per comments and discussion, changed name to "Upgrade Testing Framework".

chelma commented 1 year ago

@dblock @peternied I wrote up a new doc [1] exploring the user experience for the framework that I think addresses most of your comments/suggestions. Would love to get your eyes on it if you have a few spare cycles.

[1] https://github.com/opensearch-project/opensearch-migrations/issues/29