Create a "nightly" instance of Playground that gets updated every day with the daily build.

CEHENKLE commented 1 year ago

Is your feature request related to a problem? Please describe

I'd like to be able to play with new features faster as they're getting built on OpenSearch. Therefore, I'd like a "nightly" version of playground to get built with every successful build -- first with the 2.x build, but ultimately with main as well.

Describe the solution you'd like

Whenever there's a successful nightly build, we replace the current nightly playground build automatically. We will need a way to see what version we're pointing at.

Acceptance Criteria:

Support upcoming version for 2.x
OS and OSD get deployed nightly on a regular basis using the latest builds.
Have generic anonymous (read-only) access for OpenSearch Dashboards.
Display the commits related to a build-id.
The instance should be publicly accessible.

Action Items

[x] https://github.com/opensearch-project/opensearch-devops/issues/130
[x] https://github.com/opensearch-project/opensearch-cluster-cdk/issues/72
[x] https://github.com/opensearch-project/opensearch-cluster-cdk/issues/73
[x] https://github.com/opensearch-project/opensearch-cluster-cdk/issues/74
[x] https://github.com/opensearch-project/opensearch-cluster-cdk/issues/75 : Research and integrate opensearch-cluster-cdk into nightly playgrounds. May involve publishing the cdk code base to npm or consuming it by cloning the repo.
[x] https://github.com/opensearch-project/opensearch-devops/issues/131
[x] https://github.com/opensearch-project/opensearch-devops/issues/132
[x] Set up accounts and required permissions (OIDC) to deploy the stacks on the infrastructure
[x] Ensure the deploying artifact validation always validates with security plugin strictly
[x] Add GHA deployment workflow for tools and notification stack
[x] Add GHA deployment workflow for opensearch-cluster-cdk stacks to deploy nightly.

bbarani commented 1 year ago

@CEHENKLE Thanks for opening an issue. Should we eventually create a playground instance for all actively supported versions, i.e. 1.x, 2.x and the main branch? Also, I assume it's for both OpenSearch and OpenSearch Dashboards?

CC: @opensearch-project/engineering-effectiveness

gaiksaya commented 1 year ago

Hi @CEHENKLE ,

A couple of questions in addition to the above, before we can get started:

We recently introduced continuing the build even if a few of the non-crucial components fail to build. When we deploy the nightly, do you expect all components to be present, or is it enough that the cluster is up and running even though it is missing something like the sql plugin because it did not pass the build? This applies to both OpenSearch and OpenSearch Dashboards.
In case of consistent failures for days/weeks due to crucial components, the playground won't be updated until they are fixed. Is that okay?

Crucial components today: Both OS and OSD core engines, job-scheduler and common-utils.
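To make the two options in the first question concrete, here is a minimal sketch of how the deploy decision could be expressed. The component names, the `build_results` mapping and the `require_all` flag are assumptions made for illustration; they are not the real build manifest format or an existing option in the build system.

```python
# Hypothetical names for the crucial components listed above.
CRUCIAL_COMPONENTS = {
    "OpenSearch",
    "OpenSearch-Dashboards",
    "job-scheduler",
    "common-utils",
}


def deployment_decision(
    build_results: dict[str, bool], require_all: bool = False
) -> tuple[bool, list[str]]:
    """Return (deploy?, failed components) for one nightly build.

    build_results maps a component name to True if it built successfully.
    With require_all=False only crucial components block the deployment;
    with require_all=True any failed component blocks it.
    """
    failed = {name for name, ok in build_results.items() if not ok}
    # A crucial component absent from the results is treated as failed.
    failed |= CRUCIAL_COMPONENTS - set(build_results)
    blocking = failed if require_all else failed & CRUCIAL_COMPONENTS
    return (not blocking, sorted(failed))
```

With the lenient setting, a build where only the sql plugin failed would still be deployed, and the returned list could be used to tell users which plugins are missing.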

bbarani commented 1 year ago

We recently introduced continuing the build even if a few of the non-crucial components fail to build. When we deploy the nightly, do you expect all components to be present, or is it enough that the cluster is up and running even though it is missing something like the sql plugin because it did not pass the build? This applies to both OpenSearch and OpenSearch Dashboards.

My 2 cents. We should just deploy the latest snapshot artifact (with whatever plugins are in there; we already have logic to fail builds for crucial plugins). It would be great if we could annotate the build and test details corresponding to the deployed artifact (either via iframe or through indexing) so users would know the list of plugins available in that build.

In case of consistent failures for days/weeks due to crucial components, the playground won't be updated until they are fixed. Is that okay?

Surfacing the build information, along with the build date used for a specific deployment, would help us understand long-pending failures.

prudhvigodithi commented 1 year ago

We can even explore putting all the build details, plugin information and the Jenkins build URL under a separate path. Example: https://nightly.playground.opensearch.org is the main dashboard page, and https://nightly.playground.opensearch.org/buildDetails has the information mentioned above.

Related Jenkins Example: https://build.ci.opensearch.org/systemInfo
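As a rough sketch of what such a build-details page could serve, the snippet below models the information as a small record and serializes it to JSON. The field names and the `BuildDetails` and `render_build_details` names are hypothetical, chosen only for illustration; nothing here reflects an existing endpoint or schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BuildDetails:
    """Hypothetical shape of the data a build-details page could show."""

    version: str
    build_id: str
    build_date: str  # date of the build that was deployed, e.g. ISO 8601
    jenkins_build_url: str
    plugins: list[str] = field(default_factory=list)


def render_build_details(details: BuildDetails) -> str:
    """Serialize the details as pretty-printed JSON with plugins sorted."""
    payload = asdict(details)
    payload["plugins"] = sorted(details.plugins)
    return json.dumps(payload, indent=2)
```

Keeping the build date next to the build id would also make it easy to spot when the playground has not been updated for a while.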

gaiksaya commented 1 year ago

Approaches

1. Use opensearch-cluster-cdk

Use opensearch-cluster-cdk as the mechanism to deploy the cluster. This code base already has the functionality to deploy the nightly built artifacts for both OpenSearch and OpenSearch Dashboards. Any missing functionality, such as a customized opensearch.yml or security permissions, can be contributed to the code base.

[Image: image.png]

Pros

Well established, tested and actively used code base readily available to use.
Used to deploy other publicly available OpenSearch and OpenSearch Dashboards clusters (data-store cluster)
Needs minimal changes
Reproducible for the community
Has future scope to onboard other distributions (if at all required)

Cons

As of now, this set-up only tests tarball as a distribution
Development would depend on the development of opensearch-cluster-cdk
The cluster needs to be managed by us. We need to take care of everything, starting with data, permissions and deployments
Need to add new workflow/pipeline for active deployments

2. Onboard to existing playground framework

The GitHub repository dashboards-anywhere is responsible for hosting the multiple playground instances that exist today. The on-boarded instances can be tracked here: https://github.com/opensearch-project/dashboards-anywhere/tree/main/config/playground/helm The code base uses EKS, Terraform and Helm together to form the end cluster. Deployments are taken care of automatically by the GitHub Actions workflows.

Since this uses Helm charts at the backend, all we need is a container image (Docker Hub/ECR) pushed to staging daily, which we already have.

Pros

Actively used code base for current playground set up.
Deployment is taken care of by the repository GitHub Actions
Maintenance of the code base is a shared effort
Detailed on-boarding instructions

Cons

Extensive manual setup required. We might need to spend some time understanding and coding the infrastructure part of it.
We are not using a direct distribution here (for example Docker or tarballs) but deploying using Helm. We should expect to see Helm- and EKS-related issues. However, features and functionality of the core software should not be affected.
Since we are not the code owners of the repository, development would depend on contributing PRs (we do have the option to become maintainers)
Semi-centralized code base
Needs improved credential handling in GitHub Actions

gaiksaya commented 1 year ago

Would like to get some input on these approaches from @Flyingliuhub @dblock @bbarani @AMoo-Miki . Please feel free to tag people who you think can provide valuable input on this. Thanks!

AMoo-Miki commented 1 year ago

Thanks Sayali. Both of these are great options and as you pointed out have great pros and probably painful cons.

Considering that extensive manual setup is needed now (and will probably be needed again when certain updates happen), and that we could face challenges that we cannot fix ourselves, the existing playground framework sounds more challenging to set up. I also wonder if this pipeline will make it less customizable for us to do crazy things like manipulating the installations.

I am working on a proposal for the Playground to use a modified security plugin that will create read/write metadata indices for each visitor, allowing them to log in anonymously but experience a fully functioning Dashboards. I suspect the modifications to the plugin would be applied as a post-install patch. Similarly, we might want to have the latest version of OUI included in the nightly builds; patching post-install would be better than building specific images for playground that are different from the images we release nightly.

While being forced to maintain the infrastructure could be a pain, I feel it would be good for us to learn of the pains and solve them for the users.

Considering the freedom to customize (without building different images) and the ability to learn, the opensearch-cluster-cdk sounds more attractive to me.

gaiksaya commented 1 year ago

Thanks @AMoo-Miki. If there are going to be customizations and additional installations, I agree opensearch-cluster-cdk gives us that edge. A couple of questions before I proceed with drafting a design for this:

Can I know what role does OUI play in this?
When you say customized security plugin, I believe you mean customized permissions rather than the default ones?

AMoo-Miki commented 1 year ago

Can I know what role does OUI play in this?

OUI is a vital component of OSD which has its own release cycle. Any change in OUI has the potential to change the UX of OSD. For example, the recent UX changes to OSD were almost completely driven by OUI and we had to resort to setting up our own endpoints for nightly builds to showcase the changes.

When you say customized security plugin, I believe you mean customized permissions rather than the default ones?

My idea is much crazier than that: the idea patches the built artifacts of security-dashboards-plugin to allow for randomly suffixed .kibana/.opensearch_dashboards metadata stores.

In my vision:

OUI has two artifacts: (a) the consumable code and (b) its docs site which has its own playground
Every night, an attempt is made to build OUI, OSD, OS, and plugins; upon successful completion, these nightly artifacts are made public through the normal channels.
The latest nightly artifacts for OSD and its plugins are deployed to a staging environment and any custom patches are applied; these include a patch to use the nightly artifact of OUI, as well as security plugin's patch for random suffixes.
   1. If OSD fails to start due to an OUI incompatibility, the last known working nightly of OUI will be patched in and an issue will be raised on OUI to fix the problem. The nightly artifact for OUI will be marked as broken.
   2. If OSD fails to start due to a plugin, the plugin will be swapped with the last known working nightly and an issue will be raised with them.
The latest nightly artifact of OS and its plugins are deployed to the staging environment and any custom patches are applied; I don't have thoughts on any right now.
   1. If OSD fails to start due to an OS incompatibility, the last known working nightly of OS will be spun up and OSD will point to it.
   2. If OSD fails to start again, the problem is with OSD; an issue will be raised on OSD to fix the problem and the nightly will be marked broken. The last known working OSD will be used instead with the latest nightly of OS.
   3. If OSD does start up, the problem is with OS; an issue will be raised on OS to fix the problem and the nightly will be marked broken.
   4. If OS fails to start due to a plugin, yada yada yada!
The latest working artifacts from the above steps are deployed to playground; nightly OUI docs are deployed to OUI's website.
Custom data is populated to showcase all of the capabilities.
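The OS/OSD fallback part of the steps above can be sketched roughly as below. This is a simplified illustration only: it ignores OUI and plugins, and the `osd_starts` probe, the `Selection` type and the artifact identifiers are all hypothetical stand-ins for whatever the staging environment would actually provide.

```python
from typing import Callable, NamedTuple, Optional


class Selection(NamedTuple):
    dashboards: str        # OSD artifact to deploy
    opensearch: str        # OS artifact to deploy
    broken: Optional[str]  # "OS", "OSD", or None if both nightlies work


def select_artifacts(
    latest_osd: str,
    latest_os: str,
    good_osd: str,
    good_os: str,
    osd_starts: Callable[[str, str], bool],
) -> Selection:
    """Pick the OSD/OS pair to deploy, following the fallback steps above.

    osd_starts(osd, os) is a hypothetical probe that returns True when the
    given OSD artifact starts successfully against the given OS artifact.
    """
    # Happy path: both latest nightlies work together.
    if osd_starts(latest_osd, latest_os):
        return Selection(latest_osd, latest_os, None)
    # Retry the latest OSD against the last known working OS.
    if osd_starts(latest_osd, good_os):
        # OSD starts now, so the OS nightly is the broken one.
        return Selection(latest_osd, good_os, "OS")
    # OSD failed again, so the OSD nightly is the broken one; use the
    # last known working OSD with the latest OS nightly.
    return Selection(good_osd, latest_os, "OSD")
```

The `broken` field is where raising an issue and marking the nightly as broken would hook in.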

We might also want to keep the previous night's deployments active on a different port or fleet to be able to quickly switch if we find something horribly wrong in the morning.

PS, looking at these, you might feel custom scripts would be easier to build than a cdk; if you do, you are not alone :D

gaiksaya commented 1 year ago

Moving this issue to opensearch-devops repository as we are planning to host the codebase there. Thanks!

gaiksaya commented 1 year ago

Please see the high-level design posted here: https://github.com/opensearch-project/opensearch-devops/issues/130 I will be breaking the description of this issue down into smaller issues. Thanks!

gaiksaya commented 3 months ago

Closing this issue as nightly playgrounds have been working successfully for the last 2-3 releases: https://playground.nightly.opensearch.org/ There are upcoming enhancements such as #153, which will be followed up in the mentioned issue.
