opensearch-project / project-meta

Tools to make it easy to manage processes across the opensearch-project org.
https://opensearch.org/
Apache License 2.0

[PROPOSAL] Fix Flaky Github Actions That Use OpenSearch docker container #62

Open nhtruong opened 1 year ago

nhtruong commented 1 year ago

What is the problem you are trying to solve?

There is a bug from opensearch-build that causes the OpenSearch container to occasionally fail as soon as it has booted up. This has caused integration tests on the JavaScript client to fail intermittently, and I've been informed that other repos' workflows are also facing this issue.

Even though the chance of the OpenSearch container crashing is only 1 in 50 (per my benchmarks running thousands of such jobs), the chance of this bug failing a workflow is quite high when you run compatibility tests that stand up a few dozen OpenSearch instances. This has caused every other push/pull request to fail the Action check, which then requires an admin to rerun the failed jobs. This is not a good experience for either the contributors or the admins.
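As a rough sanity check of those figures, assume each container crashes independently with probability 1/50 and a workflow stands up n containers. The values n = 24 and n = 36 below are illustrative stand-ins for "a few dozen", not measured numbers:

```latex
\begin{align*}
P(\text{workflow fails}) &= 1 - \left(1 - \tfrac{1}{50}\right)^{n} \\
n = 24 &: \quad 1 - 0.98^{24} \approx 0.38 \\
n = 36 &: \quad 1 - 0.98^{36} \approx 0.52
\end{align*}
```

Under that independence assumption, a 2% per-container crash rate is enough to fail somewhere between a third and a half of all workflow runs, which is consistent with "every other" run failing.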

What else have you found out about this problem?

1. Grep for the Killing message:

Run the following script after the container has been stood up (you can add it to the Makefile after docker-compose up):

for i in 1 2 3; do \
    sleep 30; \
    if docker logs opensearch_opensearch_1 --tail 10 | grep -q "Killing opensearch process"; then \
        echo "Restarting OpenSearch Container..."; \
        docker restart opensearch_opensearch_1; \
    else break; fi; \
done;
sleep 30;

This is a quick and dirty workaround. You can copy-paste this script into your workflow step or Makefile (after replacing opensearch_opensearch_1 with your container's name, of course), and it will just work.

2. Autoheal + Auto Restart:

I benchmarked both solutions on over 700 jobs each, and they all passed.

dblock commented 1 year ago

@nhtruong I really think we're wasting our time retrying and restarting the containers; we should fix the root cause. Want to try writing a matrix job that runs enough containers in a loop or in parallel to reproduce this semi-consistently, and collect logs from the OpenSearch instance that doesn't start? There's an error in there, I'm almost sure.
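A minimal sketch of what such a reproduction loop could look like, for illustration only. The image tag, the `os-repro-` container name prefix, the `discovery.type=single-node` setting, the wait time, and the log directory are all assumptions rather than anything agreed in this thread. Newer OpenSearch images may also need extra environment variables (such as an initial admin password) before they start at all.

```shell
#!/usr/bin/env bash
# Sketch: start the container repeatedly and keep the logs of any run
# that prints the kill message. All defaults below are assumptions.

DOCKER="${DOCKER:-docker}"        # overridable, e.g. with a stub for a dry run
IMAGE="${IMAGE:-opensearchproject/opensearch:latest}"
RUNS="${RUNS:-50}"                # number of container starts to attempt
WAIT_SECONDS="${WAIT_SECONDS:-30}"
LOG_DIR="${LOG_DIR:-./crash-logs}"

# Reads log text on stdin; succeeds if it contains the kill message.
crashed() {
    grep -q "Killing opensearch process"
}

reproduce() {
    mkdir -p "$LOG_DIR"
    local failures=0 i name
    for i in $(seq 1 "$RUNS"); do
        name="os-repro-$i"
        "$DOCKER" run -d --name "$name" -e discovery.type=single-node "$IMAGE" >/dev/null
        sleep "$WAIT_SECONDS"
        if "$DOCKER" logs "$name" 2>&1 | crashed; then
            # Save the full log of the crashed run before removing the container.
            "$DOCKER" logs "$name" > "$LOG_DIR/$name.log" 2>&1
            failures=$((failures + 1))
        fi
        "$DOCKER" rm -f "$name" >/dev/null
    done
    echo "$failures of $RUNS runs crashed"
}
```

Source the file and call `reproduce` from a workflow step; running it as one leg of a matrix job would multiply the number of attempts. The saved files only contain what `docker logs` exposes, so digging up the underlying error may still require copying log files out of the container before it is removed.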

nhtruong commented 1 year ago

@dblock For sure. Lemme look for ways to grab better logs than the default container logs, which only show:

  Killing opensearch process 10
  Killing performance analyzer process 11

dblock commented 1 year ago

@nhtruong So we like https://github.com/opensearch-project/opensearch-js/pull/304? Let's document how to do that everywhere else? Can we reuse some of those GH workflows? Do we need a doc on integration testing?