opensearch-project / helm-charts

:wheel_of_dharma: A community repository for Helm Charts of OpenSearch Project.
https://opensearch.org/docs/latest/opensearch/install/helm/
Apache License 2.0
168 stars 227 forks source link

[Enhancement][opensearch][opensearch-dashboards] Lifecycle Probes #44

Open ghost opened 3 years ago

ghost commented 3 years ago

Is your feature request related to a problem? Please describe. I think there ought to be liveness/readiness/startup probes for the opensearch statefulset and the opensearch-dashboards deployment.

Currently the pods are immediately marked ready on startup, leading to more than one of the nodes being unavailable which can lead to instability especially with lots of data.

Describe the solution you'd like I am happy to implement something in a PR to resolve this, if that is something the project would want and my general ideas below sound good.

Opensearch: Readiness: query health endpoint, which may require whitelisting or auth, could potentially look for green cluster status Liveness: can check for tcp connection to 9200 or 9300 Startup: can check for health endpoint regardless of status?

Opensearch-dashboards: Readiness: Health is at least yellow in /api/status Liveness: 5601 is open to tcp Startup: same as readiness

Describe alternatives you've considered NA

Additional context I am happy to implement if the ideas sound good.

TheAlgo commented 3 years ago

I completely agree to this and it is kind of compulsory. I also wanted to have something like this included in readiness as well for opensearch apart from the ones mentioned.

ghost commented 3 years ago

That would be clever, do we want to include a sane default, or would we want to leave it disabled until the user chooses to enable it?

TheAlgo commented 3 years ago

That would be clever, do we want to include a sane default, or would we want to leave it disabled until the user chooses to enable it?

What if we can have options to choose between different probes. As in have a basic one and some advanced one as well. Just thinking to make the chart as robust as possible.

peterzhuamazon commented 3 years ago

Transfer to Helm-Charts repo for new PR #43

sastorsl commented 2 years ago

Regarding values.yaml and examples of lifecycle.postStart There is an example for opensearch, but none for opensearch-dashboards pr now.

For opensearch-dashboards I've implemented this, and it works. Dashboards takes over 60 seconds to start, depending on your HW of course, but leave plenty of leeway in the TIMEOUT setting.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "echo $(date +%F-%T) This is the end."]
  postStart:
    exec:
      command:
        - bash
        - -c
        - |
          #!/bin/bash
          # Wait for the application to start
          # NB! Update http / https if you have / don't have https.
          # Set a contextroot if you add one in opensearch_dashboards.yml / server.basePath
          URL=http://127.0.0.1:5601/<context-root-if-defined>
          TIMEOUT=180
          #
          # Wait until the server responds
          #
          timeout ${TIMEOUT} bash -c 'while ! [[ "$(curl -s -o /dev/null -w %{http_code}\\n '$URL')" =~ ^[23][0-9][0-9]$ ]]; do sleep 1; done'

(edited comment for clarity)

TheAlgo commented 2 years ago

@sastorsl As far as I know the probes that are added are not present in the statefulset. I will take a look and fix it.

sastorsl commented 2 years ago

@TheAlgo I see my previous comment was unclear. There is no example in values.yaml for opensearch-dashboards, but the above example addition to values.yaml works just fine, and the postStart is added and works quite well.

Ref https://github.com/opensearch-project/helm-charts/blob/main/charts/opensearch-dashboards/templates/deployment.yaml#L130

liveness and readiness probes would be a nice addition though.

peterzhuamazon commented 2 years ago

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Thanks.

sastorsl commented 2 years ago

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Fine by me, however I'm not OP though...

peterzhuamazon commented 2 years ago

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Fine by me, however I'm not OP though...

You guys have similar profile identicons...... @hgoscenski-imanage please check ^^

sastorsl commented 2 years ago

No worries, I think we cleared this one up nicely :-)

bitnik commented 2 years ago

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

What about opensearch chart and liveness and readiness probes? That PR doesn't cover these parts.

Here I want to explain shortly the issue we experienced today with opensearch chart:

After deploying opensearch chart in ansible we request /_cluster/health to check if number_of_nodes == 3 and status == 'green' and it failed, because number_of_nodes was actually 2. When we checked the pods, at first everything looked fine

kubectl get pods

NAME                                                          READY   STATUS    RESTARTS   AGE
opensearch-cluster-master-0                                   1/1     Running   0          7h43m
opensearch-cluster-master-1                                   1/1     Running   0          7h42m
opensearch-cluster-master-2                                   1/1     Running   0          7h42m
opensearch-dashboard-opensearch-dashboards-6c4777c7f6-ltn6q   1/1     Running   0          7h40m

but then when we looked at the logs of master pods, we found out that master-2 was not healthy

kubectl logs  opensearch-cluster-master-2 -f

Killing opensearch process 9
Killing performance analyzer process 10

We restarted master-2 pod, then number_of_nodes became 3 and then our deployment finished successfully.

That's why I think we should keep this ticket and implement liveness and readiness probes as well.

TheAlgo commented 2 years ago

@bitnik I agree with you. Readiness is a must for production use cases, it can be a basic one but it is must. It already exists in opensearch chart but it is not properly refactored so is not usable. I will try to fix it.

sandervandegeijn commented 2 years ago

Any news? :)

sandervandegeijn commented 1 year ago

Guys? :)