thoth-station / package-update-job

Aggregate updates from Packages hosted in Indices...
GNU General Public License v3.0

add liveness/readiness probes #140

Open goern opened 4 years ago

goern commented 4 years ago

Is your feature request related to a problem? Please describe.
Add a liveness/readiness probe to the Faust producer deployment so that we can check whether package-update is working.

Describe the solution you'd like

Describe alternatives you've considered
No probes.

Additional context
We should be able to do basic testing of whether a new version of package-update is deployable and runnable.

fridex commented 4 years ago

This can be generalized to any message producer in the deployment.

fridex commented 4 years ago

https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html

fridex commented 4 years ago

> This can be generalized to any message producer in the deployment.

Except for the user-facing API.

eagleusb commented 4 years ago

Hi there :wave:, I'm willing to help with adding liveness and readiness probes to the package-update-job CronJob in your OKD cluster.

Do you have 5 minutes to onboard me on your release management process? Also, where does the package-update-job Kubernetes recipe live?

Thanks, Leslie

fridex commented 4 years ago

Hi!

Thanks for your interest!

We deploy from the https://github.com/thoth-station/thoth-application repo; you can find the package-update related bits in https://github.com/thoth-station/thoth-application/blob/master/package-update/base/cronjob.yaml

There is already an existing liveness probe that could be changed:

https://github.com/thoth-station/thoth-application/blob/c29bfd2334bd9d1c63137938932699b954fd47d9/package-update/base/cronjob.yaml#L91-L96

It's worth considering whether the liveness probe should be more sophisticated. See also the related discussion at https://github.com/robinhood/faust/issues/286.
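
For reference, a container-level liveness probe on a CronJob is declared under the pod template, roughly as in the sketch below. This is a generic exec-based example, not the exact contents of the linked cronjob.yaml; the schedule, image, command and thresholds are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: package-update
spec:
  schedule: "0 */6 * * *"                 # placeholder schedule
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-update
              image: quay.io/thoth-station/package-update-job  # placeholder image reference
              livenessProbe:
                exec:
                  command: ["stat", "/tmp/healthy"]   # placeholder health check command
                initialDelaySeconds: 10
                periodSeconds: 60
                failureThreshold: 3
          restartPolicy: Never
```

A more sophisticated probe would replace the placeholder command with a check that the Faust producer is actually making progress, which is what the linked discussion is about.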

Sadly, we do not have any public instance accessible for you to test your changes, but we can definitely cooperate.

F.

CC @KPostOffice @saisankargochhayat @pacospace

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

fridex commented 3 years ago

/remove-lifecycle stale

KPostOffice commented 3 years ago

@fridex package-update is a cronjob, do probes make sense here?

fridex commented 3 years ago

> @fridex package-update is a cronjob, do probes make sense here?

It makes sense to have a mechanism to kill the pod if it fails for some reason. Previously, we had issues from time to time where a pod was stuck in the Pending state or in the Running state (but the Python interpreter was not running) due to some cluster issue. To prevent that, it might be a good idea to configure activeDeadlineSeconds, also for the other CronJobs we have.
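
A minimal sketch of what that could look like on the CronJob; the deadline, retry limit and other values below are hypothetical and would need tuning per job:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: package-update
spec:
  schedule: "0 */6 * * *"            # placeholder schedule
  concurrencyPolicy: Forbid          # avoid overlapping runs piling up behind a stuck pod
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600    # fail the whole Job if it runs longer than one hour
      backoffLimit: 1                # limit retries of failed pods
      template:
        spec:
          containers:
            - name: package-update
              image: quay.io/thoth-station/package-update-job  # placeholder image reference
          restartPolicy: Never
```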

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/package-update-job/issues/140#issuecomment-904277005):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fridex commented 3 years ago

/remove-lifecycle rotten
/reopen
/triage accepted

sesheta commented 3 years ago

@fridex: Reopened this issue.

In response to [this](https://github.com/thoth-station/package-update-job/issues/140#issuecomment-904316972):

> /remove-lifecycle rotten
> /reopen
> /triage accepted

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/package-update-job/issues/140#issuecomment-925517092):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

codificat commented 2 years ago

> Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

The bot was misbehaving at that time, as the issue was not flagged as rotten at that point. Fixing that and adding a priority.

/reopen
/remove-lifecycle rotten
/priority backlog

sesheta commented 2 years ago

@codificat: Reopened this issue.

In response to [this](https://github.com/thoth-station/package-update-job/issues/140#issuecomment-954565326):

> > Rotten issues close after 30d of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> The bot was misbehaving at that time, as the issue was not flagged as rotten at that point. Fixing that and adding a priority.
>
> /reopen
> /remove-lifecycle rotten
> /priority backlog

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

fridex commented 2 years ago

/remove-lifecycle stale

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

codificat commented 2 years ago

/lifecycle frozen

VannTen commented 2 years ago

/sig devsecops

I don't think we receive traffic in the pod, so a readiness probe would not really make sense. For a liveness probe, isn't it simpler to just exit with an error code while logging the error, instead of adding code to check, since the probe only results in a SIGTERM anyway?

KPostOffice commented 2 years ago

> /sig devsecops
>
> I don't think we receive traffic in the pod, so a readiness probe would not really make sense. For a liveness probe, isn't it simpler to just exit with an error code while logging the error, instead of adding code to check, since the probe only results in a SIGTERM anyway?

See the comment above about using activeDeadlineSeconds instead. This issue could probably do with an edit.

VannTen commented 2 years ago

> It makes sense to have a mechanism to kill the pod if it fails for some reason. Previously, we had issues from time to time where a pod was stuck in the Pending state or in the Running state (but the Python interpreter was not running) due to some cluster issue. To prevent that, it might be a good idea to configure activeDeadlineSeconds, also for the other CronJobs we have.

Not the same level, though: activeDeadlineSeconds is for failing the Job, regardless of the pod(s). Excerpt from the docs: "The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded".
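
To make the distinction concrete: the field exists both on the Job spec (it fails the Job as a whole, as quoted above) and on the Pod spec (it only terminates that pod). A sketch with hypothetical values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: package-update-manual-run   # hypothetical one-off Job
spec:
  activeDeadlineSeconds: 3600       # Job-level: the Job fails with reason DeadlineExceeded
  template:
    spec:
      activeDeadlineSeconds: 3000   # Pod-level: only this pod is marked failed when exceeded
      containers:
        - name: package-update
          image: quay.io/thoth-station/package-update-job  # placeholder image reference
      restartPolicy: Never
```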

> This issue could probably do with an edit.

Agreed. @goern could you clarify what the end goal is? In particular, re-reading the description:

> Add a liveness/readiness probe to the Faust producer deployment so that we can check whether package-update is working.
>
> We should be able to do basic testing of whether a new version of package-update is deployable and runnable.

-> these seem like two different things (operations vs. testing).

goern commented 2 years ago

From my point of view, it is an operational problem: package-update is a critical component, therefore we need to observe whether it is working correctly. activeDeadlineSeconds seems to be a technical solution for 'an auto-healing attempt', but it does not help with observing this service.

Maybe we should close this and #49 and restate it as:

As a Thoth Operator, I want to observe the package-update-job, so that I can figure out if it is being executed, and so that in the case of its failure a support issue is opened.

WDYT?

VannTen commented 2 years ago

On Wed, Sep 21, 2022 at 11:49:42PM -0700, Christoph Görn wrote:

> From my point of view, it is an operational problem: package-update is a critical component, therefore we need to observe whether it is working correctly. activeDeadlineSeconds seems to be a technical solution for 'an auto-healing attempt', but it does not help with observing this service.

That sums it up. activeDeadlineSeconds would prevent a deadlock/livelock from going undetected for too long.

> Maybe we should close this and #49 and restate it as:
>
> As a Thoth Operator, I want to observe the package-update-job, so that I can figure out if it is being executed, and so that in the case of its failure a support issue is opened.
>
> WDYT?

Yes, I think we should rephrase the issue.

The metrics for observing job state are already there (kube-state-metrics has kube_job_failed, it seems), so we would create an alert and maybe use something like https://github.com/m-lab/alertmanager-github-receiver to hook it up to GitHub issues.
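
A sketch of such an alert as a PrometheusRule, assuming the kube-state-metrics metric is kube_job_status_failed; the rule name, namespace and thresholds are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: package-update-job-alerts        # hypothetical rule name
spec:
  groups:
    - name: package-update-job
      rules:
        - alert: PackageUpdateJobFailed
          expr: kube_job_status_failed{namespace="thoth", job_name=~"package-update.*"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "package-update job failed"
            description: "A package-update Job has failed pods; check the Job status and logs."
```

Alertmanager could then route that alert to an alertmanager-github-receiver webhook so that a support issue is opened automatically, as described above.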

By the way, that approach (activeDeadlineSeconds or, more broadly, a timeout, which I suppose can be expressed in different ways for Argo/Tekton etc.) would scale to the other jobs we have.

For example, https://github.com/thoth-station/thoth-application/issues/2604 was an instance of a job not finishing (well, the pod was crashing but kept being restarted).