Web container crashes when health probes are configured

lindhe commented 6 months ago

Package version (if known): 0.2.1 (commit f7fd3c67a8cf2483c67893304893908040637647).

Describe the bug

The crashes of the web container in the web pod seems to begins just after helm install and continues even after wipe_recreate.sh (so it's not just that the system is unintialized). These are uncaught errors, so it's not very easy to know exactly what goes wrong here. I have attached the logs from the web container here, which shows two notable things:

Despite this being Python, we get a segfault. So there's obviously a buggy binary included somewhere here (myabe uwsgi).
It prints SIGINT/SIGTERM received...killing workers... and KeyboardInterrupt. Those signals are not coming from Kubernetes, because we have not configured a liveness probe (only startup and readiness). This indicates to me that there may be some built in health checks that's getting in the way for us here.

I want to emphasize that while a liveness probe may cause a SIGTERM signal, neither the startup nor readiness probes will cause that. The only thing they do, really, is to prevent traffic being routed there via the Service in Kubernetes. This further convinces me that we have a some built-in health probe in Invenio and/or uWSGI that freaks out, possibly under the condition that traffic to Invenio's hostname does not reach its destination.

This is a major issue, as far as I'm concerned, and we really need to fix it before this chart can be considered stable.

Steps to Reproduce

/tmp/values.yaml

```yaml haproxy: enabled: false host: invenio.example.com invenio: secret_key: "secret-key" security_login_salt: "security_login_salt" csrf_secret_salt: "csrf_secret_salt" ingress: annotations: cert-manager.io/cluster-issuer: letsencrypt-prod enabled: true class: nginx postgresql: enabled: true search: enabled: true web: image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4 replicas: 1 resources: requests: cpu: 500m memory: 4Gi limits: cpu: 1000m memory: 4Gi worker: image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4 replicas: 1 resources: requests: cpu: "2" memory: 1Gi limits: cpu: "2" memory: 1Gi workerBeat: resources: requests: cpu: "2" memory: 1Gi limits: cpu: "2" memory: 1Gi ```

helm install -n invenio-dev invenio ./charts/invenio/ -f /tmp/values.yaml --set 'rabbitmq.auth.password=hn6mqDjygkjhgkhjgzBKrsNNkao,postgresql.auth.password=P6ph7jkhgkjhgkjNkao'

Expected behavior

The web container should be able to have health probes configured without crashing.

Additional context

In #52, @avivace has posted a similar issue. Because he talks about "the liveness probe", I'm assume it's a different issue (since there is no liveness probe configured for the web pod). Unfortunately, the link he posted was not a permalink so it is broken now and I cannot verify what he means. So I opened this new issue instead.

When I've been working on improving this Helm chart, I have had to remove the health probes every time I install it to test something. For reference, this is what I run to remove the health probes and check that the pod is replaced:

OLD_WEB_RS="$(kubectl get rs -l app=web -o name)"
kubectl patch deployment web --type=json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
kubectl scale "${OLD_WEB_RS:?}" --replicas 0
kubectl rollout status deployment web

Logs

web.log

avivace commented 6 months ago

Hi @lindhe , did you mean to reference to https://github.com/inveniosoftware/helm-invenio/issues/52 instead of https://github.com/inveniosoftware/helm-invenio/issues/51? If so, this is what my link was pointing at the time: https://github.com/inveniosoftware/helm-invenio/blob/ab805e2e800619d8f464eb662bf28cf80a83bb14/invenio/templates/deployments/web.yaml#L93

lindhe commented 6 months ago

Yes, that's a typo! Sorry. It's #52 I mean.

Ah, right. So you did mean readiness probe and not liveness probe then? Typos are great! 😄

Should I close this ticket then and repost as a comment in your original Issue, or do we keep this one?

mirekys commented 4 months ago

Hi @lindhe, the logs you attached suggests it is triggering some known race condition in greenlet lib used by uwsgi when shutting down invenio app. That seems to be fixed in greenlet>=3, but your demo image still has:

greenlet==2.0.2

(I checked also the latest demo-inveniordm:12.0.0-beta.1.3 release, but seems to use the same version & also suffers with segfaults)

After upgrading it to latest 3.x:

#Dockerfile
FROM ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3

# Update greenlet to fixed version
RUN pip install greenlet">=3"
# See info below
ADD probe.sh /usr/local/bin/probe.sh
RUN chmod +x /usr/local/bin/probe.sh

docker build . -t demo-inveniordm:local

and replace: ...../demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4 with docker.io/library/demo-inveniordm:local

I'm now getting some even more interesting parts of the original Python traceback: web-greenlet3.log

And its for sure triggered by uwsgi_curl probe.

UPDATE: I managed to fix the probe, it has to do with DB being not yet created, hostname, and shell env when it is run by k8s. If I put it inside script as:

#!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping

and change probe cmds to:

exec:
          command:
            - /bin/bash
            - -c
            - "/usr/local/bin/probe.sh"

and initialize DB by running:

invenio db create

it seems to work properly (do not ask me why 😅)

lindhe commented 4 months ago

Excellent findings, thank you for looking into this! I will try to verify and investigate further based on this. I'll be back! 😎

ntarocco commented 4 months ago

UPDATE: I managed to fix the probe, it has to do with DB being not yet created, hostname, and shell env when it is run by k8s. If I put it inside script as:
#!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping
and change probe cmds to:
exec:
          command:
            - /bin/bash
            - -c
            - "/usr/local/bin/probe.sh"
and initialize DB by running:
invenio db create
it seems to work properly (do not ask me why 😅)

Thank you for checking this and finding that it is related to the probe, I think that we were missing to find the perpetrator :) We discussed the DB part in the forum in Discord, and all probes will be working only after the DB init, basically when the Invenio app is up and running correctly. BTW, we did not use the bash file but the command directly so we could easily use a variable for the Host header.

mirekys commented 4 months ago

I share & summarize here my observations from further testing and verifications:

With threads >= 4 & processes >= 6, it always fails to initialize Flask app at pretty much random spots when loading entrypoints. This happens on my 16 detected cores setup.
When threads and/or processes are lower (4&5, 3&5,...), it depends on whether invenio db create has been called:
1. not called - throws the following sqlalchemy error and dies miserably:
```
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "banners" does not exist
LINE 2: FROM banners 
     ^
```

[SQL: SELECT banners.created AS banners_created, banners.updated AS banners_updated, banners.id AS banners_id, banners.message AS banners_message, banners.url_path AS banners_url_path, banners.category AS banners_category, banners.start_datetime AS banners_start_datetime, banners.end_datetime AS banners_end_datetime, banners.active AS banners_active FROM banners WHERE banners.active IS true AND banners.start_datetime <= %(start_datetime_1)s AND (banners.end_datetime IS NULL OR banners.end_datetime >= %(end_datetime_1)s) AND (banners.url_path IS NULL OR (%(param_1)s LIKE banners.url_path || '%%'))] [parameters: {'start_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'end_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'param_1': '/ping'}] (Background on this error at: https://sqlalche.me/e/14/f405) [pid: 7|app: 0|req: 3/3] () {16 vars in 188 bytes} [Tue Apr 16 13:28:16 2024] HEAD /ping => generated 0 bytes in 8 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 3) SIGINT/SIGTERM received...killing workers... terminate called after throwing an instance of 'std::runtime_error' what(): Accessing state after destruction. worker 1 buried after 1 seconds


2. **not called, using API probe** - when I changed `uwsgi_curl` to query `/api/ping` instead of `/ping` it works independently of whether DB is fully initialized or not
3. **called** - both API & UI endpoint probes works
- It also seems to depend on whether I use `$(hostname)` or `127.0.0.1` in `uwsgi_curl` command:
1. **localhost** - it works
2. **`$(hostname)`** - dies miserably after first probe check

SIGINT/SIGTERM received...killing workers... terminate called after throwing an instance of 'std::runtime_error' what(): Accessing state after destruction.terminate called after throwing an instance of ' std::runtime_error' [deadlock-detector] a process holding a robust mutex died. recovering... what(): Accessing state after destruction. worker 1 buried after 1 seconds worker 2 buried after 1 seconds goodbye to uWSGI.

inveniosoftware / helm-invenio