lindhe opened 6 months ago
Hi @lindhe, did you mean to reference https://github.com/inveniosoftware/helm-invenio/issues/52 instead of https://github.com/inveniosoftware/helm-invenio/issues/51? If so, this is what my link pointed to at the time: https://github.com/inveniosoftware/helm-invenio/blob/ab805e2e800619d8f464eb662bf28cf80a83bb14/invenio/templates/deployments/web.yaml#L93
Yes, that's a typo, sorry! I meant #52.
Ah, right. So you did mean the readiness probe and not the liveness probe then? Typos are great!
Should I close this ticket then and repost as a comment in your original Issue, or do we keep this one?
Hi @lindhe, the logs you attached suggest it is triggering a known race condition in the greenlet library used by uWSGI when shutting down the Invenio app. That seems to be fixed in greenlet>=3, but your demo image still has:

```
greenlet==2.0.2
```

(I also checked the latest demo-inveniordm:12.0.0-beta.1.3 release, but it seems to use the same version and also suffers from segfaults.)
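One way to check the version pinned in the image yourself, as a sketch (using `--entrypoint` since the image's default entrypoint may not be a shell):

```bash
# Sketch: print the greenlet version shipped in the demo image
docker run --rm --entrypoint pip \
  ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3 \
  show greenlet
```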
After upgrading it to the latest 3.x:

```dockerfile
# Dockerfile
FROM ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm:12.0.0-beta.1.3

# Update greenlet to a fixed version
RUN pip install "greenlet>=3"

# See info below
ADD probe.sh /usr/local/bin/probe.sh
RUN chmod +x /usr/local/bin/probe.sh
```

```bash
docker build . -t demo-inveniordm:local
```
and replaced:

```
...../demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
```

with `docker.io/library/demo-inveniordm:local`.
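For a values-file override, that swap might look like this (a sketch, based on the `web.image` field used in the reproduction values further below):

```yaml
# Sketch: point the web image at the locally built one
web:
  image: docker.io/library/demo-inveniordm:local
```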
I'm now getting some even more interesting parts of the original Python traceback: web-greenlet3.log

And it's for sure triggered by the uwsgi_curl probe.
UPDATE: I managed to fix the probe. It has to do with the DB not yet being created, the hostname, and the shell env when it is run by k8s. If I put it inside a script like this:
```bash
#!/usr/bin/env bash
# probe.sh
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping
```
and change the probe commands to:
```yaml
exec:
  command:
    - /bin/bash
    - -c
    - "/usr/local/bin/probe.sh"
```
and initialize the DB by running:

```
invenio db create
```

it seems to work properly (do not ask me why).
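In a running cluster, one way to run that initialization, as a sketch (the Deployment name `web` is an assumption, adjust to your release):

```bash
# Sketch: run the DB initialization inside the running web pod
kubectl exec deploy/web -- invenio db create
```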
Excellent findings, thank you for looking into this! I will try to verify and investigate further based on this. I'll be back!
Thank you for checking this and finding that it is related to the probe; I think we were missing the perpetrator :)

We discussed the DB part in the forum on Discord, and all probes will be working only after the DB init, basically when the Invenio app is up and running correctly.

BTW, we did not use the bash file but the command directly, so we could easily use a variable for the Host header.
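For illustration, such a probe might look roughly like this (a sketch; `INVENIO_HOSTNAME` is a hypothetical placeholder for whatever variable carries the configured host, not necessarily what the chart uses):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      # INVENIO_HOSTNAME is a placeholder for the configured host
      - '/usr/local/bin/uwsgi_curl -X HEAD -H "Host: ${INVENIO_HOSTNAME}" 127.0.0.1:5000 /ping'
```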
I share & summarize here my observations from further testing and verification:
- It seems to depend on whether `invenio db create` has been called:
  1. **not called, using UI probe** - querying `/ping` fails with:

     ```
     sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "banners" does not exist
     LINE 2: FROM banners
                  ^
     [SQL: SELECT banners.created AS banners_created, banners.updated AS banners_updated, banners.id AS banners_id, banners.message AS banners_message, banners.url_path AS banners_url_path, banners.category AS banners_category, banners.start_datetime AS banners_start_datetime, banners.end_datetime AS banners_end_datetime, banners.active AS banners_active FROM banners WHERE banners.active IS true AND banners.start_datetime <= %(start_datetime_1)s AND (banners.end_datetime IS NULL OR banners.end_datetime >= %(end_datetime_1)s) AND (banners.url_path IS NULL OR (%(param_1)s LIKE banners.url_path || '%%'))]
     [parameters: {'start_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'end_datetime_1': datetime.datetime(2024, 4, 16, 11, 28, 16, 684029), 'param_1': '/ping'}]
     (Background on this error at: https://sqlalche.me/e/14/f405)
     [pid: 7|app: 0|req: 3/3] () {16 vars in 188 bytes} [Tue Apr 16 13:28:16 2024] HEAD /ping => generated 0 bytes in 8 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 3)
     SIGINT/SIGTERM received...killing workers...
     terminate called after throwing an instance of 'std::runtime_error'
       what():  Accessing state after destruction.
     worker 1 buried after 1 seconds
     ```
  2. **not called, using API probe** - when I changed `uwsgi_curl` to query `/api/ping` instead of `/ping`, it works independently of whether the DB is fully initialized or not
  3. **called** - both API & UI endpoint probes work
- It also seems to depend on whether I use `$(hostname)` or `127.0.0.1` in the `uwsgi_curl` command:
  1. **localhost** - it works
  2. **`$(hostname)`** - dies miserably after the first probe check:
     ```
     SIGINT/SIGTERM received...killing workers...
     terminate called after throwing an instance of 'std::runtime_error'
       what():  Accessing state after destruction.
     terminate called after throwing an instance of 'std::runtime_error'
     [deadlock-detector] a process holding a robust mutex died. recovering...
       what():  Accessing state after destruction.
     worker 1 buried after 1 seconds
     worker 2 buried after 1 seconds
     goodbye to uWSGI.
     ```
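For reference, the two variants tested, spelled out (a sketch mirroring the probe command above):

```bash
# works:
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" 127.0.0.1:5000 /ping
# dies after the first probe check:
/usr/local/bin/uwsgi_curl -X HEAD -H "Host: invenio.example.com" $(hostname):5000 /ping
```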
Package version (if known): 0.2.1 (commit f7fd3c67a8cf2483c67893304893908040637647).
Describe the bug
The crashes of the web container in the web pod seem to begin just after `helm install` and continue even after `wipe_recreate.sh` (so it's not just that the system is uninitialized). These are uncaught errors, so it's not very easy to know exactly what goes wrong here. I have attached the logs from the web container here, which show two notable things: `SIGINT/SIGTERM received...killing workers...` and `KeyboardInterrupt`. Those signals are not coming from Kubernetes, because we have not configured a liveness probe (only startup and readiness). This indicates to me that there may be some built-in health checks that are getting in the way for us here.

I want to emphasize that while a liveness probe may cause a `SIGTERM` signal, neither the startup nor readiness probes will cause that. The only thing they do, really, is prevent traffic from being routed there via the Service in Kubernetes. This further convinces me that we have some built-in health probe in Invenio and/or uWSGI that freaks out, possibly under the condition that traffic to Invenio's hostname does not reach its destination.

This is a major issue, as far as I'm concerned, and we really need to fix it before this chart can be considered stable.
Steps to Reproduce
`/tmp/values.yaml`:

```yaml
haproxy:
  enabled: false
host: invenio.example.com
invenio:
  secret_key: "secret-key"
  security_login_salt: "security_login_salt"
  csrf_secret_salt: "csrf_secret_salt"
ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  enabled: true
  class: nginx
postgresql:
  enabled: true
search:
  enabled: true
web:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 4Gi
    limits:
      cpu: 1000m
      memory: 4Gi
worker:
  image: ghcr.io/inveniosoftware/demo-inveniordm/demo-inveniordm@sha256:2193abc2caec9bc599061d6a5874fd2d7d201f55d1673a545af0a0406690e8a4
  replicas: 1
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi
workerBeat:
  resources:
    requests:
      cpu: "2"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 1Gi
```
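Then install the chart with these values; a sketch of the command (the release name and local chart path are assumptions, adjust to your checkout):

```bash
# Sketch: install the chart from a local clone of helm-invenio
helm install invenio ./invenio -f /tmp/values.yaml
```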
Expected behavior
The web container should be able to have health probes configured without crashing.
Additional context
In #52, @avivace has posted a similar issue. Because he talks about "the liveness probe", I assume it's a different issue (since there is no liveness probe configured for the web pod). Unfortunately, the link he posted was not a permalink, so it is broken now and I cannot verify what he means. So I opened this new issue instead.
While working on improving this Helm chart, I have had to remove the health probes every time I install it to test something. For reference, this is what I run to remove the health probes and check that the pod is replaced:
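A minimal sketch of such a command, assuming the chart's web Deployment is simply named `web` (the exact invocation is not shown in this thread):

```bash
# Sketch: strip the startup/readiness probes from the first container of the
# assumed "web" Deployment, then watch the pod being replaced.
kubectl patch deployment web --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'
kubectl get pods --watch
```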
Logs
web.log