Closed by mwysokin 2 months ago
@mturoci Can we know if the port (10101) is still open even though waved has crashed? In the MLOps Wave app we ping the TCP port as the container's health check, so if the port is still open, the container will still be detected as healthy.
cc: @ShehanIshanka
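For reference, here is a minimal sketch of the difference between a TCP-level probe and an HTTP-level one. It assumes waved serves HTTP on port 10101. The function names `tcp_port_open` and `http_responds` are hypothetical helpers, not part of Wave or of the MLOps app.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False


def http_responds(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers an HTTP request at all (any status)."""
    # Bypass any proxy configured in the environment so that the probe
    # talks to the target directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with a 4xx/5xx status code.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # No connection, no response within the timeout, or a malformed reply.
        return False
```

A TCP probe such as `tcp_port_open("localhost", 10101)` only shows that something is still accepting connections on the port. `http_responds("http://localhost:10101/")` additionally requires the process to answer a request, so it may be a stricter signal for a container health check.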
> Can we know if the port (10101) is still open even though waved has crashed
You can check, but I would be surprised if that was the case.
Why would your app crash connecting to waved while the health check passes?
> You can check, but I would be surprised if that was the case.
Are there any steps to reproduce it locally or on a dev environment?
Closing since I was not able to repro; this seems like a Keycloak misconfiguration.
The panic is caused by token being nil, which should never happen according to the docs, so I believe the root cause is an auth provider misconfiguration of some sort.
Feel free to reopen in case you manage to repro.
It also happened on our dev instances: https://h2oai.slack.com/archives/C068QB11XV4/p1702298998164059
Now it's happening on cloud-qa too https://h2oai.slack.com/archives/G01C9KKQLAC/p1704455231835909
Seeing this in 23.10.0 testing as well
@codyharris-h2o-ai what app? @dulajra the link is dead
Just FYI: the debug version of Wave has been deployed in both MC and cloud-qa.
@codyharris-h2o-ai If you see it during release testing, maybe you could use this image instead: "gcr.io/vorvan/h2oai/mlops-wave-app-standalone:0.62.1-resourcefix-debugpanic". It is just for debugging purposes and shouldn't be shipped as part of the release. @mturoci kindly implemented some additional logic to help with debugging.
@mturoci the mlops wave ui
I'm not sure how often we're running into this
Another customer is seeing this in their production environment for their first-party app
Another occurrence of this on internal.dedicated https://h2oai.slack.com/archives/C8MA5HGUU/p1708600075172279
@dulajra, which version of Wave? We have seen positive results using 1.0.2
Closed in #2246. Feel free to reopen if it appears in recent versions.
Wave SDK Version, OS
0.26.2, Kubernetes (Managed Cloud)
Actual behavior
A Wave app crashed, but for some reason the container stayed up. This caused an outage for at least one customer.
A very similar panic happened at least once before: https://github.com/h2oai/wave/discussions/1949