h2oai / wave

Realtime Web Apps and Dashboards for Python and R
https://wave.h2o.ai
Apache License 2.0

Waved panicked #2185

Closed mwysokin closed 2 months ago

mwysokin commented 7 months ago

Wave SDK Version, OS

0.26.2, Kubernetes (Managed Cloud)

Actual behavior

A Wave app crashed, but for some reason the container stayed up. This caused an outage for at least one customer.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x77b60a]

goroutine 9288 [running]:
github.com/h2oai/wave.(*App).send(0xc0001aa000, {0x0, 0x0}, 0xc0001781e0, {0xb07f40, 0x2a, 0x2a})
    /home/runner/work/wave/wave/app.go:112 +0x54a
github.com/h2oai/wave.(*App).forward(0xc0001aa000, {0x0?, 0x0?}, 0x0?, {0xb07f40?, 0xc0004e0050?, 0x8ef900?})
    /home/runner/work/wave/wave/app.go:89 +0x2f
github.com/h2oai/wave.(*Broker).resetClients.func1(0xc0004e8390?)
    /home/runner/work/wave/wave/broker.go:214 +0x36
created by github.com/h2oai/wave.(*Broker).resetClients
    /home/runner/work/wave/wave/broker.go:213 +0x98

A very similar panic happened at least once before: https://github.com/h2oai/wave/discussions/1949

dulajra commented 7 months ago

@mturoci Can we know if the port (10101) is still open even though waved has crashed? In the MLOps Wave app we ping the TCP port as the container's health check, so if the port is still open, the container will still be detected as healthy.

cc: @ShehanIshanka
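For illustration, here is a minimal sketch of the difference between a TCP-only probe like the one described above and a probe that also requires an HTTP response. It is not taken from the MLOps Wave app or from waved: the function names, the timeouts, and the URL are hypothetical, and the thread does not establish which HTTP path, if any, is suitable for probing waved.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    This only proves a listener exists; it says nothing about whether
    the process behind it can still serve requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the server answers the request with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, non-2xx responses
        # (HTTPError is a URLError subclass) and malformed replies.
        return False


if __name__ == "__main__":
    # Hypothetical target: host, port and path are placeholders.
    print("tcp :", tcp_port_open("127.0.0.1", 10101))
    print("http:", http_healthy("http://127.0.0.1:10101/"))
```

If the two probes disagree (TCP succeeds while HTTP fails), the port is held open by something that is no longer serving requests, which is the situation being asked about here.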

mturoci commented 7 months ago

> Can we know if the port (10101) is still open even though waved has crashed?

You can check, but I would be surprised if that was the case.

Why would your app crash connecting to waved, but healthcheck would pass?

dulajra commented 7 months ago

> You can check, but I would be surprised if that was the case.

Are there any steps to reproduce it locally or on a dev environment?

mturoci commented 6 months ago

Closing since I am not able to repro. This seems like a Keycloak misconfiguration.

The panic happens because the token is nil, which should never happen according to the docs. That makes me believe the root cause is an auth provider misconfiguration of some sort.

Feel free to reopen in case you manage to repro.

gabrielstar commented 6 months ago

It also happened on our dev instances: https://h2oai.slack.com/archives/C068QB11XV4/p1702298998164059

dulajra commented 5 months ago

Now it's happening on cloud-qa too https://h2oai.slack.com/archives/G01C9KKQLAC/p1704455231835909

codyharris-h2o-ai commented 5 months ago

Seeing this in 23.10.0 testing as well

mturoci commented 5 months ago

@codyharris-h2o-ai what app? @dulajra the link is dead

mwysokin commented 5 months ago

Just FYI: the debug version of Wave has been deployed in both MC and cloud-qa.

mwysokin commented 5 months ago

@codyharris-h2o-ai If you see it during release testing, maybe you could use this image instead: "gcr.io/vorvan/h2oai/mlops-wave-app-standalone:0.62.1-resourcefix-debugpanic". It is for debugging purposes only and shouldn't ship as part of the release. @mturoci kindly implemented some additional logic to help with debugging.

codyharris-h2o-ai commented 5 months ago

@mturoci the mlops wave ui

I'm not sure how often we're running into this

codyharris-h2o-ai commented 5 months ago

Another customer is seeing this in their production environment for their first-party app.

dulajra commented 4 months ago

Another occurrence of this on internal.dedicated https://h2oai.slack.com/archives/C8MA5HGUU/p1708600075172279

wave-app.log

codyharris-h2o-ai commented 4 months ago

@dulajra, which version of Wave? We have seen positive results using 1.0.2

mturoci commented 2 months ago

Closed in #2246. Feel free to reopen if it appears in recent versions.