h2oai / wave

Realtime Web Apps and Dashboards for Python and R
https://wave.h2o.ai
Apache License 2.0

"read: connection reset by peer" makes app unusable #2304

Closed g-eoj closed 2 months ago

g-eoj commented 3 months ago

Wave SDK Version, OS

Wave 1.1.1, H2O Cloud app store

Actual behavior

If a message like 2024/03/28 14:19:14 # {"error":"request failed: Post \"http://127.0.0.1:8000\": read tcp 127.0.0.1:35236-\u003e127.0.0.1:8000: read: connection reset by peer","host":"http://127.0.0.1:8000","route":"/","t":"app"} appears in the logs, the app gets stuck:

[Screenshot 2024-03-28 at 7 46 12 AM]

Refreshing the page or opening a new window does not fix it:

2024/03/28 14:19:14 # {"error":"request failed: Post \"http://127.0.0.1:8000/\": read tcp 127.0.0.1:35236-\u003e127.0.0.1:8000: read: connection reset by peer","host":"http://127.0.0.1:8000/","route":"/","t":"app"}
2024/03/28 14:19:14 # {"route":"/","t":"app_drop"}
2024/03/28 14:19:15 # {"addr":"108.205.202.15, 10.1.152.76","client_id":"ad099ed7-ce1f-45ac-9dbb-fb69a4e99ee6","t":"client_reconnect"}
2024/03/28 14:19:15 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}
2024/03/28 14:19:15 # {"addr":"108.205.202.15, 10.1.152.76","route":"/","t":"ui_add"}
2024/03/28 14:19:16 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}
2024/03/28 14:19:16 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}
2024/03/28 14:19:19 # {"client":"22ba5928-7dd8-43f4-97d7-8a238035d7c5","t":"client_unsubscribe"}
2024/03/28 14:19:19 # {"addr":"108.205.202.15, 10.1.131.124","t":"ui_drop"}
2024/03/28 14:19:19 # {"client":"6a41dc7c-9748-457d-9a21-96c386cc0ceb","t":"client_unsubscribe"}
2024/03/28 14:19:19 # {"addr":"108.205.202.15, 10.1.152.76","t":"ui_drop"}
2024/03/28 14:19:19 # {"client":"ed818b4d-1e89-49a1-86b9-995e1bf20a27","t":"client_unsubscribe"}
2024/03/28 14:19:19 # {"addr":"108.205.202.15, 10.1.152.76","t":"ui_drop"}
2024/03/28 14:19:20 # {"client":"ad099ed7-ce1f-45ac-9dbb-fb69a4e99ee6","t":"client_unsubscribe"}
2024/03/28 14:19:20 # {"addr":"108.205.202.15, 10.1.152.76","t":"ui_drop"}
2024/03/28 14:20:38 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}
2024/03/28 14:20:42 # {"client":"a77b57d4-42b0-4dca-8c7a-664807b0e76c","t":"client_unsubscribe"}
2024/03/28 14:20:42 # {"addr":"108.205.202.15, 10.1.131.124","t":"ui_drop"}
2024/03/28 14:20:56 # {"addr":"108.205.202.15, 10.1.152.76","client_id":"924c80b4-cff2-45b8-9c70-4a283bba8668","t":"client_reconnect"}
2024/03/28 14:20:56 # {"addr":"108.205.202.15, 10.1.152.76","route":"/","t":"ui_add"}
2024/03/28 14:21:01 # {"client":"924c80b4-cff2-45b8-9c70-4a283bba8668","t":"client_unsubscribe"}
2024/03/28 14:21:01 # {"addr":"108.205.202.15, 10.1.152.76","t":"ui_drop"}
2024/03/28 14:21:03 # {"client":"094c1a1a-61b5-435b-b91e-e8ec55bd2e29","t":"client_unsubscribe"}
2024/03/28 14:21:03 # {"addr":"108.205.202.15, 10.1.131.124","t":"ui_drop"}
2024/03/28 14:21:04 # {"client":"c613e622-d8b6-47f8-941a-4269e9347d9d","t":"client_unsubscribe"}
2024/03/28 14:21:04 # {"addr":"108.205.202.15, 10.1.152.76","t":"ui_drop"}
2024/03/28 14:21:04 # {"client":"3cc46f9c-17cd-4700-ba00-e884ee018156","t":"client_unsubscribe"}
2024/03/28 14:21:04 # {"addr":"108.205.202.15, 10.1.131.124","t":"ui_drop"}
2024/03/28 14:21:11 # {"addr":"108.205.202.15, 10.1.131.124","client_id":"901a4f97-152f-4f43-a5c0-39ba16c17cac","t":"client_reconnect"}
2024/03/28 14:21:11 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}
2024/03/28 14:21:16 # {"client":"901a4f97-152f-4f43-a5c0-39ba16c17cac","t":"client_unsubscribe"}
2024/03/28 14:21:16 # {"addr":"108.205.202.15, 10.1.131.124","t":"ui_drop"}
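
To check whether a longer log shows the same pattern, the lines above can be scanned for the `app_drop` event that follows the failed request. This is only a minimal sketch, assuming every line keeps the `<date> <time> # <JSON>` layout seen in this log; the helper name `find_events` is hypothetical and not part of Wave.

```python
import json


def find_events(log_text, event_type):
    """Return (timestamp, record) pairs whose "t" field equals event_type."""
    matches = []
    for line in log_text.splitlines():
        # Assumed layout: 2024/03/28 14:19:14 # {...json...}
        timestamp, sep, payload = line.partition(" # ")
        if not sep:
            continue  # line does not follow the assumed layout
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # payload is not valid JSON, skip it
        if isinstance(record, dict) and record.get("t") == event_type:
            matches.append((timestamp.strip(), record))
    return matches


# Two lines copied from the log above.
sample = '''2024/03/28 14:19:14 # {"route":"/","t":"app_drop"}
2024/03/28 14:19:15 # {"addr":"108.205.202.15, 10.1.131.124","route":"/","t":"ui_add"}'''

for ts, rec in find_events(sample, "app_drop"):
    print(ts, rec["route"])
```

Once an `app_drop` is logged for the route, the later `ui_add` events have no registered app to reach, which matches refreshes not helping.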

I do not have a simple repro. I can share that repeatedly clicking a button that triggers the following code, where q.client.deployment is an H2O MLOps Python client reference, will cause the error:

q.page["meta"].dialog = None
q.page["deployment"].logs.value = str("\n".join(q.client.deployment.tail_logs(15)))
await q.page.save()

mturoci commented 2 months ago

Duplicate of https://github.com/h2oai/wave/issues/2043. We had a Slack discussion about this and learned that MLOps is deployed in such a way that the Wave app pod is starved of resources (due to DAI, IIRC). Since waved sees the Wave app as unreachable (it gets an RST packet), it drops the app because it considers it dead.

This behavior can be altered by setting the H2O_WAVE_KEEP_APP_LIVE env var, but the Wave team's recommendation is to fix the deployment so that the pod is guaranteed at least the minimal resources it needs, so that the process is not suspended (and does not send an RST packet).
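
For reference, a rough sketch of the env var workaround when launching waved yourself. The value `true`, the `./waved` path, and the helper names `waved_env` and `start_waved` are assumptions for illustration, not confirmed in this thread; check the Wave configuration docs for the accepted values, and note that a managed deployment may set environment variables through its own config instead.

```python
import os
import subprocess


def waved_env(base=None):
    """Return a copy of the environment with the keep-alive variable set."""
    env = dict(os.environ if base is None else base)
    # Assumed value: the thread only names the variable, not what to set it to.
    env["H2O_WAVE_KEEP_APP_LIVE"] = "true"
    return env


def start_waved(path="./waved"):
    """Start waved (hypothetical binary path) with the modified environment."""
    return subprocess.Popen([path], env=waved_env())
```

This only stops waved from dropping an app that looks dead; it does not address the resource starvation itself.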