h2oai / wave

Realtime Web Apps and Dashboards for Python and R
https://wave.h2o.ai
Apache License 2.0

fix: Introduce request retry mechanism to Waved to retry failed requests #2035

Closed senalw closed 1 year ago

senalw commented 1 year ago

Issue: Currently, Waved disconnects the app when a request between Waved and the app fails. This PR introduces a retry mechanism that retries failed requests for a user-configurable period. If the retry count (H2O_WAVE_MAX_REQUEST_RETRY_COUNT) is not configured, the retry mechanism stays disabled.
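The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names RetryConfig, configFromEnv, sendWithRetry, and failNTimes are hypothetical, and the sketch assumes that the retry count is a positive integer, that the interval uses Go duration syntax such as 1s, and that a missing or invalid retry count disables retrying.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"time"
)

// RetryConfig is a hypothetical holder for the two settings named in the PR
// description. MaxRetries == 0 means retrying is disabled.
type RetryConfig struct {
	MaxRetries int
	Interval   time.Duration
}

// configFromEnv reads the settings through getenv (pass os.Getenv in real use).
// Missing or invalid values leave the corresponding field at zero.
func configFromEnv(getenv func(string) string) RetryConfig {
	var cfg RetryConfig
	if n, err := strconv.Atoi(getenv("H2O_WAVE_MAX_REQUEST_RETRY_COUNT")); err == nil && n > 0 {
		cfg.MaxRetries = n
	}
	if d, err := time.ParseDuration(getenv("H2O_WAVE_REQUEST_RETRY_INTERVAL")); err == nil && d > 0 {
		cfg.Interval = d
	}
	return cfg
}

// sendWithRetry calls send once and, while it keeps failing, retries up to
// cfg.MaxRetries more times, sleeping cfg.Interval before each retry.
// It returns nil as soon as one attempt succeeds.
func sendWithRetry(cfg RetryConfig, send func() error) error {
	err := send()
	for attempt := 1; err != nil && attempt <= cfg.MaxRetries; attempt++ {
		time.Sleep(cfg.Interval)
		err = send()
	}
	if err != nil {
		return fmt.Errorf("giving up after %d retries: %w", cfg.MaxRetries, err)
	}
	return nil
}

// failNTimes returns a send function that fails its first n calls and then
// succeeds. It stands in for an app that is briefly unreachable.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}
```

With this sketch, sendWithRetry(configFromEnv(os.Getenv), failNTimes(2)) would succeed on the second retry when the retry count is at least 2, and would return an error immediately when the retry count is unset.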

Testing conditions: H2O_WAVE_MAX_REQUEST_RETRY_COUNT=20 and H2O_WAVE_REQUEST_RETRY_INTERVAL=1s

Test records:

  1. Simulate a "read: connection reset by peer" disconnection for 2 seconds, then restore normal operation by deleting the iptables rule.

sleep 5 && iptables -A INPUT -p tcp --dport 8000 -j REJECT --reject-with tcp-reset && sleep 2 && iptables -D INPUT -p tcp --dport 8000 -j REJECT --reject-with tcp-reset

Result: The app recovered successfully after 2 seconds.

2023/06/23 13:59:21 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","route":"/","t":"app"}
2023/06/23 13:59:21 # {"addr":"172.17.0.1:58704","route":"/4f4b7de4-1f62-4326-b37b-855454c4c61a","t":"ui_add"}
2023/06/23 13:59:21 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"1","retry interval":"1s","route":"/","t":"app"}
2023/06/23 13:59:21 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"2","retry interval":"1s","route":"/","t":"app"}
INFO:     127.0.0.1:49624 - "POST / HTTP/1.1" 200 OK

https://github.com/h2oai/wave/assets/12801761/d0e50281-fe8b-4136-99b8-15a68cd7afc8

  2. Simulate a "read: connection reset by peer" disconnection that lasts until the retries reach H2O_WAVE_MAX_REQUEST_RETRY_COUNT and Waved drops the app.

sleep 5 && iptables -A INPUT -p tcp --dport 8000 -j REJECT --reject-with tcp-reset

Result: The app is dropped once the retries reach the maximum retry count, and the connection cannot be recovered.

2023/06/23 17:55:09 # {"error":"request failed: Post \"http://127.0.0.1:8000\": read tcp 127.0.0.1:52970-\u003e127.0.0.1:8000: read: connection reset by peer","host":"http://127.0.0.1:8000","route":"/","t":"app"}
2023/06/23 17:55:09 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"1","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:09 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"3","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:10 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"2","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:10 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"4","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:11 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"3","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:11 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"5","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:12 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"4","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:12 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"6","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:13 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"5","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:13 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"7","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:14 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"6","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:14 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"8","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:15 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"7","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:15 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"9","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:16 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"8","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:16 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"10","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:17 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"9","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:17 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"11","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:18 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"10","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:18 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"12","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:19 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"11","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:19 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"13","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:20 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"12","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:20 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"14","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:21 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"13","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:21 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"15","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:22 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"14","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:22 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"16","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:23 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"15","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:23 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"17","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:24 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"16","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:24 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"18","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:25 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"17","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:25 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"19","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:26 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"18","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:26 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"20","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:27 # {"error":"request failed: Post \"http://127.0.0.1:8000\": dial tcp 127.0.0.1:8000: connect: connection refused","host":"http://127.0.0.1:8000","max retry count":"20","retry count":"19","retry interval":"1s","route":"/","t":"app"}
2023/06/23 17:55:27 App wasn't be able to recover with 20 attempts. [Retry duration: 20s]
2023/06/23 17:55:27 # {"route":"/","t":"app_drop"}
2023/06/23 17:55:27 # {"addr":"172.17.0.1:62086","t":"ui_drop"}
2023/06/23 17:55:28 # {"addr":"172.17.0.1:62182","route":"/","t":"ui_add"}
2023/06/23 17:55:28 # {"addr":"172.17.0.1:62102","t":"ui_drop"}

https://github.com/h2oai/wave/assets/12801761/0d1a484a-9323-4bcc-a25f-7d650b69b29e

senalw commented 1 year ago

Thank you @senalw for working on this 🙏

I'm neither a Go expert nor an expert in the Wave codebase, so I'm not the right person to decide whether this PR should go in. Just a couple of comments from my side:

  • This continues to be the only release blocker for MLOps 0.62.0. We can't release our software because of it, and we're already a couple of weeks late. So any help from the Wave team @mturoci @lo5 will be extremely appreciated 🙏
  • @senalw I don't see any linked issue in the PR. Did we not create an issue? I know there have been Slack conversations, but issues are how the work is tracked, so please create one if it's not there already.
  • @senalw Since there's a Go package for everything nowadays, why implement this yourself when you could use an existing, tested, and much more configurable package?

Thanks for the suggestions. I linked the issue for this PR.

lo5 commented 1 year ago

This PR is a needlessly complicated solution to what's obviously an availability / resource-scheduling problem on your side.

If you really want the wave server to not drop the app after a TCP reset, the simple fix is to introduce a flag that will do exactly that.

e.g.

func (app *App) forward(clientID string, session *Session, data []byte) {
    if err := app.send(clientID, session, data); err != nil {
        echo(Log{"t": "app", "route": app.route, "host": app.addr, "error": err.Error()})
        // Get this global setting from an env var or arg; default true to preserve current behavior.
        if dropIfUnresponsive {
            app.broker.dropApp(app.route)
        }
    }
}

This way, the underlying http.Client will automatically re-establish the connection if possible.
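To make the effect of such a flag concrete, here is a small self-contained model of the idea. It is only a sketch under assumptions: the broker and app types below are simplified stand-ins rather than Wave's real types, and parseDropFlag is a hypothetical helper showing one way to default the setting to true when the variable is unset or malformed.

```go
package main

import (
	"errors"
	"strconv"
)

// errUnreachable simulates a failed request to the app.
var errUnreachable = errors.New("connection reset by peer")

// broker is a simplified stand-in that only tracks registered app routes.
type broker struct {
	apps map[string]bool
}

func (b *broker) dropApp(route string) { delete(b.apps, route) }

// app is a simplified stand-in; send is injected so failures can be simulated.
type app struct {
	route  string
	broker *broker
	send   func(data []byte) error
}

// forward sends data to the app. On failure the app is dropped only when
// dropIfUnresponsive is true; otherwise it stays registered, so a later
// request can reach it again once it is back.
func (a *app) forward(data []byte, dropIfUnresponsive bool) error {
	err := a.send(data)
	if err != nil && dropIfUnresponsive {
		a.broker.dropApp(a.route)
	}
	return err
}

// parseDropFlag interprets a raw setting value (for example from an env var).
// An empty or malformed value yields true, preserving drop-on-failure behavior.
func parseDropFlag(raw string) bool {
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return v
}
```

In this model, a failed forward call with the flag set to false leaves the route registered, while the same failure with the flag set to true removes it, which mirrors the current behavior.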

mturoci commented 1 year ago

Implementation of @lo5's suggestion: https://github.com/h2oai/wave/pull/2050

senalw commented 1 year ago

Closing this PR, as the fix is provided by the PR linked above.