Closed tigrannajaryan closed 2 years ago
Also, this is a slightly different reason why being throttled, but should we also mention HTTP 429 Too Many Requests
next to 503? (I believe the behavior would be the same as for 503 otherwise)
I could never figure out what is the use case for 429 when 503 achieves exactly the same from client's perspective. 429 is also supposed to return exactly the same Retry-After header. The slight nuance seems to be that 503 means server is overloaded and unable to respond properly, while 429 says that it could respond but will not because the client is exceed some sort of quota.
However, I can't think of anything that the Agent could be doing differently in response to 503 and 429, so I don't know what we gain by adding both to the spec. Perhaps the benefit is in helping the troubleshooting from the Agent side (the Agent can log the response code)?
I experimented a bit with 503 responses in Go and JS.
In Go Websocket client (WSGorilla) the response code is clearly accessible, so no problem here, it can be used to indicate throttling.
In JS the 503 response results in an "error" event. This event is indistinguishable from a connection error when the server is down. This seems to be OK, since the protocol defines that if the connection cannot be established exponential backoff should be used, which essentially almost is what we also want to happen for 503 responses (minus the ability to specify exact retry interval). So, for JS it also seems to be OK to use this approach.
We do not have another good way to convey the unavailability of the server, unless we mandate the server to accept the Websocket connection and then send a ServerErrorResponse
message with UNAVAILABLE
error. However sending the ServerErrorResponse
message is more expensive than just returning 503, the Server may simply not be able to do that.
The only alternate I can think of is just get rid of 503 response altogether and say that when overloaded the server simply should reject connections. But this would reduce the troubleshootability for other language implementations where the response code is easily accessible.
Given the above I think we should keep 503 response for throttling, since it does not cause problem for JS, is useful for troubleshooting compared to rejecting connection, and is less expensive than the ServerErrorResponse
message.
Note that we still do need ServerErrorResponse
message response because the overloading situation may occur after the WebSocket connection exists for a while. In this case we don't want to just disconnect since it will result in immediate re-connection, we also can't send an HTTP response code because it is too late, so we need to convey this information via a WebSocket message.
Any thoughts?
Coming back to 429. This response is typically tied to a particular client or group of clients. It may not be possible to meaningfully calculate the rate and send a 429 response before the Agent's instance uid is known. That happens after the HTTP response headers are sent, and by then it is too late to try to send 429 response back.
So, in practice it may be impossible to have per-Agent rate limiting and 429 responses tied to that. We may still be able to tie this to auth information which supposedly is available before the WebSocket is established. This auth is another open issue that we need to answer before we can be sure about this.
Yeah, I believe that 503 is more universal and handling it would be enough. I am just wondering if the specification should be prescriptive whether 429 could be used or not. I imagine that in certain environments there might be some 429 throttling capabilities available already and having agent supporting that would be helpful, especially since the agent would behave the same way as for 503
So perhaps we can just say that either 429 or 503 SHOULD be used for throttling and from protocol perspective there is no difference (the Agent SHOULD be ready to handle both, the Server can choose to use one or the other or both as applicable).
So perhaps we can just say that either 429 or 503 SHOULD be used for throttling and from protocol perspective there is no difference (the Agent SHOULD be ready to handle both, the Server can choose to use one or the other or both as applicable).
Yes, that's exactly what I had on mind. I can prepare a quick PR with that change
For WebSockets, I asked @kkruk-sumo for consultation (we used to use that technology extensively in one of the previous projects), maybe he will have some suggestions here
Yes, that's exactly what I had on mind. I can prepare a quick PR with that change
Please do.
I think https://github.com/open-telemetry/opamp-spec/pull/37 resolves this. Closing, can reopen if needed.
Unfortunately, JS needs a special treatment here and if it's needed to be supported, I would consider to:
@kkruk-sumo please clarify what you think will not work for JS with the spec as is. Returning 503 to JS results in a connection "error" event in the browser. The spec already defines what the client needs to do when to do when there is a connection error (exponential backoff). I think that should be sufficient for JS Agents. If you think otherwise please tell.
@tigrannajaryan I believe that "The minimum recommended retry interval is 30 seconds." from the spec is enough for JS.
The spec says that the server can respond with 503 to throttle the client.
However, reading the http response code may not be possible with some WebSocket clients, e.g. for browsers (it is disabled for security reasons).
Should we still support this throttling method or avoid it?