Open alfogrillo opened 1 year ago
Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.
Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.
I tried that just under matrix.org
tbh
Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.
I tried that just under
matrix.org
tbh
Right, do also see this on element.io or a test server?
Right, do also see this on element.io or a test server?
Tried the get profile on element.io and the response looks good:
HTTP 502 - Bad gateway
{
"errcode": "M_UNKNOWN",
"error": "Failed to fetch profile"
}
In case of the create room:
HTTP 502 - Bad gateway
{
"errcode": "M_UNKNOWN",
"error": "Can't connect to server someone"
}
(and the room created anyway afterwards)
It'd be worth noting that we did have a chat about this in a meeting recently.
Whilst it's undeniably CloudFlare's fault for making a HTML error page from 502s, this can't be turned off as far as we know.
For the more severe case of creating a room and failing to invite a user, then we currently half-create the room then return a 502 error. Regardless of this error being HTML or JSON, a reasonable client would try to re-create the room. Possibly we should just fully create the room and ignore the error.
Oh -- is the complaint that Synapse returns 502 errors which Cloudflare turns into HTML pages?
Oh -- is the complaint that Synapse returns 502 errors which Cloudflare turns into HTML pages?
That's half of it, which I think we do want to consider (possibly at a spec level :/); one suggestion might be to designate another error code as 'a remote homeserver couldn't answer'.
The other half of the issue (room creation) is likely related to https://github.com/matrix-org/synapse/issues/8895: we half-create the room but then lie and tell the client the room creation failed (causing any good client to retry, particularly for a 502 which sounds like a reverse proxy-generated error).
I'm not sure it makes sense to change Synapse's behaviour and certainly not the Matrix spec merely due to dubious behaviour by third parties.
It may be more worthwhile to ensure clients are robust enough to not choke on unexpected or invalid response bodies. (In this case since the error code is still correct, it seems like a minor issue)
On the topic of invitation failures during room creation:
It's been noted that rolling on with the room creation and returning a 200 (which is the only current response that informs the client about the created room) would be no better (particularly in DMs), since the end result would be the application accepting messages it has no hope of delivering. Whilst it's true that the broken rooms created today can also accept messages, either the user will see some messy 502 error at creation (unfortunately not specific enough to understand), or with the retry scheme: the appearance of many rooms, that they would notice that something is wrong/unreliable/suspicious. Particularly in an emergency situation, this would give them the chance to try another communication channel since it at least doesn't give any false confidence.
That sets the requirements for an improvement to be something like this:
/invite
to rectify the situation.)Given that, we'd likely want to spec something new here to represent a partial success. Here's a rough outline for what I'd do for this exact case, which shouldn't be much server-side work:
room_id
as though it was a 200
, but additionally define new field failed_invites: ["@bob:bob.com"]
or failed_invites: {"@bob:bob.com": {errcode: M_<standard error code>, errmessage: "could not reach homeserver"}}
(or similar) so the client can give direct and helpful information
As a future extension, it would be good to consider how we can represent a total failure caused by an invitation failure. What I mean is: if every single invite fails (e.g. it's a DM and the only invite fails), it'd be good if Synapse was allowed to destroy the room since it was never created usefully in the first place.
To recap the above issues, we can't return a 502 for this because meddlesome middleboxes like CloudFlare will turn it into HTML error pages (and it's also ambiguous considering that reverse proxies give 502s when the homeserver itself is down).
We need to choose a different HTTP error code. Finally, it'd be good if the error response also included the failed_invites
section here so the client knows that's the reason and can prompt the user intelligently about it.
Backwards compatibility concerns: I'd expect existing clients to treat 600 or the other 'total failure' error as error conditions and do the same thing they do today. But such a scheme would give them the chance to give the user clearer error messages or even friendlier error handling. Today, clients have no choice with the information they receive.
I'd appreciate feedback on the above scheme. Ideally we can keep this light enough so it's doable without a lot of fuss (otherwise it'll never get done). Introducing a more rigorous error framework to Matrix is somewhat of a wish of mine but it'd be too heavyweight for this. On the chance that anyone fancies thinking through this and taking it to an MSC, I'd be grateful so please feel free.
For everything else: the 502s being swallowed by CloudFlare are a pain, but if they don't cause any real pain then the client handling should be vaguely similar. Maybe we can consider changing error code if it would be useful (this would be a different MSC to the one above), but if you make a /profile
query then as a client you should probably retry it whether it was your homeserver that was down or the remote homeserver that was down.
On the topic of invitation failures during room creation: [...] It'd be good if Synapse was allowed to destroy the room since it was never created usefully in the first place.
I think transaction rollback would be the most ideal solution. I'm guessing it's a bit harder to rollback here since invites to remote users go over federation?
Can we consider certain invite failures to be completely dead in the water? For example, if we get back a non-successful status code for all invites or fail to connect to all remote servers, then we can rollback. We can't make these guarantees for no response or a timeout though as the server could have received the message and will process it later.
In those cases, we would still end up needing some partial success like you're proposing anyway but we could be seamless in a lot of cases 🤷
Rollback is close to my heart because of problems like https://github.com/matrix-org/synapse/issues/15005 where we get a half-baked room and an alias pointing back at it from /createRoom
I think transaction rollback would be the most ideal solution.
Yeah, I agree there. Without having looked into it deeply, I'd say I expect it to be quite tricky to do, though — possibly made easier by the new batch event persistence though?
Description
Calling APIs with (possibly) non existent matrix identifiers (like:
@alfotest:some
) triggers non json responses from the home server with HTTP errors 502.Examples of such APIs are:
⚠️ The issue is particularly serious with the create room API (eg. with "invite": [
@alfotest:some
]) since even if the clients gets back an HTTP 502 and a HTML response, the room creation actually succeeds.Steps to reproduce
Homeserver
matrix.org
Synapse Version
{"server_version":"1.80.0 (b=matrix-org-hotfixes,ab0a5f1972,dirty)","python_version":"3.8.12"}
Installation Method
I don't know
Database
I don't know
Workers
I don't know
Platform
I don't know
Configuration
I don't know
Relevant log output
Anything else that would be useful to know?
No response