Incorrect HTML responses when providing possibly non existent matrix identifiers

alfogrillo commented 1 year ago

Description

Calling APIs with (possibly) non existent matrix identifiers (like: @alfotest:some) triggers non json responses from the home server with HTTP errors 502.

Examples of such APIs are:

Get profile
Create Room(with at least one invited user)
Invite

⚠️ The issue is particularly serious with the create room API (eg. with "invite": [@alfotest:some]) since even if the clients gets back an HTTP 502 and a HTML response, the room creation actually succeeds.

Steps to reproduce

Open https://matrix-client.matrix.org/_matrix/client/v3/profile/@alfotest21:some

Homeserver

matrix.org

Synapse Version

{"server_version":"1.80.0 (b=matrix-org-hotfixes,ab0a5f1972,dirty)","python_version":"3.8.12"}

Installation Method

I don't know

Database

I don't know

Workers

I don't know

Platform

I don't know

Configuration

I don't know

Relevant log output

None

Anything else that would be useful to know?

No response

clokep commented 1 year ago

Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.

alfogrillo commented 1 year ago

Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.

I tried that just under matrix.org tbh

clokep commented 1 year ago

Is this only an issue with the matrix.org deployment? I'm pretty sure we no longer return HTML error pages inside of Synapse itself.

I tried that just under matrix.org tbh

Right, do also see this on element.io or a test server?

alfogrillo commented 1 year ago

Right, do also see this on element.io or a test server?

Tried the get profile on element.io and the response looks good:

HTTP 502 - Bad gateway

{
  "errcode": "M_UNKNOWN",
  "error": "Failed to fetch profile"
}

In case of the create room:

HTTP 502 - Bad gateway

{
  "errcode": "M_UNKNOWN",
  "error": "Can't connect to server someone"
}

(and the room created anyway afterwards)

reivilibre commented 1 year ago

It'd be worth noting that we did have a chat about this in a meeting recently.

Whilst it's undeniably CloudFlare's fault for making a HTML error page from 502s, this can't be turned off as far as we know.

For the more severe case of creating a room and failing to invite a user, then we currently half-create the room then return a 502 error. Regardless of this error being HTML or JSON, a reasonable client would try to re-create the room. Possibly we should just fully create the room and ignore the error.

clokep commented 1 year ago

Oh -- is the complaint that Synapse returns 502 errors which Cloudflare turns into HTML pages?

reivilibre commented 1 year ago

Oh -- is the complaint that Synapse returns 502 errors which Cloudflare turns into HTML pages?

That's half of it, which I think we do want to consider (possibly at a spec level :/); one suggestion might be to designate another error code as 'a remote homeserver couldn't answer'.

The other half of the issue (room creation) is likely related to https://github.com/matrix-org/synapse/issues/8895: we half-create the room but then lie and tell the client the room creation failed (causing any good client to retry, particularly for a 502 which sounds like a reverse proxy-generated error).

jellykells commented 1 year ago

I'm not sure it makes sense to change Synapse's behaviour and certainly not the Matrix spec merely due to dubious behaviour by third parties.

It may be more worthwhile to ensure clients are robust enough to not choke on unexpected or invalid response bodies. (In this case since the error code is still correct, it seems like a minor issue)

reivilibre commented 1 year ago

On the topic of invitation failures during room creation:

It's been noted that rolling on with the room creation and returning a 200 (which is the only current response that informs the client about the created room) would be no better (particularly in DMs), since the end result would be the application accepting messages it has no hope of delivering. Whilst it's true that the broken rooms created today can also accept messages, either the user will see some messy 502 error at creation (unfortunately not specific enough to understand), or with the retry scheme: the appearance of many rooms, that they would notice that something is wrong/unreliable/suspicious. Particularly in an emergency situation, this would give them the chance to try another communication channel since it at least doesn't give any false confidence.

That sets the requirements for an improvement to be something like this:

Must give some sort of error feedback to the user for a failed invite — so just hiding the nasty error won't do
The error should be both machine-readable and user-readable. Moreover, it should be specific.
The error shouldn't be swallowed by meddlesome middleboxes — so it can't use the 502 error code (there may be other codes we can't use too).
If a room is created, the room ID should be passed to the client, as long as the client is aware that it's not correct. (this gives the opportunity for the client to offer the user the chance to fix the invite, e.g. if 25 invitations are sent and the last one was wrong, we don't want to destroy the room. We'd instead prefer that the user has the chance to issue an /invite to rectify the situation.)

Given that, we'd likely want to spec something new here to represent a partial success. Here's a rough outline for what I'd do for this exact case, which shouldn't be much server-side work:

Define new HTTP status code 'partial success' or whatever you want to call it, let's say 600 for the sake of illustration  — I don't know if there's anything suitable already defined in the HTTP standard (research required).
Don't let invites cause a 502  (hopefully goes without saying) — continue processing the room creation despite invitation failures
If any invites fail, instead return 600 with the room_id as though it was a 200, but additionally define new field failed_invites: ["@bob:bob.com"] or failed_invites: {"@bob:bob.com": {errcode: M_<standard error code>, errmessage: "could not reach homeserver"}} (or similar) so the client can give direct and helpful information

As a future extension, it would be good to consider how we can represent a total failure caused by an invitation failure. What I mean is: if every single invite fails (e.g. it's a DM and the only invite fails), it'd be good if Synapse was allowed to destroy the room since it was never created usefully in the first place. To recap the above issues, we can't return a 502 for this because meddlesome middleboxes like CloudFlare will turn it into HTML error pages (and it's also ambiguous considering that reverse proxies give 502s when the homeserver itself is down). We need to choose a different HTTP error code. Finally, it'd be good if the error response also included the failed_invites section here so the client knows that's the reason and can prompt the user intelligently about it.

Backwards compatibility concerns: I'd expect existing clients to treat 600 or the other 'total failure' error as error conditions and do the same thing they do today. But such a scheme would give them the chance to give the user clearer error messages or even friendlier error handling. Today, clients have no choice with the information they receive.

I'd appreciate feedback on the above scheme. Ideally we can keep this light enough so it's doable without a lot of fuss (otherwise it'll never get done). Introducing a more rigorous error framework to Matrix is somewhat of a wish of mine but it'd be too heavyweight for this. On the chance that anyone fancies thinking through this and taking it to an MSC, I'd be grateful so please feel free.

For everything else: the 502s being swallowed by CloudFlare are a pain, but if they don't cause any real pain then the client handling should be vaguely similar. Maybe we can consider changing error code if it would be useful (this would be a different MSC to the one above), but if you make a /profile query then as a client you should probably retry it whether it was your homeserver that was down or the remote homeserver that was down.

MadLittleMods commented 1 year ago

On the topic of invitation failures during room creation: [...] It'd be good if Synapse was allowed to destroy the room since it was never created usefully in the first place.

I think transaction rollback would be the most ideal solution. I'm guessing it's a bit harder to rollback here since invites to remote users go over federation?

Can we consider certain invite failures to be completely dead in the water? For example, if we get back a non-successful status code for all invites or fail to connect to all remote servers, then we can rollback. We can't make these guarantees for no response or a timeout though as the server could have received the message and will process it later.

In those cases, we would still end up needing some partial success like you're proposing anyway but we could be seamless in a lot of cases 🤷

Rollback is close to my heart because of problems like https://github.com/matrix-org/synapse/issues/15005 where we get a half-baked room and an alias pointing back at it from /createRoom

reivilibre commented 1 year ago

I think transaction rollback would be the most ideal solution.

Yeah, I agree there. Without having looked into it deeply, I'd say I expect it to be quite tricky to do, though — possibly made easier by the new batch event persistence though?

matrix-org / synapse