matrix-org / matrix-federation-tester

Tester for matrix federation written in golang.

Version request is not sent to the correct host #99

Closed KoltesDigital closed 1 year ago

KoltesDigital commented 4 years ago

I'm serving a website at example.com, and Synapse at matrix.example.com. For some reason, they are on two different servers, so I can't serve Synapse on example.com:8448. Therefore I have an SRV entry to indicate the actual subdomain. The federation seems to work: FederationOK: true.

However the version request is made to the main domain instead of the subdomain, and therefore fails.

Expected behavior: the same protocol as for getting the federation, i.e. sending the request to the resolved IP with the header host: <original domain>.
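
A minimal Go sketch of that expected behaviour, for illustration only: the client connects to an explicitly chosen address while the Host header still carries the original domain. The helper names (`newPinnedClient`, `fetchVersion`, `hostSeenBy`) are hypothetical and are not part of the federation tester or gomatrixserverlib. A local test server stands in for Synapse, and plain HTTP keeps the example self-contained, whereas real federation traffic uses TLS.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

// newPinnedClient returns an HTTP client that always connects to dialAddr,
// whatever host appears in the request URL. The URL host is still used for
// the Host header, so the connection target and the Host value can differ.
func newPinnedClient(dialAddr string) *http.Client {
	dialer := &net.Dialer{}
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				// Ignore the address derived from the URL and use the resolved one.
				return dialer.DialContext(ctx, network, dialAddr)
			},
		},
	}
}

// fetchVersion requests the federation version endpoint for serverName
// through a client pinned to dialAddr and returns the response body.
func fetchVersion(serverName, dialAddr string) (string, error) {
	resp, err := newPinnedClient(dialAddr).Get("http://" + serverName + "/_matrix/federation/v1/version")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// hostSeenBy starts a local stand-in server that echoes the Host header it
// receives, then sends it a version request addressed to serverName.
func hostSeenBy(serverName string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()
	return fetchVersion(serverName, srv.Listener.Addr().String())
}

func main() {
	host, err := hostSeenBy("example.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("server saw Host:", host)
}
```

The request URL names example.com, but the TCP connection goes to the local listener, which mirrors the split between the SRV-resolved address and the original domain.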

Originally discussed in https://github.com/matrix-org/synapse/issues/6710.

This is somewhat similar to https://github.com/matrix-org/matrix-federation-tester/issues/98, but that issue does not use SRV.

richvdh commented 4 years ago

It's worth emphasising that this is specific to the version check - I missed that at first.

It looks like https://github.com/matrix-org/gomatrixserverlib/blob/7ea554ef840a2ef041997303f72337f71eca7ddd/client.go#L371 is making a regular HTTP request rather than using the federation client routing.

Omar007 commented 4 years ago

I think it's not just the version check. I have a similar setup and I'm seeing requests for 2 endpoints destined for the servername instead of to the defined URL in the SRV record; /_matrix/federation/v1/... and /_matrix/key/v2/....

However, it also seems that this behaviour in the tester might be accurate. It looks like Matrix itself is actually doing the same thing. Whether that behaviour is correct in the first place is a different/broader discussion I suppose. The reason for this latter conclusion is that when I check an external server for channels using the explore functionality within Element, it fails to list rooms (it shows a key-related error) unless I set up the server behind the servername location to re-route requests for those 2 path prefixes to the actual homeserver location.
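
The re-routing workaround described above could be sketched in Go roughly as follows. This is only an illustration of a prefix-based routing decision: the function name `shouldForward` and the variable `homeserverPrefixes` are hypothetical, and the prefix list simply mirrors the two path prefixes named in this comment rather than any official list.

```go
package main

import (
	"fmt"
	"strings"
)

// homeserverPrefixes holds the two path prefixes mentioned in the comment.
// Assumption: a real deployment may need more or fewer prefixes than these.
var homeserverPrefixes = []string{
	"/_matrix/federation/",
	"/_matrix/key/",
}

// shouldForward reports whether a request path arriving at the server name
// address should be passed on to the actual homeserver.
func shouldForward(path string) bool {
	for _, prefix := range homeserverPrefixes {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, path := range []string{
		"/_matrix/federation/v1/version",
		"/_matrix/key/v2/server",
		"/index.html",
	} {
		fmt.Printf("%-32s forward=%v\n", path, shouldForward(path))
	}
}
```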

richvdh commented 3 years ago

@Omar007 your description is unclear, but if the federation-tester is doing the same thing as synapse, that is correct behaviour. I suspect you are misunderstanding how SRV records work.

Omar007 commented 3 years ago

I'm not sure what you mean by 'unclear', but I'll try to reiterate: I'm explicitly stating that the federation tester isn't wrong/diverging from Matrix here and that this is thus (most likely) not a tester problem. Additionally, I'm noting that what the OP noticed isn't limited to the version endpoint, and that whether that behaviour is correct in the first place is a different/broader discussion (imo); a discussion at the spec/Matrix implementation level, outside the scope of the tester. This has nothing to do with the tester or with understanding how the SRV record works, but more with how/when it is utilized (or in this case, not) for specific parts of the Matrix specification.

richvdh commented 3 years ago

The reason your description is unclear is that you don't say if you are considering the Host header or the IP address, so let me try to be explicit.

Suppose you have the following DNS records:

example.com                 A    10.1.1.1
matrix.example.com          A    10.2.2.2
_matrix._tcp.example.com    SRV  10 5 8441 matrix.example.com.

(Let's also assume for simplicity that there are no .well-known/matrix/server files which add an extra layer of complexity.)

The expected behaviour is that federation requests are sent to 10.2.2.2:8441, with a Host header (and TLS SNI) of example.com. This surprises many people, but is the specified behaviour, with good reason.

As I understand it, the bug here is that the /version request is sent to 10.1.1.1:443 (with the correct Host: example.com).

I believe that is specific to the /version endpoint, it is diverging from the Matrix spec, and it is a federation tester problem.

Omar007 commented 3 years ago

> As I understand it, the bug here is that the /version request is sent to 10.1.1.1:443 (with the correct Host: example.com).
>
> I believe that is specific to the /version endpoint, it is diverging from the Matrix spec, and it is a federation tester problem.

If that is the case then fair enough, then that is probably indeed a tester only problem. That is not how I interpreted the OP originally though but that'd then be my mistake if that was what was implied there. Reading just the post and disregarding the references/links I can indeed interpret it as you describe so I must've been completely thrown off by the links to the other issues most likely.

I'm afraid it's been too long for me to recall exactly what I was seeing in relation to those other 2 endpoints though (it's been half a year..), and whether they did differ in that regard or whether it was just the end result showing similar behaviour.

> The expected behaviour is that federation requests are sent to 10.2.2.2:8441, with a Host header (and TLS SNI) of example.com. This surprises many people, but is the specified behaviour, with good reason.

I am aware that this is the specified behaviour and I was under the impression this issue was in relation to that but as I said, that may have been an interpretation error on my part. Something about an ass, u & me I suppose haha.

That said, while knowing that that is the defined behaviour, I was not aware why, which is why I said "Whether that behaviour is correct in the first place is a different/broader discussion I suppose.". But if that is with good reason and that is also documented, I'll assume there isn't anything left to discuss on that subject either (though it does seem to complicate homeserver setups for people or cause confusion so I suppose it should maybe be made more obvious/clearer or something /shrug)

Omar007 commented 2 years ago

So another half a year later, I figured I'd play around with Dendrite for a bit and run a deployment on a Kubernetes platform. Good news(?); I ran into the same problem with the endpoints I mentioned before.
Based on the previous messages, and on the new server implementation showing the exact same behaviour again, I still think it's more than likely completely as-designed behaviour. Does that make it intuitive/self-evident/obvious/easy to have your server name address and host address differ? Not really...
At least not when, unlike what the OP seems to suggest, both systems aren't separated and uniquely identifiable systems at the DNS record / IP level. If they are, it doesn't seem like it should be a problem and I don't seem to be able to reproduce the problem in a localized/internal test setup where I can just assign different IPs. (so in that context I'll not be able to provide any info and we can only hope the OP returns at some point)

The problem in my case seems to be originating from the fact that there is only a single 'front door' from the perspective of initiating the connection; the IP that leads to the Ingress Controller on the cluster which is directing the requests into the cluster to backing pods/containers.
As a result, not only the .well-known requests that are to be destined for the server name address are routed there by said controller, the aforementioned endpoints are as well. Based on the documentation I would not have expected these endpoints to also execute against the server name address instead of the resolved host address (through the SRV record or .well-known response). The Service Discovery documentation only ever talks about the .well-known endpoints. The other endpoints are all located in different chapters of the documentation.

Normally I'd probably say "it's just me having a weird/exotic setup or being stupid" but looking at how often issues pop up around here related to the split server name<->host address setup, I think it's not just me haha.
It'd probably do well to somehow have a clear overview of what endpoints will be accessed using which DNS record and which Host: value.

Context on the setup:
Ingress host: matrix.example.com
Server name: example.com
SSL certificate presented at matrix.example.com is valid for both itself and the server name example.com
DNS records:

example.com                 A    192.168.1.1
matrix.example.com          A    192.168.1.1
_matrix._tcp.example.com    SRV  0 0 443 matrix.example.com.

All of that said, assuming it is indeed as-designed, I personally still think this is a very weird design decision. It requires the server name's address to keep handling Matrix traffic itself, even after the actual host address has been resolved and after ownership of the relevant domain names has been proven by the host address presenting a certificate valid for both the server name address and the host address. That basically means you have to run the server at the server name address as well, regardless of having/wanting it running at the host address only. :man_shrugging:

EDIT: For the record, I ended up deciding to just cover all the endpoints in the whole of chapter 3 (https://matrix.org/docs/spec/server_server/latest#server-discovery) and not just the endpoints covered in chapter 3.1 which explains how matrix handles the discovery. Fairly sure this is the same as what I did at least a year or so ago now with Synapse.
Since I do have the possibility to access and capture traffic at that level and not just the DNS record for it I can do so and that seems to be sufficient but I have nothing to base that on other than "it works for me" and passes the federation test. So at the very least it's opened up wide enough but if it's too much, I don't know atm.

This is something that's nicer with Dendrite in polylith mode as at least it only means the federation-api component/container is represented/exposed at multiple addresses, not a container running the whole server, but it still means you actually need to run (part of) matrix on the server name address at all times. You can't just rely on the fallback to the SRV record or only having the .well-known endpoints captured at the server name address.

strifel commented 1 year ago

I ran into the same problem. I have not set up the well-known, as there is no webserver running on port 443.

Federation works fine, but the tester cannot load the version. It can do the checks.

From looking at the code, I think the version lookup happens before the SRV lookup?

richvdh commented 1 year ago

> It looks like https://github.com/matrix-org/gomatrixserverlib/blob/7ea554ef840a2ef041997303f72337f71eca7ddd/client.go#L371 is making a regular HTTP request rather than using the federation client routing.

This is bogus; that commit of gmsl was never merged (it was an intermediate commit in https://github.com/matrix-org/gomatrixserverlib/pull/121). The correct code is https://github.com/matrix-org/gomatrixserverlib/blob/fb4e807/client.go#L388, which looks like it uses the correct routing.

richvdh commented 1 year ago

I believe this is fixed. If anyone is still seeing it, please let us know, citing an actual hostname (not just example.com) so that we can investigate.