Cleaning up dead servers

Ekleog commented 6 years ago

I just joined #social:matrix.org, and tried to have a look at synapse's logs while it was painfully trying to join.

Basically, what I saw most often was messages like these (repeated):

synapse.http.matrixfederationclient: [] {PUT-O-120641} Sending request failed to 4ray.co: PUT matrix://4ray.co/_matrix/federation/v1/send/1522796890812/: ConnectionRefusedError('Connection refused',)
synapse.http.matrixfederationclient: [] {PUT-O-121051} Sending request failed to jomo.tv: PUT matrix://jomo.tv/_matrix/federation/v1/send/1522796891206/: NoRouteError('Network is unreachable',)
synapse.http.matrixfederationclient: [] {PUT-O-121337} Sending request failed to thebeckmeyers.xyz: PUT matrix://thebeckmeyers.xyz/_matrix/federation/v1/send/1522796891492/: TimeoutError('',)
synapse.http.matrixfederationclient: [] {PUT-O-120644} Sending request failed to matrix.home.ansorg-web.de: PUT matrix://matrix.home.ansorg-web.de/_matrix/federation/v1/send/1522796890815/: DNSLookupError('no results for hostname lookup: matrix.home.ansorg-web.de',)
[a python stacktrace for BadSignatureError: Signature was forged or corrupt]
synapse.federation.federation_base: [GET-426682] Signature check failed for $15000398313789kDXuE:half-shot.uk

So all these messages have something in common: they relate to servers that appear to no longer be alive.
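To put numbers on this rather than eyeballing the log, the failures could be tallied per host. This is only a rough sketch: the regular expression is an assumption written against the log lines quoted above, not a format synapse guarantees, and `tally_failures` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the "Sending request failed to ..." lines quoted above.
FAILED = re.compile(r"Sending request failed to (?P<host>[^:\s]+): .*: (?P<error>\w+)\(")


def tally_failures(lines):
    """Count failed federation requests per (host, error class)."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[(match["host"], match["error"])] += 1
    return counts
```

Lines that do not match the pattern (for example the signature check failure) are simply ignored.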

I don't think Matrix currently has any way of saying “this server is really dead, so let's just force-part all its users and forget about it”. I think (again, without having measured) that waiting on a large number of timeouts like this may be one of the things that slows down joining a big room.

Maybe it would be reasonable to say that once every server currently alive in a room last saw a given server more than 3 months ago (number to adjust, but 3 months seems enough to me to avoid catching accidental downtime), that server is declared dead.
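As a minimal sketch of that rule, assuming each live server can report when it last reached the candidate (the names `DEAD_AFTER` and `is_dead` are made up for illustration, and 90 days stands in for "3 months"):

```python
from datetime import datetime, timedelta

DEAD_AFTER = timedelta(days=90)  # "3 months"; number to adjust


def is_dead(last_seen_by_peer, now, threshold=DEAD_AFTER):
    """Return True if every live server last reached the candidate at least `threshold` ago.

    last_seen_by_peer maps a live server name to the datetime at which it
    last reached the candidate server.
    """
    if not last_seen_by_peer:
        return False  # no evidence at all: do not declare anything dead
    return all(now - seen >= threshold for seen in last_seen_by_peer.values())
```

A single live server that reached the candidate recently is enough to keep it alive, which is what protects against accidental downtime seen from only part of the network.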

Dead servers would be marked as such, and no signature check would be attempted for them (as someone else could come in afterwards and re-buy the domain name), newly joining servers would be warned not to try to contact them, etc.

Does what I'm saying make sense? I'm basically thinking that:

exponential backoff is nice but won't help for newly-joining servers
at some point a server could just die and its domain name be re-bought, and then we don't want to come back a year later and assume the new owner is the same as the previous one

What do you think about it?

maxidorius commented 6 years ago

Let's say that the concept is acceptable: which entity would mark servers as dead? On which criteria exactly?

ara4n commented 6 years ago

This is basically https://github.com/matrix-org/matrix-doc/issues/564

richvdh commented 6 years ago

closing as a dup

Ekleog commented 6 years ago

@richvdh @ara4n Well, if the closing is wrt. matrix-org/matrix-spec#117 it's the same underlying issue that caused discovery, but I don't think these are duplicates: the proposed solution of matrix-org/matrix-spec#117 is noting in the room state which servers are available or not, and the proposed solution of this issue is to actually consider dead servers as dead.

One big difference is in the expected behaviour: I believe a dead server's users should be force-parted (kicked?) from all the rooms they are in, because there is a significant risk that someone will just buy the domain and gain access to private channels and history from the now-dead server. Which is a huge privacy risk in my opinion.

This is not solved by matrix-org/matrix-spec#117, which would fix the performance issue though (and would be a nice in-between for [0 days of downtime, 3 months of downtime]).

@maxidor I was thinking each server would remember when other servers were last reachable. When a server pushes a message “this server is dead”, the other servers check that it's dead to them too, and accept the message only if so and if no other server sent a message “no it isn't” (to accommodate newcomers, who cannot know yet whether the said server is dead or not and will just have to trust the others).

When this message is accepted, all the users of this server for the room are force-parted.

It may be necessary to elect another admin for the room (if the only ones were on the now-dead server), in which case I guess the election could happen based on some PRNG seeded on the state of the room up to the last time the server was reachable (so that no server could cheat the PRNG by just stuffing in some messages just before the deadline expires, and so that all servers agree on who the new room admin is).
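A rough sketch of such a deterministic election follows. It is only an illustration: `elect_admin` is a made-up name, the room state is assumed to be summarised by a list of event IDs, and a SHA-256 digest is used directly as the pseudo-random draw in place of a seeded PRNG, since other homeserver implementations would not share Python's `random` module.

```python
import hashlib


def elect_admin(candidates, state_event_ids):
    """Pick the same user on every server, given the same candidates and room state.

    candidates: user IDs eligible to become admin.
    state_event_ids: IDs of the room state events up to the last time the
    dead server was reachable (an assumed way of summarising the room state).
    """
    if not candidates:
        raise ValueError("no candidates to elect from")
    # Sort both inputs so the result does not depend on local ordering.
    seed_material = "\n".join(sorted(state_event_ids)).encode("utf-8")
    draw = int.from_bytes(hashlib.sha256(seed_material).digest(), "big")
    ordered = sorted(candidates)
    return ordered[draw % len(ordered)]
```

Because the seed only covers state up to the last time the server was reachable, events sent afterwards cannot change the outcome.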

I'd say a “this server is dead” message should be accepted if the server hasn't been reachable for at least 3 months (or hasn't been reachable since joining the room, for room members present for less than 3 months), and should be sent if a server hasn't been reachable for at least 3 months and 10 days (to allow for some slack in case not all servers agree exactly on the last date of reachability; 10 days should be way more than enough).
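The two thresholds could be sketched like this. All names are made up for illustration, 90 days stands in for "3 months", and treating a newcomer's join time as the start of unreachability when deciding whether to send is an assumption, not something settled above.

```python
from datetime import datetime, timedelta

ACCEPT_AFTER = timedelta(days=90)                # "3 months"
SEND_AFTER = ACCEPT_AFTER + timedelta(days=10)   # 10 days of slack


def should_accept(last_reachable, now, objections=0):
    """Accept a "this server is dead" message from another server?

    last_reachable is None if we never reached the server since joining the room.
    objections counts "no it isn't" messages received from other servers.
    """
    if objections:
        return False
    if last_reachable is None:
        return True  # newcomer: nothing to contradict the claim, trust the others
    return now - last_reachable >= ACCEPT_AFTER


def should_send(last_reachable, joined_at, now):
    """Send a "this server is dead" message ourselves?"""
    unreachable_since = last_reachable if last_reachable is not None else joined_at
    return now - unreachable_since >= SEND_AFTER
```

With these numbers, a server that was last reachable 95 days ago would have a death notice accepted but would not yet trigger one.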

Does what I'm saying make sense? (also, do you get why I think it's not the same thing as matrix-org/matrix-spec#117, although the performance issue is the same?)

ara4n commented 6 years ago

yeah, the reason i didn't close it earlier is that the proposal (boot dead servers) is different to the track retry hints idea. a quick heuristic to do this could be to have the HS whose user created the room be responsible for tracking server health and kicking out users from dead servers, based on a state event in the room to define the kick policy. This is quite similar to previous proposals to have a state event in a room to define whether users should be kicked after being absent (as opposed to their servers being dead).
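For illustration only, such a policy could look roughly like the sketch below. The event type `org.example.dead_server_policy` and the field `kick_after_days_unreachable` are hypothetical and do not exist in the Matrix spec; `users_to_kick` is likewise a made-up helper that the tracking HS might run.

```python
# Hypothetical state event: neither the type nor the field name exists in the Matrix spec.
POLICY_EVENT = {
    "type": "org.example.dead_server_policy",
    "state_key": "",
    "content": {"kick_after_days_unreachable": 90},
}


def server_of(user_id):
    """Return the server part of a user ID such as @alice:example.org."""
    # Split on the first colon only, since a server name may carry a port.
    return user_id.split(":", 1)[1]


def users_to_kick(members, days_unreachable, policy=POLICY_EVENT):
    """List the members whose server has been unreachable for at least the policy limit.

    days_unreachable maps a server name to the number of days since the
    tracking HS last reached it (servers not listed count as reachable).
    """
    limit = policy["content"]["kick_after_days_unreachable"]
    return sorted(
        user for user in members
        if days_unreachable.get(server_of(user), 0) >= limit
    )
```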

Ekleog commented 6 years ago

Hmm, a question in this case is, what if the HS on which the room was created is the one that dies?

I guess the best behaviour is to just close the room anyway, as its UUID could be re-taken by a replacing-with-same-DNS HS, which would cause conflicts and tears, but it sounds like a bit of a loss in the decentralization promise of matrix: if I create a room for a project, pass the admin rights to someone else then move on and someday stop maintaining my matrix server, I wouldn't expect the project to be disturbed by this.

Actually maybe that's a separate issue that would need to be discussed elsewhere, like a way of changing the server to which a room's UUID is linked? There's already the possibility of adding a #foo:bar.example.org alias to a !baz:quux.example.org room (if I remember correctly), but changing the !baz:quux.example.org to !something:else.example.org could solve this non-decentralization issue by changing “the HS whose user created the room” to “the HS on the right-hand-side of the UUID of the room”, which could then be changed with this additional proposal?

maxidorius commented 6 years ago

@Ekleog Due to the basic structure of Matrix, how data "packets" are actually woven together, and some basic definitions and rules of what makes Matrix, there is no good solution short of either:

The HS(s) of the admin(s) do take care of cleaning, and only them
Making a new room

Ideas that rely on anything time-based are inherently doomed to fail or be broken in decentralized systems, or in networks where it's acceptable to have downtime (like Matrix). So any kind of "let's throw a bottle in the sea and see what comes back" is not really going to work overall, or will be super complex. As for copying a room into another, there are open tickets about it, yes, but it's again such a complex mechanism (and so prone to abuse) that it's almost impossible to do right. And finally, room IDs are opaque identifiers: you shouldn't parse them, ever. The domain name is just used as a namespace.

So... yes, I would love something like this, but I believe it's up to good room management, like any other good management of online communities. And it obviously starts with adapted tools for the job and proper community managers.

matrix-org / matrix-spec

Cleaning up dead servers #275