[feature] Method of "healing" federation with botched deployments / deployments with changed host vs account-domain

tsmethurst commented 2 months ago

Getting lots of the following:

error dereferencing account https://example.org/whatever: enrichAccountSafely: error getting account https://example.org/whatever from database after race: sql: no rows in result set

"Lots" as in "around once every ten minutes" which is more than I'd expect to see. The "no rows in result set" suggests that something is going wrong here, since if the account is in the database we should be able to select it by the URI.

I've also seen this on gts.superseriousbusiness.org: if I put an account URI into search, this error gets logged and I get no results, whereas searching for the equivalent @whatever@example.org works.

tsmethurst commented 2 months ago

I'm investigating these. Some are clearly the result of people doing botched deployments, or changing their host + account domain after already federating for a while, so they can largely be written off. I'll see if there are any more interesting ones to look at.

tsmethurst commented 2 months ago

For example there's an instance called strafpla.net which clearly used to be deployed at just strafpla.net and then redeployed at mstdn.strafpla.net a bit later, which results in these messed up entries in the database for goblin.technology:

So now when my instance dereferences (for example) the instance actor, the following happens:

1. It tries to select an account with the URI https://mstdn.strafpla.net/actor from the database, to see if we already have a copy stored.
2. It fails to retrieve it (because it's stored in the database as https://strafpla.net/actor).
3. So it thinks it's a new account.
4. Since it's a "new" account, it goes and does all the appropriate dereferencing, and then tries to insert the "new" account into the db.
5. This insert fails on the username + domain uniqueness constraint (which is good, this should fail).
6. This db.ErrAlreadyExists propagates up to enrichAccountSafely, which says "aha, this must be a race condition, I'll try to select the account by URI then, clearly someone inserted it first".
7. This select by URI also fails, because it's still trying https://mstdn.strafpla.net/actor.
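
To make the sequence above easier to follow, here's a rough, self-contained Go sketch that reproduces it against a toy in-memory store. This is not the actual GoToSocial code: `Store`, `Account`, `Enrich`, both error values and the `actor` username are stand-ins I've made up for illustration. Only the URIs and the overall flow come from the steps above.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the real database errors.
var (
	ErrNoEntries     = errors.New("sql: no rows in result set")
	ErrAlreadyExists = errors.New("already exists")
)

// Account is a stripped-down stand-in for a stored remote account.
type Account struct {
	URI      string
	Username string
	Domain   string // account domain, which may differ from the URI host
}

// Store is a toy database: a lookup by URI plus a uniqueness
// constraint on username + account domain.
type Store struct {
	byURI        map[string]*Account
	byNameDomain map[string]*Account
}

func NewStore() *Store {
	return &Store{
		byURI:        map[string]*Account{},
		byNameDomain: map[string]*Account{},
	}
}

func (s *Store) GetByURI(uri string) (*Account, error) {
	if a, ok := s.byURI[uri]; ok {
		return a, nil
	}
	return nil, ErrNoEntries
}

// Insert enforces the username + domain uniqueness constraint.
func (s *Store) Insert(a *Account) error {
	key := a.Username + "@" + a.Domain
	if _, ok := s.byNameDomain[key]; ok {
		return ErrAlreadyExists
	}
	s.byURI[a.URI] = a
	s.byNameDomain[key] = a
	return nil
}

// Enrich mimics the described flow: select by URI, insert on a miss,
// and on a uniqueness conflict assume a race and select by URI again.
func Enrich(s *Store, fetched *Account) (*Account, error) {
	// Steps 1-3: no row for this URI, so the account looks new.
	if a, err := s.GetByURI(fetched.URI); err == nil {
		return a, nil
	}

	// Steps 4-5: try to insert; this trips the uniqueness constraint.
	err := s.Insert(fetched)
	if err == nil {
		return fetched, nil
	}
	if !errors.Is(err, ErrAlreadyExists) {
		return nil, err
	}

	// Steps 6-7: assume a race and retry by the same (new) URI,
	// which still isn't in the store.
	a, err := s.GetByURI(fetched.URI)
	if err != nil {
		return nil, fmt.Errorf("error getting account %s from database after race: %w", fetched.URI, err)
	}
	return a, nil
}

func main() {
	s := NewStore()

	// Entry stored back when the instance lived at strafpla.net.
	old := &Account{URI: "https://strafpla.net/actor", Username: "actor", Domain: "strafpla.net"}
	if err := s.Insert(old); err != nil {
		panic(err)
	}

	// Same username + account domain, but now served from a new host.
	moved := &Account{URI: "https://mstdn.strafpla.net/actor", Username: "actor", Domain: "strafpla.net"}
	_, err := Enrich(s, moved)
	fmt.Println(err)
}
```

Running this ends in the "after race" error wrapping the no-rows error, because no URI lookup can ever find the row that's holding the username + domain slot.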

tsmethurst commented 2 months ago

I suppose we could introduce some logic to try to "heal" such bodged deployments in our database. I'm not sure how exactly we should go about this, though.
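
Purely as a thought experiment, one shape such healing could take: when the post-conflict URI lookup misses, fall back to the username + account domain that tripped the constraint, and re-point the stored row at the freshly dereferenced URI. The Go sketch below assumes a toy in-memory store; `Store`, `HealURI` and the `actor` username are hypothetical names, and nothing here is existing GoToSocial behaviour. It also assumes the new URI has already been verified as belonging to that username@domain (e.g. via webfinger), since otherwise this would be an account takeover vector.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel error standing in for the real "no rows" error.
var ErrNoEntries = errors.New("sql: no rows in result set")

// Account is a stripped-down stand-in for a stored remote account.
type Account struct {
	URI      string
	Username string
	Domain   string
}

// Store is a toy database indexed by URI and by username@domain.
type Store struct {
	byURI        map[string]*Account
	byNameDomain map[string]*Account
}

func NewStore(accounts ...*Account) *Store {
	s := &Store{
		byURI:        map[string]*Account{},
		byNameDomain: map[string]*Account{},
	}
	for _, a := range accounts {
		s.byURI[a.URI] = a
		s.byNameDomain[a.Username+"@"+a.Domain] = a
	}
	return s
}

func (s *Store) GetByURI(uri string) (*Account, error) {
	if a, ok := s.byURI[uri]; ok {
		return a, nil
	}
	return nil, ErrNoEntries
}

// HealURI re-points the account stored under username@domain at newURI.
// ASSUMPTION: the caller has already verified that newURI really is the
// current actor URI for username@domain.
func (s *Store) HealURI(username, domain, newURI string) (*Account, error) {
	a, ok := s.byNameDomain[username+"@"+domain]
	if !ok {
		return nil, ErrNoEntries
	}
	if a.URI == newURI {
		return a, nil // nothing to heal
	}
	if _, taken := s.byURI[newURI]; taken {
		return nil, fmt.Errorf("uri %s already belongs to another account", newURI)
	}

	// Swap the URI index entry over to the new URI.
	delete(s.byURI, a.URI)
	a.URI = newURI
	s.byURI[newURI] = a
	return a, nil
}

func main() {
	s := NewStore(&Account{URI: "https://strafpla.net/actor", Username: "actor", Domain: "strafpla.net"})

	healed, err := s.HealURI("actor", "strafpla.net", "https://mstdn.strafpla.net/actor")
	if err != nil {
		panic(err)
	}

	_, oldErr := s.GetByURI("https://strafpla.net/actor")
	fmt.Println(healed.URI, errors.Is(oldErr, ErrNoEntries))
}
```

A real version would also have to deal with everything else hanging off the old URI (inbox, public key, statuses, follows), which is a big part of why it's not obvious how to go about this.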

tsmethurst commented 2 months ago

Mmm, indeed: after looking through the logs a bit more, I haven't yet found any examples that don't point to a changed deployment, deploying and then wiping and deploying again, etc.

I'll change this to a feature request to consider a way of healing federation with botched deployments.

tsmethurst commented 2 months ago

We could probably do something like a complete "purge" command where we not only block a domain and clear all relationships with accounts on that domain, but also fully delete account entries from that domain, instead of just stubbing them, to allow it to refederate as though it were a completely new domain.
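
As a very rough illustration of the difference between stubbing and fully purging, here's a Go sketch against a toy in-memory store. `Store`, `Follow`, `PurgeDomain` and the sample usernames are hypothetical, not an existing GoToSocial command or API. It also assumes accounts are matched on their account domain rather than on the host part of their URI.

```go
package main

import "fmt"

// Account is a stripped-down stand-in for a stored remote account.
type Account struct {
	URI      string
	Username string
	Domain   string
}

// Follow is a minimal relationship between two accounts, by URI.
type Follow struct {
	FromURI string
	ToURI   string
}

// Store is a toy database.
type Store struct {
	accounts map[string]*Account // keyed by URI
	follows  []Follow
	blocked  map[string]bool // blocked domains
}

// PurgeDomain blocks the domain, removes every relationship involving
// its accounts, and deletes the account rows outright instead of
// leaving stubs behind. It returns how many accounts and follows were
// removed.
func (s *Store) PurgeDomain(domain string) (removedAccounts, removedFollows int) {
	s.blocked[domain] = true

	// Delete the account entries, remembering which URIs went away.
	purged := map[string]bool{}
	for uri, a := range s.accounts {
		if a.Domain == domain {
			purged[uri] = true
			delete(s.accounts, uri)
		}
	}

	// Drop any relationship that touches a purged account.
	kept := s.follows[:0]
	for _, f := range s.follows {
		if purged[f.FromURI] || purged[f.ToURI] {
			removedFollows++
			continue
		}
		kept = append(kept, f)
	}
	s.follows = kept

	return len(purged), removedFollows
}

func main() {
	s := &Store{
		accounts: map[string]*Account{
			"https://strafpla.net/actor":      {URI: "https://strafpla.net/actor", Username: "actor", Domain: "strafpla.net"},
			"https://strafpla.net/users/a":    {URI: "https://strafpla.net/users/a", Username: "a", Domain: "strafpla.net"},
			"https://example.org/users/bob":   {URI: "https://example.org/users/bob", Username: "bob", Domain: "example.org"},
			"https://example.org/users/carol": {URI: "https://example.org/users/carol", Username: "carol", Domain: "example.org"},
		},
		follows: []Follow{
			{FromURI: "https://example.org/users/bob", ToURI: "https://strafpla.net/users/a"},
			{FromURI: "https://example.org/users/bob", ToURI: "https://example.org/users/carol"},
		},
		blocked: map[string]bool{},
	}

	accounts, follows := s.PurgeDomain("strafpla.net")
	fmt.Println(accounts, follows, len(s.accounts), len(s.follows))
}
```

With the rows gone rather than stubbed, the username + domain slots are free again, so once the block is lifted the domain's accounts could be inserted fresh under whatever URIs they use now.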

superseriousbusiness / gotosocial

[feature] Method of "healing" federation with botched deployments / deployments with changed host vs account-domain #3264