jointakahe / takahe

An ActivityPub/Fediverse server
BSD 3-Clause "New" or "Revised" License
1.1k stars 84 forks source link

username uniqueness interpretation issue between database and code #664

Open osmaa opened 8 months ago

osmaa commented 8 months ago

Encountered an error, extract from Sentry below:

Identity.MultipleObjectsReturned get() returned more than one Identity -- it returned 2! 
users/models/identity.py in by_username_and_domain at line 412
407                        username__iexact=username,
408                        domain_id=domain,
409                        local=True,
410                    )
411                else:
412                    return cls.objects.get(
413                        username__iexact=username,
414                        domain_id=domain,
415                    )
416            except cls.DoesNotExist:
417                if fetch and not local:

https://github.com/jointakahe/takahe/blob/main/users/models/identity.py#L412

the object in question: an actor presented in the database twice, once with Capitalized username, once with lowercase. Hard to say why the remote server changed the username presentation, but that's out of our control. The search above uses iexact (case insensitive) search, while the username column uses PostgreSQL's default collation, which is case sensitive, so the database didn't enforce the uniqueness as the code expected.

I couldn't find a WebFinger spec mention of username case sensitivity, but Mastodon treats them as case insensitive - so the code appears correct, but the database schema is not.

Possible solutions:

  1. redefine the users_identity (username, domain_id) unique index as (lower(username), domain_id), and quite possibly also a unique lower(domain) index for users_domain (the primary key index won't guarantee case insensitivity) as well as changing the service_domain unique index to lower(service_domain)
  2. create a custom, "nondeterministic" collation in the database init and use that for the column(s) which are expected to be case-insensitive, alter all relevant columns
  3. enforce the same with insert triggers
andrewgodwin commented 8 months ago

This is meant to be enforced at the code level, in that it's meant to try and look up usernames before insertion to see if they exist (that's what that code block supposedly does) but the transaction isolation clearly isn't good enough to actually pull that off.

It might be worth changing the unique indexes to enforce this just as a backstop against transaction race conditions, though those will also need to be fixed to actually work fully, or users will get error pages when this happens rather than polluting the database as they do now (which is an improvement, but not a total fix).

osmaa commented 8 months ago

The transaction isolation is good enough to allow this situation, because two concurrent transactions can pass the code block without running into each others' records prior to commit -- and some of the fetch_actor related paths take fairly long (especially before the sync_pins part was isolated). Because the index doesn't enforce case-insensitive uniqueness, both records get commited.

Changing the indexing will enforce the uniqueness independent of transaction isolation, so I'd say that would be the correct action. Users shouldn't see error pages, because it will enforce just one record existing, and the iexact search will then find it. If errors occur, they'd be occuring in stator, not the the UI, and a TryAgain should resolve them as one of the commits should remain..

A clean migration is difficult, though. Many tables refer to users_identity, and if data loss was to be avoided, all those references would need to be updated to point to the record which will be left after deleting duplicates and adding the new unique constraints.

Or, if we simply accept that this bug will cause (recoverable) data loss for remote identities, cascade-delete all the offending records first, and then change the indices.

In any event, while I know how to do this in SQL, I don't know how to do it in Django migrations...

I deleted the offending record earlier, but I still appear to have one which wasn't caught before. The second case appears to be a bunch of identity records will NULL usernames and domain_ids - a separate issue, perhaps.

takahe=# select count(id) dupes,lower(username),lower(domain_id) from users_identity group by lower(username),lower(domain_id) having count(id) > 1;
 dupes |   lower   |  lower   
-------+-----------+----------
     2 | fediverse | lemmy.ml
   797 |           | 
(2 rows)

takahe=# select id,actor_uri,username,name,domain_id from users_identity where lower(username)='fediverse' and lower(domain_id)='lemmy.ml';
         id         |          actor_uri           | username  |   name    | domain_id 
--------------------+------------------------------+-----------+-----------+-----------
 197408691327720714 | https://lemmy.ml/u/Fediverse | Fediverse |           | lemmy.ml
 201262153399299226 | https://lemmy.ml/c/fediverse | fediverse | Fediverse | lemmy.ml
(2 rows)
create temporary table migrate_users_identity
as
with b as (
    select min(id) id, lower(username) username, lower(domain_id) domain_id from users_identity
    group by lower(username), lower(domain_id) having count(*) > 1 
)
select a.* from users_identity a join b on (
    lower(a.username) = b.username and lower(a.domain_id) = b.domain_id 
    and a.id > b.id);

delete from users_identity a using migrate_users_identity b
where a.id = b.id;

create unique index lowercase_actor on users_identity (lower(username),lower(domain_id));

drop table migrate_users_identity;
andrewgodwin commented 7 months ago

Yeah, this is a pretty nasty problem, because if you take the "correct" activitypub view - that everyone is identified purely by their Actor IRI - then it's actually fine to have two things on the same domain with the same username, as those aren't really first-class concepts in AP (and mostly come from WebFinger and the username field on Actor and a guessed domain).

It might be worth re-examining the whole usage of this function - which really should just be to link mentions and power searches - and see if it's acceptable to have multiple results and just take the newest one.

osmaa commented 7 months ago

Right.. So this becomes a question of is Takahe an ActivityPub implementation (thus keying on actor_uri), or a "fediverse" implementation where both WebFinger and ActivityPub are relevant and discovery is based on WebFinger. I'd argue that in the latter case, though the specs are unclear, we should fall back on convention and expect that username local-parts are case insensitive. In any case, I've been running my instance with the additional unique lowercase index on users_identity for the last couple of weeks with no noticeable adverse effects.

osmaa commented 7 months ago

I pinged the fedi wisdom on this, but so far haven't gotten any closer to a clear spec. Apparently it is undefined whether acct scheme usernames should be treated case insensitive or not (but Mastodon is case insensitive), and failing any specs to the contrary, the default for URLs (which "all" of the fedi servers use for actor id IRIs) will have case-sensitive path parts. Lemmy is particularly stupid by not mapping a canonical form of id even through its WebFinger endpoint.

https://mas.to/@osma/111504184001363691

andrewgodwin commented 7 months ago

Yup, welcome to the world of trying to get fediverse agreement on things!

At its heart, Takahē is much more of an ActivityPub server than specifically caring about WebFinger, so I would rather keep things centered around a canonical Actor IRI and then if we need username mapping, doing it case-insensitive as WebFinger would on Mastodon.