Closed: evanp closed this issue 8 months ago.
Per RFC 7565 linked above (and its dependency RFC 3986):

```abnf
acctURI  = "acct" ":" userpart "@" host
userpart = unreserved / sub-delims
           0*( unreserved / pct-encoded / sub-delims )
```

The username starts with at least 1 character from `unreserved` or `sub-delims`, followed by between 0 and infinity characters from the set of `unreserved`, `pct-encoded`, and `sub-delims`.
For reference:

```abnf
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
```

As regex character classes: `unreserved` is `[A-Za-z0-9~_\-\.]`, `sub-delims` is `[!&',;=\$\(\)\*\+]`, and `pct-encoded` is `%[0-9A-F]{2}`.
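Putting the grammar and the regex fragments together, a hypothetical validator might look like the following sketch (the constant and function names are mine, not from any implementation; it also accepts lowercase hex in percent-encodings, which URIs treat as equivalent):

```python
import re

# Character classes transcribed from RFC 3986; names are mine.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# RFC 7565 userpart: one unreserved/sub-delims character, then zero
# or more unreserved / pct-encoded / sub-delims characters.
USERPART_RE = re.compile(
    "^[" + UNRESERVED + SUB_DELIMS + "]"
    "(?:[" + UNRESERVED + SUB_DELIMS + "]|" + PCT_ENCODED + ")*$"
)

def is_valid_userpart(s: str) -> bool:
    return USERPART_RE.fullmatch(s) is not None
```

So `alice` and `a%20b` pass, while an empty string, a raw space, or a non-ASCII character without percent-encoding do not.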
HOWEVER: Mastodon further constrains usernames with its USERNAME_RE
regex and its UniqueUsernameValidator class.
Looking at USERNAME_RE:

```ruby
USERNAME_RE = /[a-z0-9_]+([a-z0-9_.-]+[a-z0-9_]+)?/i
```

Here we see that Mastodon only allows alphanumerics and underscores. It additionally allows dots and dashes in the middle of the username, but not at the beginning or end. In ABNF, this would be like:

```abnf
username = word *( rest )
word     = ALPHA / DIGIT / "_"
rest     = *( extended ) word
extended = word / "." / "-"
```
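To see what this accepts and rejects in practice, here is the same regex transcribed into Python (my transcription; anchored with `fullmatch` since Mastodon applies it to the entire username):

```python
import re

# Mastodon's USERNAME_RE, transcribed from Ruby.
USERNAME_RE = re.compile(r"[a-z0-9_]+([a-z0-9_.-]+[a-z0-9_]+)?", re.IGNORECASE)

def mastodon_valid(username: str) -> bool:
    # fullmatch: the whole string must satisfy the pattern.
    return USERNAME_RE.fullmatch(username) is not None
```

For example, `a.b` and `a-b` pass, while `.ab`, `ab.`, and a bare `-` do not.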
The username is matched case-insensitively, and case-insensitivity is further enforced by the UniqueUsernameValidator:

```ruby
normalized_username = account.username.downcase
normalized_domain   = account.domain&.downcase

scope = Account.where(
  Account.arel_table[:username].lower.eq normalized_username
).where(
  Account.arel_table[:domain].lower.eq normalized_domain
)
```

Or, in other words, Mastodon will check that the downcased username and domain do not already exist in the local database.
Misskey enforces a 1-20 character `\w` (word characters, equivalent to `ALPHA / DIGIT / "_"`) limit on local usernames:
There is also a limit of 128 characters on usernames, and 128 characters on host:

```typescript
@Column('varchar', {
  length: 128,
  comment: 'The username of the User.',
})
public username: string;

@Index()
@Column('varchar', {
  length: 128, nullable: true,
  comment: 'The host of the User. It will be null if the origin of the user is local.',
})
public host: string | null;
```
When searching for usernames, Misskey enforces the same downcasing behavior as Mastodon:
a 20 character limit seems excessively small.
it's only 20 for local usernames. although i think if i'm reading this correctly then the database limits in misskey's schema are:
Appears Pleroma does this for remote users: https://git.pleroma.social/pleroma/pleroma/-/blob/develop/lib/pleroma/user.ex#L519

```elixir
|> validate_format(:nickname, @email_regex)
```

The regexes used are like so:

```elixir
@email_regex ~r/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
@strict_local_nickname_regex ~r/^[a-zA-Z\d]+$/
@extended_local_nickname_regex ~r/^[a-zA-Z\d_-]+$/
```
So it appears to be more permissive than Mastodon -- alphanumeric plus more symbols (not just dots dashes and underscores).
For local users, it appears you can have dashes and underscores at any point in the username, including at the beginning or the end.
As a consequence: it appears a username of `-` will be valid in local Pleroma but not in remote Mastodon?
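This edge case can be checked directly by transcribing both patterns into Python (my transcription; the originals are Elixir and Ruby respectively):

```python
import re

# Pleroma's extended local nickname rule vs. Mastodon's USERNAME_RE.
pleroma_extended = re.compile(r"^[a-zA-Z\d_-]+$")
mastodon = re.compile(r"[a-z0-9_]+([a-z0-9_.-]+[a-z0-9_]+)?", re.IGNORECASE)

def accepted_by(name: str) -> tuple[bool, bool]:
    """Return (valid in Pleroma's extended rule, valid in Mastodon)."""
    return (
        pleroma_extended.match(name) is not None,
        mastodon.fullmatch(name) is not None,
    )
```

`accepted_by("-")` indeed comes out as valid for Pleroma's extended rule but invalid for Mastodon.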
Not entirely related, but RFC 2821 (for SMTP) describes email addresses as having a max 64 characters for the local-part and 255 characters for the domain.
Pixelfed allows alphanumeric, dots, dashes, and underscores, but will skip creating or updating a profile for a username containing any other character:
https://github.com/pixelfed/pixelfed/blob/dev/app/Util/ActivityPub/Helpers.php#L771-L776
```php
// skip invalid usernames
if(!ctype_alnum($res['preferredUsername'])) {
    $tmpUsername = str_replace(['_', '.', '-'], '', $res['preferredUsername']);
    if(!ctype_alnum($tmpUsername)) {
        return;
    }
}
```
Kitsune allows characters beyond those set by the RFC (although hidden behind a feature flag for exactly that reason).
We technically allow all unicode characters and unicode digits (because who wouldn't want their username to be all written in Hangul?).
But again, just behind a feature flag. By default it only allows ASCII letters, numbers, underscores, dashes, and dots. So very much RFC compliant.
That makes it a little complicated.
Because WebFinger is mentioned above: I would not assume that the `userpart` of WebFinger always matches the `preferredUsername`.
> a 20 character limit seems excessively small.
I think in some scripts it might seem unreasonably long!
I'd like to start boiling this down to recommendations for publishers and consumers to maximize interoperability.
For publishers:
For consumers:
What do we think?
I don't see why usernames can't be internationalised.
I have a demo user @你好@i18n.viii.fi which is able to interact with some ActivityPub servers. There are rather a lot of people who don't use the Latin alphabet to write with. It would be great to allow them to use their own languages when writing.
Further thoughts and comments at https://shkspr.mobi/blog/2024/02/internationalise-the-fediverse/
Similarly, lots of people have really long names. I appreciate that services like email typically limit to a set number of characters - so I agree with @trwnh that it might be better to follow existing standards rather than arbitrarily set other limits.
> For publishers: […]

- Maximum length of 64 chars (I think that's the low end limit I'm seeing here?)

> For consumers: […]

- Maximum length 128 chars (?) or no limit (?)
It should be noted that if we are going to support internationalized usernames, the definition of "char" would get less obvious: Is that the byte-length of the percent-encoded UTF-8 string? Or the number of Unicode code points? Or the number of graphemes? …

And if we adopt the byte-length of the percent-encoded UTF-8 string as the definition of "chars", 64 chars would only be capable of representing 7 Chinese characters (plus one additional `unreserved` / `sub-delims` character), which doesn't look quite reassuring to me.
> It should be noted that if we are going to support internationalized usernames, the definition of "char" would get less obvious
If any limit is to be applied, we should, in my opinion, use extended grapheme clusters, since those are the closest to what we humans would interpret as "characters".
Unicode codepoints (i.e. scalars) are a little less accurate but more accurate than bytes, and bytes are just completely unsuitable to find the length of a string if you consider anything besides ASCII.
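To make the divergence concrete, here is a small illustration (grapheme counting needs a UAX #29 implementation, such as the third-party `regex` module's `\X`, so only the stdlib-computable lengths are shown):

```python
# One user-perceived character ("woman technologist" emoji), built
# from three code points joined by a zero-width joiner.
s = "\U0001F469\u200D\U0001F4BB"

byte_len = len(s.encode("utf-8"))  # UTF-8 byte length
codepoint_len = len(s)             # number of Unicode code points
# Under extended grapheme cluster rules this string is 1 "character".

print(byte_len, codepoint_len)
```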
Edit: But this also needs some more careful implementation around the counting of characters, such as either a secondary byte limit or codepoint limit on the implementer side.
Reasoning behind this: counting graphemes isn't the fastest operation you can perform. 500KB consisting of one "a" and ~600k zero-width joiners takes ~2ms in an optimized Rust build.
While this in itself doesn't seem too bad, it's still significant when compared to counting codepoints, which takes 45µs on the same payload.
Also, this is a Rust implementation built with `-O3`; if we take this to a Java or Ruby implementation, the speed will regress in a similar way, it will just enter unacceptable territory much faster.
Especially considering the ~600k codepoints only took up ~550KB in space and body limits are much larger than that (usually 4x larger), that can technically evolve into a DoS vector if not handled carefully enough.
The whole point of webfinger usernames is that people can use them in plain text environments to mention other users, and figuring out "which set of Unicode inputs can people easily type using their keyboard and IME" is an almost impossible to solve problem. And yes, although you're very glib about the risk in your blog post, there are very serious security concerns implied by nonprintables, uncanonicalized input, and confusables. Even URLs, which have an i18n standard, unlike webfinger, very rarely show the Unicode form to users. This is a hard-fought lesson from the browser security space—we shouldn't be so eager to throw away the same lesson here.
All accounts have a `name` that can be translated and supports the full range of Unicode. Requiring a computer-readable and unique ascii `username` as a secondary identifier is not a large imposition and it will be familiar to almost every frequent user of computing systems. I am not aware of any extant social network with wide adoption that has a concept of unique usernames and allows non-ascii usernames. One option would be to have a unique user ID or use a user's phone number for their identifier (like Signal), which would be a completely acceptable preferredUsername for webfinger resolution purposes.
> I am not aware of any extant social network with wide adoption that has a concept of unique usernames and allows non-ascii usernames.

As I wrote in https://github.com/w3c/activitypub/issues/395#issuecomment-1787229756, Weibo is an example of such a platform. (It depends on the definition of "wide adoption", though.)
Edit: Weibo's 用户修改昵称规则 (User Nickname Modification Rule) states the following:
一、昵称修改格式要求: 4-30个字符,支持中英文、数字、下划线和减号 *注:一个汉字为2个字符
English translation:
1. Nickname modification format requirements: 4-30 characters, supporting Chinese and English, numbers, underscores and minus signs *Note: One Chinese character is 2 characters.
(I don't want to experiment with a real account as they, er, review profile changes for some reason, and they allow only a few username changes in a certain period.)
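If I read the rule correctly, a wide character counts as 2 and everything else as 1. A guess at how that counting could be implemented (the exact character classification is my assumption, based on Unicode's East Asian Width property):

```python
import unicodedata

def weibo_length(nickname: str) -> int:
    # Wide ("W") and Fullwidth ("F") characters count as 2,
    # everything else as 1 -- my guess at Weibo's counting rule.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in nickname
    )
```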
Thanks. I wasn't sure if a Weibo `screen_name` was used as a unique identifier or not, since the main identifier used throughout user URLs is the `uid`. But it appears that at least in some APIs the `screen_name` is used as a helpful and complete identifier.
Frankly, I don't think that new systems should use webfinger at all, and I don't really relish the prospect of extending such a system to permit non-ascii identifiers before having more robust non-uniqueness guarantees in place.
> It should be noted that if we are going to support internationalized usernames, the definition of "char" would get less obvious
Whatever fits into an SQL varchar column of length 128? At least, that's what Misskey seems to do, and I'm not sure if any other platforms have implicit limits on length like this. I assume that means unicode codepoints since most databases are encoded as UTF-8? (Note that I am not actually sure about this.)
> The whole point of webfinger usernames is that people can use them in plain text environments to mention other users, and figuring out "which set of Unicode inputs can people easily type using their keyboard and IME" is an almost impossible to solve problem.
Terence's opinion in the blog post ("Maybe not being found by people who can't type your language is a feature, not a bug") might sound a bit radical, but I think it's actually a reasonable attitude to use a set of characters that's only easy to type for the user's primary audience. You might find the attitude exclusive, but I don't see how that's more exclusive than speaking a language that's alien to the majority of the world.
Also, the situation could be improved with the aid of a more clever autocompletion system, although that may mean a compromise for the design goal of being able to use them in plain text environments.
Latin alphabet usernames with diacritical marks can trivially be suggested from their counterparts without the diacritical marks (e.g. @andre to @André). Non-Latin phonograms can be suggested from transliterated Latin alphabets (e.g. @akiko to @あきこ). Even Chinese characters (mostly) correspond to single pronunciations (at least in Mandarin Chinese), so they are likely to be suggested from transliterated Latin alphabets (e.g. @nihao to @你好).
The autocompletion only works if the server knows the target user, but you can copy-and-paste the username at the first time and you are not likely to need to repeat it after that, which I think isn't bad UX. (Even with ASCII Latin alphabets, it's not quite a good idea to try to type the username all by yourself instead of copy-and-paste-ing, if the autocompletion doesn't work.)
Admittedly, not all scripts can be trivially suggested from Latin alphabets. For example, many Japanese kanji have a many-to-many correspondence with their pronunciations, so it's not as easy to suggest them from Latin alphabets as with Chinese hanzi. Converting the Mandarin Chinese username @你好 to @nihao is easy, because "你" only reads "ni" and "好" only reads "hao" regardless of the context (if you ignore the tone), so all the server needs to do is store the transliterated username ("nihao") for autocompletion purposes. On the other hand, converting the Japanese username @石井健蔵 to @ishiikenzou isn't that easy, because "石" may read "ishi" or "seki" or …, "井" may read "i" or "sei" or …, and so on, depending on the context.
In systems that expect to process Japanese names, it's a common practice to ask users to input "phonetic" names along with their kanji names. Perhaps ActivityPub actors can likewise have secondary usernames (used for discovery, but not for display purposes), using a mechanism like `xrd:Alias`?
> And yes, although you're very glib about the risk in your blog post, there are very serious security concerns implied by nonprintables, uncanonicalized input, and confusables.
IIUC, "homograph attacks" are only applicable if the target server of the attack accepts sign-ups by untrusted users. So I think it's fine to have an administrator configuration to opt into local non-ASCII usernames for single-user servers, and always accept remote non-ASCII usernames on the assumption that it's the remote servers' responsibility to restrict characters that may be harmful in their setup or otherwise moderate bad actors who pretend to be other users on the same server.
And as for nonprintable characters, RFC 7565 (The ’acct’ URI Scheme) warns about/forbids them:
https://datatracker.ietf.org/doc/html/rfc7565#section-5
Implementers are advised to disallow percent-encoded characters or sequences that would (1) result in space, null, control, or other characters that are otherwise forbidden, […]
https://datatracker.ietf.org/doc/html/rfc7565#section-6
Before applying any percent-encoding, an application MUST ensure the following about the string that is used as input to the URI-construction process:
- The userpart consists only of Unicode code points that conform to the PRECIS IdentifierClass specified in [RFC7564].
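The section 5 advice can be sketched as a decode-then-scan check. This is only a fragment of what PRECIS requires (the full IdentifierClass has more rules); the category list and function name here are my own:

```python
import unicodedata
from urllib.parse import unquote

# General categories covering space, control, and format characters,
# which RFC 7565 section 5 advises rejecting after percent-decoding.
FORBIDDEN_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}

def decoded_userpart_is_safe(userpart: str) -> bool:
    decoded = unquote(userpart)
    return not any(
        unicodedata.category(ch) in FORBIDDEN_CATEGORIES for ch in decoded
    )
```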
> Even URLs, which have an i18n standard, unlike webfinger, very rarely show the Unicode form to users.
Is that really the case? Web browsers' address bar (which I think is one of the most prominent places where end users see URLs) display the Unicode form of an IDN. Also, as a layperson, I feel the punycode/percent-encoded form to be more confusing and harder to distinguish from similarly named labels, but that's irrelevant to this discussion.
Also, while I agree that the Latin script is the most widely accessible input method in a technical sense, I doubt if it's always a safe way of expressing one's name. For example, romanization systems of Chinese characters are subject to debate in Taiwan (although I'm not an expert of the issue):
https://en.wikipedia.org/wiki/Chinese_language_romanization_in_Taiwan
I have an instance of Mastodon where I test IDN domain names - see e.g. @north@ꩰ.com (where I have several patches applied to fix issues). I own quite a few single character IDNs, so I'm all too familiar with how browsers and other software render them. Mastodon, for example, won't convert https://ꩰ.com to a link; I have to instead use the canonical form xn--8r9a.com. (we'll see what GitHub does here... edit: good job GitHub!). Discord and Slack forcefully converts links to Punycode. Signal is quite happy with them.
Browsers have policies (you can see Chrome's at https://chromium.googlesource.com/chromium/src/+/main/docs/idn.md) that are fairly similar to each other. When it's not deemed unsafe to do so, such as when using Unicode "confusables", browsers will generally render IDNs as the Unicode character.
Yes. Browsers have a thousand lines of code dedicated to getting this right, at least 4 different spec or pseudo-spec documents (UTS 46, UTS 39, RFC 5892, and the Google IDN documentation), an always-on reporting service (Google Safe Browsing), global analytics (a constantly updated cached list of the global top 500 sites), and on-device analytics (Site Engagement Service) to make IDN labels safe. I don't think the CG can in good faith recommend people go down the route of providing IDN identifiers unless they're willing to implement checks at least as strict.
For example, a popular attack on Mastodon a few years ago involved a webfinger identifier that looked like "@.***(lots of empty / space unicode characters).evil.server", making an account from evil.server look like it was actually from mastodon.social (as long as users didn't notice the ellipses). This same kind of confusability would also be present in usernames if we relaxed this restriction.
IDN has been a thing for about... a decade, and some non-Latin worlds have started using IDN for various purposes (this even includes traditional e-mail -- fortunately, IDN helped those countries to migrate from ugly DNS hacks installed on telco's servers to proper, standardized IDN)
Not allowing IDN and rendering them as ***@xn--9n2bp8q.xn--9t4b11yi5a in the Fediverse will upset some local communities who are willing to run Fediverse software on their IDN, and I don't think it is a good thing to do.
Safe IDN is hard (because human language is HARD), but I think it is worth it.
Well, I feel like the IDN-specific discussion is getting off topic here. While IDNs have some similarities to internationalized `acct:` usernames, not all the points stand for our topic. I think it's nice to clarify the similarities and differences.
As pointed out by perillamint, IDNs are already widespread and they have plenty of legitimate use cases, so just always displaying their gobbledegook forms would be troublesome. Admittedly, this argument doesn't stand for `acct:` usernames, which don't have many real-world use cases yet. But if we need to implement an IDN display algorithm anyway, the point about the difficulty of implementing it weakens, since the logic is expected to be reusable for usernames.
And as I wrote in https://github.com/swicg/activitypub-webfinger/issues/9#issuecomment-1975138984, the scope of homograph attacks is different: while the IDN homograph attack allows a threat actor to impersonate someone else's (remote) domain, the attack against usernames is only applicable to local users, and the applicability depends on the server's setup.
> What do we think?
there is

> The userpart consists only of Unicode code points that conform to the PRECIS IdentifierClass specified in [RFC7564].

in https://www.rfc-editor.org/rfc/rfc7565.html#section-6, as @tesaguri and @trwnh already mentioned.
P.S.: `preferredUsername` is mentioned as a property containing 'natural language' being subject to i18n: https://www.w3.org/TR/activitypub/#h-note-2. It's undefined how that relates to the webfinger local part.
I suppose that at least we all agree that the report shouldn't just advise every server to accept arbitrary Unicode usernames (including whitespace, control characters, etc.) without any security considerations. But it's not obvious where the best compromise resides between the two extremes, i.e., accept everything vs. ASCII-only (though the latter is one of the viable options). And I think the ambiguity is getting the discussion a little out of focus, as the security implications (among other concerns) would significantly differ depending on the specifics of the approach.
So I'd like to show my personal vision of the requirements to have regarding this topic (others may have different visions, of course):
Also, a guidance like the following may be a nice addition:
… which is analogous to Web browsers' behavior when an IDN doesn't pass the display algorithm.
Personally, I don't think we should have such a guidance because I believe the remote server should have the discretion/responsibility to determine what's appropriate for its users. For example, I've seen a person whose handle name is intentionally a mixture of Latin script and another kind of script, which isn't likely to pass an IDN display algorithm. If they launched a single-user server and set the handle name as the username of their account, there would be no problem as far as security is concerned. (Not that I personally admire it. It's a big headache for accessibility! But so is non-camelCased usernames, leet usernames, etc., and displaying percent-encoded forms wouldn't be a solution to accessibility concerns either.)
Also, by making it the responsibility of producing servers of non-ASCII usernames, consuming implementations don't need to pay the extra cost to aggressively reject potentially malicious usernames if they don't choose to produce non-ASCII usernames by themselves (though I believe that they should implement something similar for IDNs either way), while producing servers can opt into allowing non-ASCII usernames by implementing a precaution or simply limiting user registrations, which is a significant step forward from the status quo without sacrificing security.
isn't our case what https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax could be a good fit for?
> isn't our case what https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax could be a good fit for?
Essentially, yes. This is what Kitsune is using already (well, almost: Kitsune uses the reduced `\p{L}\p{N}` set, and allows dots and underscores because, well, usernames).
This is a rather conservative set of characters and won't allow you to break everything with control characters.
If you then put it into a column using Level 1 Unicode collation, you've got a rather robust system against breakage and impersonation through confusables (since the collation will consider "a" and "ä" and "á" to be equal).
Edit: Here is an example of how the DIS would be implemented. It seems like this could work for Mastodon and Misskey usernames, but personally I'd prefer if there was a little more openness(?) for the usernames (mainly, I'd like to see support for dots, but that's just me, I guess 🤷♀️): https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a2eb8118dacec801bc28f62ec3584699
what puzzles me, is that quite a few comments here read as if https://datatracker.ietf.org/doc/html/rfc7565#section-6 wouldn't exist.
If there are implementations that chose to ignore this standard (and sure there are and other standards as well), what do we expect from inventing yet another one?
edit: wrong section, off by one :-)
> what puzzles me, is that quite a few comments here read as if https://datatracker.ietf.org/doc/html/rfc7565#section-6 wouldn't exist.
Well, that is for WebFinger only. Technically you could use something like punycode or URL-encoding within WebFinger. There are quite a few ways to achieve both arbitrary usernames and adherence to the `acct` scheme restrictions.
But yeah, the combination of `^[\p{L}\p{N}\-\._]+$` (this is the regex form; playground here: https://regex101.com/r/5s0q87/1) would be within the already defined standard (if I read it correctly), and would allow for all kinds of nice international usernames.
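Python's stdlib `re` doesn't support `\p{...}` classes, so as a sketch, the same set can be approximated by checking Unicode general categories directly (my approximation, not how Kitsune implements it):

```python
import unicodedata

def matches_i18n_username(s: str) -> bool:
    # Approximates ^[\p{L}\p{N}\-\._]+$: every character must be a
    # Unicode letter (L*), number (N*), or one of "-", ".", "_".
    if not s:
        return False
    return all(
        unicodedata.category(ch).startswith(("L", "N")) or ch in "-._"
        for ch in s
    )
```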
(Going off the assumption that I interpreted the standard right) It seems like even the implementations that already support i18n usernames follow the RFC you linked somehow.
Either by URL-encoding the usernames or by restricting it to a subset of the allowed character classes.
Going off of how Kitsune does it, we chose the latter route by simply restricting the usernames to pretty much the regex I wrote above. That way it adheres to the standard and allows for usernames like 사용자이름 or i18n_ユーザー名.
> isn't our case what https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax could be a good fit for?
Note that the Default Identifier Syntax you linked doesn't accept some portion of the usernames accepted by existing implementations, such as `42UL` and `x-1`, which are generally inappropriate as identifiers in programming languages, but not necessarily so as usernames.
If we are to adopt the Unicode identifier syntax, I think we need to have a custom profile modified from the default identifier syntax. In particular, to allow what Mastodon accepts today (https://github.com/swicg/activitypub-webfinger/issues/9#issuecomment-1931344738), we would need the following extensions (as suggested by the UAX in Table 3 and 3a):

- adding `[\p{Nd}_]` (to the Start characters)
- adding `[.-]` (to the Medial characters)

Regarding https://github.com/swicg/activitypub-webfinger/issues/9#issuecomment-1980045402:

> - Implementations MUST apply a Unicode normalization (NFKD or NFKC) before comparing usernames
On reflection, the explicit requirement for Unicode normalization isn't necessary because the PRECIS IdentifierClass already implies NFKC by disallowing characters in the `HasCompat` category.
Then, the only major security concern not covered by the WebFinger and `acct:` URI scheme specs is the homograph attack, I suppose?
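For illustration, comparing NFKC-normalized forms shows the kind of equivalence this implies, e.g. fullwidth compatibility characters folding to their ASCII counterparts:

```python
import unicodedata

def same_username(a: str, b: str) -> bool:
    # Compare usernames after NFKC normalization; compatibility
    # characters like fullwidth letters normalize to ASCII forms.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```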
Wait @trwnh, why did you close this issue as completed? The very valid points that were brought up in this issue are not at all resolved in b127874.
@wakest i don't fully remember, but if there are issues with the report in its current state then please file new tickets for those
EDIT: also don't forget the other tagged commits https://github.com/swicg/activitypub-webfinger/commit/3f2bf40993a239c596d6b7d15d2898c749c3564d and https://github.com/swicg/activitypub-webfinger/commit/648ae7665d77bbf163fe9279e1886ba2c597f579 -- perhaps one of those addresses your concerns?
> How does the acct: URI format constrain the structure of `preferredUsername`?

https://www.rfc-editor.org/rfc/rfc7565.html#section-7