Open gordonbrander opened 2 years ago
My intuition is that we should err on the side of supporting unicode as written.
OTOH I'm not well-versed in the history that led to percent-encoding in URLs. I suspect there are good technical reasons to consider, so maybe worth some study. Additional consideration: how copy-paste-able would a non-percent-encoded slashlink be in a traditional web browser?
I plan to spend some time reviewing URL specs to understand the history of URL percent-encoding. (I wonder if it was a backwards-compat thing? URLs may have preceded widespread unicode support.)
Points in favor of leaning into URL syntax as-spec'd:
@cdata flagged IRIs, which expand the URI grammar to include most Unicode characters and have a backwards-compatible encoding scheme: https://en.m.wikipedia.org/wiki/Internationalized_Resource_Identifier
This seems like a compelling path forward that solves for both unicode, and conformability with URL standards.
Decision: support Unicode characters in slashlinks, just like we do everywhere else. Just do whatever we need to do on the backend to encode/percent-escape when converting to URL form.
This probably means defining a fully-qualified URL grammar for slashlinks, along the lines of `sphere://`, and rules for encoding Unicode as URL. I'm guessing there are IETF specs for this and we should just use them.
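For illustration, here is a rough sketch of what "encode/percent-escape when converting to URL form" could look like, using Python's standard `urllib.parse`. The function names are hypothetical, and this only covers the path portion; it makes no assumption about what follows the scheme in a fully-qualified slashlink URL.

```python
from urllib.parse import quote, unquote


def slashlink_to_url_path(slashlink: str) -> str:
    """Percent-encode a Unicode slashlink for use as a URL path (hypothetical helper)."""
    # quote() encodes the text as UTF-8 and escapes every byte outside the
    # unreserved ASCII set; safe="/" keeps the path separators intact.
    return quote(slashlink, safe="/")


def url_path_to_slashlink(path: str) -> str:
    """Decode a percent-encoded path back to its Unicode form (hypothetical helper)."""
    return unquote(path)


encoded = slashlink_to_url_path("/大")
print(encoded)                         # /%E5%A4%A7
print(url_path_to_slashlink(encoded))  # /大
```

ASCII-only slashlinks pass through unchanged, so authors could keep writing Unicode as-is and the escaped form would only show up at the URL boundary.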
Following up on https://github.com/gordonbrander/subtext/issues/25#issuecomment-1020480396.
Slashlinks currently allow only ALPHA, DIGIT, `-`, `_` and `/` in their grammar.
If we take hashtags as prior art, we can see that Twitter allows only Unicode alphanumeric and underscore characters in hashtags. So this works: `#大`, but this does not: `#䷊`.
We could update the basic grammar for slashlinks (in regex terms) to:
In programming environments that support Unicode, the `\w` character class should include Unicode `Letter` and `Number` classes (A subset? Or the whole set? I'm not sure.).
On the other hand, URLs disallow unicode characters, and require them to be percent encoded (see https://datatracker.ietf.org/doc/html/rfc5234 and https://stackoverflow.com/questions/2742852/unicode-characters-in-urls). Most browsers convert these percent-encoded values to the actual characters for display purposes.
If slashlinks are like simplified truncated URLs, should we follow the same path as browsers? Or should we follow the hashtag approach?
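To make the hashtag-style option concrete, here is a small sketch in Python, where `\w` on `str` patterns is Unicode-aware by default. The pattern below is a hypothetical stand-in, not necessarily the grammar proposed in this thread, and other regex engines may define `\w` differently.

```python
import re

# Hypothetical grammar (an assumption for illustration): a leading slash
# followed by one or more word characters, hyphens, or slashes.
SLASHLINK = re.compile(r"/[\w\-/]+")


def is_slashlink(text: str) -> bool:
    """Return True if the whole string matches the hypothetical grammar."""
    return SLASHLINK.fullmatch(text) is not None


print(is_slashlink("/大"))              # True: a Unicode letter
print(is_slashlink("/䷊"))              # False: a symbol, not a letter or number
print(is_slashlink("/foo/bar-baz_1"))   # True: the existing ASCII grammar still works
```

This mirrors the hashtag behavior described above: letters and numbers from any script are accepted, while symbol characters are rejected.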