subconsciousnetwork / subtext

Markup for note taking
Apache License 2.0

Consider how to handle unicode characters in slashlinks #28

Open gordonbrander opened 2 years ago

gordonbrander commented 2 years ago

Following up on https://github.com/gordonbrander/subtext/issues/25#issuecomment-1020480396.

Slashlinks currently allow only ALPHA, DIGIT, -, _ and / in their grammar.

If we take hashtags as prior art, we can see that Twitter allows only Unicode alphanumeric and underscore characters in hashtags. So this works: #大, but this does not: #䷊.

We could update the basic grammar for slashlinks (in regex terms) to:

/[\w\-_/]+

In programming environments that support Unicode, the \w character class should include the Unicode Letter and Number classes (a subset, or the whole set? I'm not sure).
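As a rough check of the pattern above, here is a minimal sketch in Python 3, where \w on str patterns matches Unicode word characters (letters, digits, and underscore). The names `SLASHLINK` and `is_slashlink` are made up for illustration and are not part of Subtext. Other environments differ: in JavaScript, for example, \w is ASCII-only, so the result depends on the regex engine.

```python
import re

# Hypothetical name; mirrors the grammar proposed above.
# \w already covers "_" in Python, so it is not listed separately.
SLASHLINK = re.compile(r"/[\w\-/]+")


def is_slashlink(text: str) -> bool:
    """Return True if the whole string matches the slashlink pattern."""
    return SLASHLINK.fullmatch(text) is not None


print(is_slashlink("/notes/2022-01"))  # ASCII letters, digits, "-" and "/"
print(is_slashlink("/大"))             # CJK ideograph, a Unicode letter
print(is_slashlink("/䷊"))             # hexagram symbol, not a letter or digit
```

With this pattern the first two calls return True and the last returns False, which lines up with the hashtag behavior described above.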

On the other hand, URLs disallow non-ASCII characters and require them to be percent-encoded (see https://datatracker.ietf.org/doc/html/rfc5234 and https://stackoverflow.com/questions/2742852/unicode-characters-in-urls). Most browsers convert these percent-encoded values back to the actual characters for display purposes.
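To make the round trip concrete, here is a small sketch using Python's standard `urllib.parse`. The example path is made up. `quote` encodes the string as UTF-8 and percent-escapes the non-ASCII bytes, leaving "/" alone by default; `unquote` reverses it, roughly what a browser does for display.

```python
from urllib.parse import quote, unquote

path = "/大/notes"  # made-up example path

# Encode as UTF-8, then percent-escape each non-ASCII byte.
# "/" is in quote()'s default safe set, so separators survive.
encoded = quote(path)

# Decode the percent-escapes back to the original characters.
decoded = unquote(encoded)

print(encoded)  # /%E5%A4%A7/notes
print(decoded)  # /大/notes
```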

If slashlinks are like simplified truncated URLs, should we follow the same path as browsers? Or should we follow the hashtag approach?

gordonbrander commented 2 years ago

My intuition is that we should err on the side of supporting unicode as written.

OTOH I'm not well-versed in the history that led to percent-encoding in URLs. I suspect there are good technical reasons to consider, so maybe worth some study. Additional consideration: how copy-paste-able would a non-percent-encoded slashlink be in a traditional web browser?

I plan to spend some time reviewing URL specs to understand the history of URL percent-encoding. (I wonder if it was a backwards-compat thing? URLs may have preceded widespread unicode support.)

gordonbrander commented 2 years ago

Points in favor of leaning into URL syntax as-spec'd:

gordonbrander commented 2 years ago

@cdata flagged IRIs, which expand the URI grammar to include most Unicode characters and have a backwards-compatible encoding scheme: https://en.m.wikipedia.org/wiki/Internationalized_Resource_Identifier

This seems like a compelling path forward that solves for both Unicode support and conformance with URL standards.

gordonbrander commented 1 year ago

Decision: support Unicode characters in slashlinks, just like we do everywhere else. Just do whatever we need to do on the backend to encode/percent-escape when converting to URL form.

This probably means defining a fully-qualified URL grammar for slashlinks, along the lines of sphere://, and rules for encoding Unicode as URL. I'm guessing there are IETF specs for this and we should just use them.
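A rough sketch of what that conversion could look like, in Python. Everything here is an assumption for illustration: the base `sphere://example`, the function names, and the slashlink pattern are all hypothetical, since the fully-qualified grammar isn't defined yet. It only shows the shape of "Unicode as written, percent-escaped in URL form".

```python
import re
from urllib.parse import quote, unquote, urlsplit

# Hypothetical grammar and base URL; the real ones are not yet specified.
SLASHLINK = re.compile(r"/[\w\-/]+")
BASE = "sphere://example"


def slashlink_to_url(slashlink: str, base: str = BASE) -> str:
    """Percent-escape a Unicode slashlink and attach it to a base URL."""
    if SLASHLINK.fullmatch(slashlink) is None:
        raise ValueError(f"not a valid slashlink: {slashlink!r}")
    return base + quote(slashlink)


def url_to_slashlink(url: str) -> str:
    """Recover the Unicode slashlink from the path of a URL."""
    return unquote(urlsplit(url).path)


url = slashlink_to_url("/大/notes")
print(url)                    # sphere://example/%E5%A4%A7/notes
print(url_to_slashlink(url))  # /大/notes
```

Validating before encoding keeps the written form and the URL form in one-to-one correspondence, at least for links the grammar accepts.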