subconsciousnetwork / subtext

Markup for note taking

URLs cannot be followed by punctuation #25

Open gretchenfrage opened 2 years ago

gretchenfrage commented 2 years ago

I'm tinkering with subtext (here's a small parser/formatter/html-formatter I made, if you're curious), and I noticed something. The way slashlinks are defined prevents them from being followed by punctuation, such as periods or commas. To demonstrate how natural it is to follow slashlinks with such punctuation, the main example in this repo's README does that:

# Heading

Plain text.

- List item
- List item

> Quoted text

URLs like https://example.com are automatically linked.

You can also link to local pages using short /slashlinks.

Unless I'm interpreting this incorrectly (which I might be), the final /slashlinks. fails to parse as a slashlink because it is followed by neither whitespace nor EOF, which is what the regex in the spec requires: (^|\s)(/[a-zA-Z0-9/\-\_]+)($|\s). To get the example to parse correctly, I have to insert a space before the period, so that it reads "/slashlinks ." instead.
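One way to check this reading is to run the regex quoted from the spec against the README line. The following is a minimal sketch using Python's `re` module; the names `SPEC_SLASHLINK` and `find_slashlinks` are made up for illustration and do not come from any Subtext implementation.

```python
import re

# Slashlink pattern as quoted from the spec: the link must be
# followed by whitespace or the end of the input.
SPEC_SLASHLINK = re.compile(r"(^|\s)(/[a-zA-Z0-9/\-\_]+)($|\s)")

def find_slashlinks(line):
    """Return the slashlink text (capture group 2) of every match."""
    return [m.group(2) for m in SPEC_SLASHLINK.finditer(line)]

readme_line = "You can also link to local pages using short /slashlinks."
spaced_line = "You can also link to local pages using short /slashlinks ."

print(find_slashlinks(readme_line))  # []
print(find_slashlinks(spaced_line))  # ['/slashlinks']
```

The README line yields no match because the character after the link is a period, which is neither whitespace nor the end of the line.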

Typical URLs have a similar but different complication: with the defined parsing strategy, trailing punctuation will be considered part of the URL. For example, if I try to parse the subtext:

You can find that at https://crouton.net.

The parser interprets that as a link to "https://crouton.net." (with the trailing period included in the URL).
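The URL behaviour can be reproduced with a small sketch. The spec's exact URL rule is not quoted in this thread, so the pattern below is an assumption: a URL is taken to be `http://` or `https://` followed by everything up to the next whitespace. The names `NAIVE_URL` and `find_urls` are illustrative only.

```python
import re

# Assumption: a URL runs from the scheme to the next whitespace character.
# This stands in for the spec's strategy; it is not a quotation of it.
NAIVE_URL = re.compile(r"https?://\S+")

def find_urls(line):
    """Return every whitespace-delimited URL found in the line."""
    return NAIVE_URL.findall(line)

print(find_urls("You can find that at https://crouton.net."))
# ['https://crouton.net.']
```

Under that assumption the sentence-ending period is swallowed into the link target.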

I'm not sure which solution is best, or if solving this would cause more complexity than it's worth. But at least in the case of slashlinks, changing the regex to (^|\s)(/[a-zA-Z0-9/\-\_]+)($|\s|\.|,) allows a slashlink to be followed by periods and commas specifically. Of course, that's not the most elegant solution: it's a hard-coded edge case, and perhaps an anglocentric one as well.
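For comparison, here is a minimal sketch of the modified regex in Python. It shows both what the change fixes and where the hard-coded list runs out; the names `PROPOSED_SLASHLINK` and `find_slashlinks_proposed` are illustrative, not part of the spec.

```python
import re

# The regex proposed above: a trailing period or comma is now accepted
# in addition to whitespace and end of input.
PROPOSED_SLASHLINK = re.compile(r"(^|\s)(/[a-zA-Z0-9/\-\_]+)($|\s|\.|,)")

def find_slashlinks_proposed(line):
    """Return the slashlink text (capture group 2) of every match."""
    return [m.group(2) for m in PROPOSED_SLASHLINK.finditer(line)]

print(find_slashlinks_proposed("link to local pages using short /slashlinks."))
# ['/slashlinks']
print(find_slashlinks_proposed("See /foo, then /bar."))
# ['/foo', '/bar']
print(find_slashlinks_proposed("Is it /here?"))
# []
```

The last call returns nothing because a question mark is not in the hard-coded list, which illustrates the edge-case concern.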

gordonbrander commented 2 years ago

@gretchenfrage you're right. Will revise spec to fix.

gordonbrander commented 2 years ago

I think dropping the trailing word-boundary condition should fix it. Going to check on my Swift Subtext parser implementation, which works as expected. I think I must have fixed this in code, but forgot to revise the spec.

Update: the Swift parser mostly works, but trips up on brackets. It looks like I need to look at my logic a little more closely.

gordonbrander commented 2 years ago

Related Unicode block: General Punctuation, https://en.wikipedia.org/wiki/General_Punctuation

Hashtags as prior art: Twitter allows only alphanumeric and underscore characters in hashtags. It does seem to allow Unicode alphanumerics (not just ASCII). https://stackoverflow.com/questions/14823376/what-characters-are-allowed-in-twitter-hashtags

This works: #大
This does not: #䷊

gordonbrander commented 2 years ago

Posit: we should update the basic grammar for slashlinks to (in regex terms):

/[\w\-_/]+

That resolves the trailing punctuation issue, and also clarifies that slashlinks are allowed to contain non-ASCII Unicode word characters.
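A rough sketch of how this grammar behaves, in Python. Two assumptions are made here: the leading `(^|\s)` boundary from the original spec regex is kept (only the trailing condition is dropped), and `\w` has the Unicode-aware meaning it has for Python 3 string patterns. The names `WORD_SLASHLINK` and `find_word_slashlinks` are illustrative.

```python
import re

# Assumption: keep the spec's leading boundary and drop only the trailing one.
# In Python 3, \w on str patterns matches Unicode word characters.
WORD_SLASHLINK = re.compile(r"(^|\s)(/[\w\-_/]+)")

def find_word_slashlinks(line):
    """Return the slashlink text (capture group 2) of every match."""
    return [m.group(2) for m in WORD_SLASHLINK.finditer(line)]

print(find_word_slashlinks("link to local pages using short /slashlinks."))
# ['/slashlinks']
print(find_word_slashlinks("Notes live at /大/notes, mostly."))
# ['/大/notes']
print(find_word_slashlinks("See https://example.com for more"))
# []
```

Trailing punctuation is simply left outside the match, so no list of permitted closing characters is needed. Regex engines differ in how they define `\w`, so the Unicode behaviour shown here is specific to Python.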

Update: on second thought, URLs disallow non-ASCII Unicode characters and require them to be percent-encoded (see https://datatracker.ietf.org/doc/html/rfc5234 and https://stackoverflow.com/questions/2742852/unicode-characters-in-urls). Most browsers convert these percent-encodings back to the actual characters for display purposes.
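To make the percent-encoding round trip concrete, here is a small sketch using Python's standard `urllib.parse` module (percent-encoding itself is described in RFC 3986). The path `/大/notes` is a made-up example, not something from the spec.

```python
from urllib.parse import quote, unquote

# Hypothetical slashlink-style path containing a non-ASCII character.
path = "/大/notes"

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters;
# "/" is left alone because it is in the default safe set.
encoded = quote(path)
print(encoded)  # /%E5%A4%A7/notes

# unquote() reverses the encoding, much like a browser does for display.
decoded = unquote(encoded)
print(decoded)  # /大/notes
```

If slashlinks followed URLs here, the encoded form would be what gets stored or transmitted, and the decoded form what the reader sees.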

If slashlinks are like simplified truncated URLs, should we follow the same path? Going to file a separate issue to track the unicode question.

gordonbrander commented 2 years ago

Re-opening. Realize the URL syntax must be refined as well. URLs have different requirements to slashlinks.
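As one possible direction for the URL side, trailing sentence punctuation could be trimmed after matching. This is only a sketch under stated assumptions, not the spec's rule: URLs are matched as a scheme plus a run of non-whitespace, the set of trimmed characters is an arbitrary choice, and brackets (mentioned earlier as a trouble spot) are deliberately not handled. The names `URL_CANDIDATE`, `TRAILING_PUNCTUATION`, and `find_trimmed_urls` are hypothetical.

```python
import re

# Assumption: a URL candidate is a scheme followed by non-whitespace characters.
URL_CANDIDATE = re.compile(r"https?://\S+")

# Assumption: these characters are treated as sentence punctuation
# when they appear at the very end of a candidate.
TRAILING_PUNCTUATION = ".,;:!?"

def find_trimmed_urls(line):
    """Return URL candidates with trailing sentence punctuation removed."""
    return [m.group(0).rstrip(TRAILING_PUNCTUATION)
            for m in URL_CANDIDATE.finditer(line)]

print(find_trimmed_urls("You can find that at https://crouton.net."))
# ['https://crouton.net']
print(find_trimmed_urls("See https://example.com/a.b, or ask."))
# ['https://example.com/a.b']
```

Periods inside the URL are preserved because only the end of each candidate is trimmed. A URL that legitimately ends in one of these characters would be truncated, which is the trade-off of this approach.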