Non-ASCII directive names

micromark / micromark-extension-directive

micromark extension to support generic directives (`:cite[smith04]`)

https://unifiedjs.com

MIT License


Non-ASCII directive names #23

Open viktor-yakubiv opened 8 months ago

viktor-yakubiv commented 8 months ago

Initial checklist

[X] I read the support docs
[X] I read the contributing guide
[X] I agree to follow the code of conduct
[X] I searched issues and couldn’t find anything (or linked relevant results below)

Problem

I write text files using an extended markdown syntax with a flavour for specific needs. Those text files are not in Latin script. I want to keep them in a uniform language without formatting prompts in English.

Markdown in general appears to have a language-independent syntax. ASCII-limited directives bring language-dependence.

Specific example

I am a Ukrainian speaker, creating a project for the local community with no internationalisation need in the future. I want to keep files in my native language as much as possible and have syntax as simple as possible. My text files are songs. Sometimes, they contain a chorus that repeats after each verse (paragraph). Take a timely example:

```
Dashing through the snow
In a one-horse open sleigh
O'er the fields we go
Laughing all the way
Bells on bob tail [sic] ring
Making spirits bright
What fun it is to ride and sing
A sleighing song tonight! Oh!

:::chorus
Jingle bells, jingle bells,
Jingle all the way.
Oh! what fun it is to ride
In a one-horse open sleigh. Hey!
Jingle bells, jingle bells,
Jingle all the way;
Oh! what fun it is to ride
In a one-horse open sleigh.
:::

A day or two ago
I thought I'd take a ride
And soon, Miss Fanny Bright
Was seated by my side,
The horse was lean and lank
Misfortune seemed his lot
He got into a drifted bank
And then we got upsot.

A day or two ago,
The story I must tell
I went out on the snow,
And on my back I fell;
A gent was riding by
In a one-horse open sleigh,
He laughed as there I sprawling lie,
But quickly drove away. Ah!

Now the ground is white
Go it while you're young,
Take the girls tonight and sing this sleighing song;
Just get a bobtailed bay
Two forty as his speed
Hitch him to an open sleigh
And crack! you'll take the lead.
```

My custom script detects the chorus and repeats it after each paragraph. However, `chorus` in Ukrainian is `приспів` and I would love to keep that native word in a Ukrainian text.

Solution

Configurable naming limitations.

Alternatives

Find and replace before directive parsing.
A forked parser with a patch
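
For illustration, the first alternative could look roughly like the sketch below. The `aliases` map and the `translateDirectiveNames` function are hypothetical names, not part of micromark or this extension, and the pattern assumes that only container directives (a line starting with three or more colons) need translating.

```js
// Hypothetical mapping from localized directive names to ASCII ones.
const aliases = new Map([['приспів', 'chorus']])

// Rewrite the name that directly follows a run of 3+ colons at the start
// of a line. Closing fences (`:::` alone) have no name and are left as-is.
function translateDirectiveNames(source) {
  return source.replace(/^(:{3,})([^\s:\[{]+)/gmu, (match, fence, name) =>
    aliases.has(name) ? fence + aliases.get(name) : match
  )
}

const input = 'Перший куплет\n\n:::приспів\nРядок приспіву\n:::\n'
console.log(translateDirectiveNames(input))
```

A textual pass like this knows nothing about markdown structure, so it would also rewrite a matching line inside a fenced code block. That is one reason it is only a workaround.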

viktor-yakubiv commented 8 months ago

Here I shared my specific problem. I don't object to the current implementation with the imposed limitations backed up with solid reasoning in the readme about spacing and trailing colons.

I would love to understand the rationale behind limiting the directive naming.

ChristianMurphy commented 8 months ago

@wooorm may be able to offer more context. From reviewing the description/spec (https://talk.commonmark.org/t/generic-directives-plugins-syntax/444), I believe the intent is to be roughly compatible with HTML/custom element naming conventions (https://html.spec.whatwg.org/multipage/custom-elements.html#valid-custom-element-name, https://developer.mozilla.org/en-US/docs/Web/API/CustomElementRegistry/define#valid_custom_element_names), which require the sequence to start with an ASCII character (the difference being that directives do not require a dash).

wooorm commented 8 months ago

The reason the current state is the way it is, is so that I didn’t have to decide.

Custom elements look like a good thing to be compatible with. Although I don’t think a) the `-`, b) the disallowed uppercase, or c) the disallow list (`font-face` and such) needs to be enforced. That is to say: it’s not bad if we allow some names that aren’t strictly compatible with HTML custom elements.

I wonder whether we need to enforce the disallowed ASCII punctuation/symbols though. I can see `$` being useful, as it’s in JS too. Putting say `(` or `'` or `/` or `;` in there seems weird. Although, as HTML allows many of those characters in attribute names, perhaps we can allow them too? Otherwise we should have different handling for “tag” names and “attribute” names.

Maybe simplest is to allow all unicode characters that are not unicode whitespace? https://github.com/micromark/micromark/blob/929275e2ccdfc8fd54adb1e1da611020600cc951/packages/micromark-util-character/dev/index.js#L232

viktor-yakubiv commented 8 months ago

@wooorm and @ChristianMurphy thank you for sharing your details. I also have assumed ~custom-elements~ (rather) HTML elements naming convention but I wanted to clarify this. If this is not a strict requirement, I would appreciate a change.

Thinking of a potential solution, character ranges listed in the HTML standard for custom element names seem to be reasonable to me. The PCENChar (potential custom element name character) is quite wide; it seems to allow all "alphabets", including characters needed in my case.

PCENChar ::=
  "-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

Yet, it is beyond the proposed simplest solution and still enforces some limits. What do you think?
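
As a quick check of the claim that PCENChar covers the characters needed here, the ranges quoted above can be turned into a lookup. This is only a sketch: `pcenRanges` and `isPcenChar` are names made up for this example, and the table is a direct transcription of the production quoted in this comment.

```js
// Inclusive code point ranges transcribed from the PCENChar production.
const pcenRanges = [
  [0x2d, 0x2d], // "-"
  [0x2e, 0x2e], // "."
  [0x30, 0x39], // [0-9]
  [0x5f, 0x5f], // "_"
  [0x61, 0x7a], // [a-z]
  [0xb7, 0xb7],
  [0xc0, 0xd6],
  [0xd8, 0xf6],
  [0xf8, 0x37d],
  [0x37f, 0x1fff],
  [0x200c, 0x200d],
  [0x203f, 0x2040],
  [0x2070, 0x218f],
  [0x2c00, 0x2fef],
  [0x3001, 0xd7ff],
  [0xf900, 0xfdcf],
  [0xfdf0, 0xfffd],
  [0x10000, 0xeffff]
]

// Hypothetical helper: is the first code point of `char` a PCENChar?
function isPcenChar(char) {
  const code = char.codePointAt(0)
  return pcenRanges.some(([from, to]) => code >= from && code <= to)
}

// Spreading a string yields whole code points, not UTF-16 code units.
console.log([...'приспів'].every(isPcenChar)) // true
console.log(isPcenChar('A')) // false: uppercase ASCII is not in the production
console.log(isPcenChar(' ')) // false
```

The Cyrillic letters in `приспів` all fall into the `[#x37F-#x1FFF]` range, which is why the production is wide enough for this case.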

Script I used to preview ranges

I am not knowledgeable about Unicode character ranges, so I asked ChatGPT what the range numbers mean (extended Latin, Japanese, Greek, Cyrillic, etc.) and reviewed the list manually using a script.

```js
// Skipped because they are plain ASCII:
// "-"
// "."
// [0-9]
// "_"
// [a-z]
const chars = []
// `String.fromCodePoint` is needed for the last range: `String.fromCharCode`
// truncates values above 0xFFFF to 16 bits.
chars.push(String.fromCodePoint(0xB7))
for (let i = 0xC0; i <= 0xD6; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xD8; i <= 0xF6; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xF8; i <= 0x37D; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x37F; i <= 0x1FFF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x200C; i <= 0x200D; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x203F; i <= 0x2040; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x2070; i <= 0x218F; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x2C00; i <= 0x2FEF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x3001; i <= 0xD7FF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xF900; i <= 0xFDCF; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0xFDF0; i <= 0xFFFD; ++i) chars.push(String.fromCodePoint(i))
for (let i = 0x10000; i <= 0xEFFFF; ++i) chars.push(String.fromCodePoint(i))
console.log(chars.join('\n'))
```

wooorm commented 8 months ago

Some more considerations:

Allowing most custom element names is indeed nice, but it’s not a goal to only support custom element names. Directives are not only useful for HTML. One existing example is that Docusaurus treats them as an alternative to JSX. Meaning the names should also be able to match (most) JS identifiers.
In HTML, tag names and attribute names can match basically anything, because these names can only occur in special places. The `<` and whitespace and `=` and `/` and `>` are very strong indicators of where the parser is. In markdown, this is more complex. Is `:a*b*` an `a` directive followed by emphasis or an `a*b*` directive? Is `:a$b$` an `a$b$` directive, or is `b` math, when enabled?

Custom elements allow basically all higher-than-ASCII punctuation, and in the ASCII range `-`, `.`, `_`. JavaScript identifiers do not allow most punctuation, but allow `$` and `_` in the ASCII range. In markdown, all ASCII punctuation either already is something in CM (`_`) or could be something (such as `$` for math).

So I’d prefer starting with few ASCII punctuation, we can expand later:

Disallow all whitespace/controls
Disallow ASCII punctuation, except allow `.`, `-`, `_`
Allow the rest (basically alphanumerical and higher-than-ascii punctuation)
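
The three rules above can be sketched as a standalone predicate. This is not micromark code. `isNameCharacter` and `isDirectiveName` are hypothetical names, and "whitespace/controls" is approximated here with the Unicode `White_Space` property and the `Cc` general category, which is an assumption about what the rule should cover. The sketch also says nothing about which characters may start a name.

```js
// Assumption: whitespace is Unicode White_Space, controls are category Cc.
const whitespaceOrControl = /[\p{White_Space}\p{Cc}]/u
// The only ASCII characters allowed: alphanumerics plus ".", "_" and "-".
const asciiAllowed = /[A-Za-z0-9._-]/

// `char` is expected to be a string holding a single code point.
function isNameCharacter(char) {
  if (whitespaceOrControl.test(char)) return false
  if (char.codePointAt(0) < 0x80) return asciiAllowed.test(char)
  // Everything above ASCII that is not whitespace or a control is allowed.
  return true
}

function isDirectiveName(name) {
  return name.length > 0 && [...name].every(isNameCharacter)
}

console.log(isDirectiveName('приспів')) // true
console.log(isDirectiveName('a*b')) // false: `*` is ASCII punctuation
console.log(isDirectiveName('a$b')) // false: `$` is not allowed for now
console.log(isDirectiveName('a.b-c_d')) // true
```

Rejecting `*` and `$` keeps inputs such as `:a*b*` and `:a$b$` unambiguous: the name ends at the first punctuation character, and whatever follows is left to emphasis or math.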

viktor-yakubiv commented 8 months ago

> basically alphanumerical and higher-than-ascii punctuation

@wooorm do you have `\w` in mind or anything else?

I have found that `/[\p{L}\p{N}][\p{L}\p{N}._-]*/u` might work just fine, where `\p{N}` is a Unicode number, and `\p{L}` is a Unicode letter (docs, look for `General_Category`).

This may be expanded to:

export const unicodeAlphanumeric = regexCheck(/[\p{L}\p{N}]/u)

If we come to an agreement, I could prepare a pull request. What do you think?
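
One detail worth checking with a pattern of this shape is where the hyphen sits inside the character class. Written between `.` and `_`, it denotes a range from U+002E to U+005F, which takes in `/`, `:`, `<`, `[` and other ASCII punctuation, and does not include `-` itself. Written last, it is a literal hyphen. The sketch below compares the two spellings; the constant names are made up for this example.

```js
// Hyphen between "." and "_": a range covering U+002E through U+005F.
const rangeClass = /^[\p{L}\p{N}.-_]$/u
// Hyphen last: three literal characters ".", "_" and "-".
const literalClass = /^[\p{L}\p{N}._-]$/u
// Full name pattern using the literal spelling.
const namePattern = /^[\p{L}\p{N}][\p{L}\p{N}._-]*$/u

for (const char of ['-', ':', '<', '[', '.', '_', 'п']) {
  // Columns: character, matched by range form, matched by literal form.
  console.log(JSON.stringify(char), rangeClass.test(char), literalClass.test(char))
}

console.log(namePattern.test('приспів')) // true
console.log(namePattern.test('a:b')) // false
```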

wooorm commented 8 months ago

We already have the parts in micromark. I think this is fine:

const fine = code <= codes.del
  ? code === codes.dash ||
      code === codes.dot ||
      code === codes.underscore ||
      asciiAlphanumeric(code)
  : classifyCharacter(code) !== constants.characterGroupWhitespace

Using asciiAlphanumeric from micromark-util-character, classifyCharacter from micromark-util-classify-character, and codes and constants from micromark-util-symbol!

wooorm commented 8 months ago

Note I think similar rules need to be applied to attribute names. They are a bit more complex because say `.a.b` is already a shortcut for two classes.

Attributes are also prohibited from starting with an ASCII number (they’re currently only accepting ASCII too). I wonder if that’s needed.