kdl-org / kdl

the kdl document language specifications
https://kdl.dev

KDL 2.0: Supporting RTL documents? #363

Closed zkat closed 9 months ago

zkat commented 9 months ago

Currently, KDL 2.0 bans all direction-flipping special Unicode characters as literal characters in documents. If I understand correctly, this also means that you couldn't have a KDL document that would look "right" in an RTL language like Arabic or Hebrew.

What are our choices here?

On top of the issue with direction-flipping characters, I think we would need to consider doing things like flipping {} somehow (or supporting "} opens, { closes", somehow).

These may seem like small things, but they significantly increase accessibility of the language to a wide range of non-native (or non-) English speakers.

zkat commented 9 months ago

/cc @tabatkins

I don't have a lot of xp with this, so idk what to really do here.

porglezomp commented 9 months ago

> On top of the issue with direction-flipping characters, I think we would need to consider doing things like flipping {} somehow (or supporting "} opens, { closes", somehow).

I believe at least this part should be unnecessary (or even detrimental?)—Unicode specifies that parentheses are opening and closing characters (and not actually left and right despite their old descriptive names) and has algorithms for rendering the characters mirrored when contextually appropriate. I would expect the opening delimiter to be encoded as U+007B “Left Curly Bracket” even if it’s rendered mirrored.

tabatkins commented 9 months ago

Yeah, absolutely do not change characters - in RTL languages, ASCII { is still the opener, it's just flipped on display (and same with (, [, etc).
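The "same character, mirrored on display" point can be checked against the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `mirroring_report` is made up for illustration. It shows that the brackets KDL uses carry the Bidi_Mirrored property, while `=` and `"` do not.

```python
import unicodedata


def mirroring_report(chars: str) -> dict[str, bool]:
    """Map each character to whether Unicode marks it as Bidi_Mirrored."""
    return {ch: bool(unicodedata.mirrored(ch)) for ch in chars}


# Punctuation that appears in KDL syntax.
report = mirroring_report('{}()[]="')

for ch, is_mirrored in report.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: mirrored={is_mirrored}")
```

A renderer that implements the bidi algorithm swaps the glyph of a mirrored character when it sits in a right-to-left run, so the stored code point for the opener stays U+007B either way.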

KDL only uses weakly-directional characters in its syntax, which is generally a help. With your editor set to RTL it works just fine if you're only using RTL characters in your strings and idents. For example, given the following simple English document:

one two="three" four {
    five
}

The direct translation to Arabic looks like:

واحد إثنان="ثلاثة" أربع {
    خمسة
}

(The words I'm using are, in order:)

واحد
إثنان
ثلاثة
أربع
خمسة

So you can see that the entire thing flips into the expected order. Even mixing word directions works reasonably well:

one إثنان="three" أربع {
    five
}

There's definitely some confusion that can happen, but it's no worse than what you'd get in a mostly-English document with a little Arabic sprinkled in.

Things can get more confusing if your editor isn't set to RTL, but again, that's expected; it'll happen with any block of RTL text in an LTR editor. (Note that all of the code blocks in my comment explicitly have dir=rtl lang=ar set.)

So in general, the correct answer is "don't worry about it". We're already in a better situation than most programming languages, where the preponderance of English-language keywords means there's a lot of strong-ltr content around causing flipping issues. And in any case, manually inserting directional overrides is not easy or generally recommended in any context, really.


Here's a confusing bit: a run of English will end up staying in its original order, rather than flipping. In this example, #true is the value of the property, and is followed by a three argument (aka one two=#true three four), but it ends up looking almost backwards.

one إثنان=#true three أربع

Again tho, this is just something that happens to all mixed text, regardless of whether it's primarily LTR or primarily RTL. A good editor would automatically tag wrong-direction words with bidi isolation, so they'll move around properly, like:

(GitHub strips the <bdi> markup here, so the example still displays wrong. It should look like this: [image], as in this live example.)
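As a rough illustration of what such tagging could look like on the display side, here is a small Python sketch that wraps each whitespace-separated token of a line in `<bdi>` for an HTML renderer. The helper name `isolate_tokens` and the choice to isolate per token are assumptions made for this sketch; it is not necessarily how the example above was marked up, and it changes only the rendered HTML, never the KDL source.

```python
import html
import re


def isolate_tokens(line: str) -> str:
    """Wrap every non-whitespace token in <bdi>, keeping whitespace as-is."""
    # Splitting on a capturing group keeps the whitespace runs in the result.
    parts = re.split(r"(\s+)", line)
    out = []
    for part in parts:
        if part == "" or part.isspace():
            out.append(part)
        else:
            # Escape the token so quotes and angle brackets stay literal.
            out.append(f"<bdi>{html.escape(part)}</bdi>")
    return "".join(out)


print(isolate_tokens("one إثنان=#true three أربع"))
```

Each `<bdi>` element takes its own direction from its first strong character, so inside a `dir=rtl` block the tokens are laid out right to left as whole units instead of the English run sticking together.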

But I'm not sure what the state-of-the-art in RTL-supporting editors is these days. Sublime does badly with it, at least.

zkat commented 9 months ago

@tabatkins won't our ban on directional characters prevent this from working, though? Do we need to make changes to those constraints?

tabatkins commented 9 months ago

No, like I said:

> And in any case, manually inserting directional overrides is not easy or generally recommended in any context, really.

Unless you mistakenly thought I was referring to those when I said "strongly directional" vs "weakly directional"? Those terms just mean characters that carry an intrinsic direction with them (like a, which is strongly ltr) vs those that will adopt the directionality of the content around them (like =, or (, or whitespace.)
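The distinction can be inspected directly. This is a minimal sketch using Python's standard `unicodedata` module; the helper names `bidi_classes` and `is_strong` are invented for illustration. It prints the Unicode bidi class of each character in a short KDL-like fragment. Strictly speaking, the Unicode bidi algorithm files `=`, `(`, and the space under "neutral" rather than "weak", but the practical point is the same: they are not strong and take their direction from context.

```python
import unicodedata

# Bidi classes that carry an intrinsic direction.
STRONG_CLASSES = {"L", "R", "AL"}


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each character with its Unicode bidi class (e.g. 'L', 'AL', 'ON')."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def is_strong(ch: str) -> bool:
    """True if the character has a strong direction of its own."""
    return unicodedata.bidirectional(ch) in STRONG_CLASSES


for ch, bidi_class in bidi_classes('a و = ( { "'):
    kind = "strong" if is_strong(ch) else "not strong"
    print(f"U+{ord(ch):04X} class={bidi_class} ({kind})")
```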

The actual Unicode characters that say "fuck this text, everything after this point is rtl" (the direction-override chars) are funky exceptions that are only meant to be used in weird legacy contexts where there's no ability to use better isolation and direction-tagging. They're super not meant for actual text authoring.
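For a concrete sense of which characters are meant, here is a small Python sketch that scans text for direction-control code points. The ranges (U+200E to U+200F, U+202A to U+202E, U+2066 to U+2069) are my reading of the KDL 2.0 list of disallowed literal code points and should be treated as an assumption to check against the spec; `find_direction_controls` is a hypothetical helper, not part of any KDL implementation.

```python
import unicodedata

# Assumed ranges: marks, embeddings/overrides, and isolates.
DIRECTION_CONTROL_RANGES = (
    (0x200E, 0x200F),  # LRM, RLM
    (0x202A, 0x202E),  # LRE, RLE, PDF, LRO, RLO
    (0x2066, 0x2069),  # LRI, RLI, FSI, PDI
)


def find_direction_controls(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', name) for each direction-control character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in DIRECTION_CONTROL_RANGES):
            hits.append((index, f"U+{cp:04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


# An ordinary Arabic KDL line contains none of these characters.
print(find_direction_controls('واحد إثنان="ثلاثة" أربع'))
# A string with an explicit override does.
print(find_direction_controls("one \u202Etwo"))
```

The first call finding nothing is the point of the thread: plain RTL letters are ordinary strong characters, so a rule that rejects only these control code points does not reject Arabic or Hebrew text.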

zkat commented 9 months ago

Alright. It sounds like I can just close this, and our 2.0 changes won't prevent folks from writing RTL docs. Thanks.