[v2] Unicode punctuation: Go Big Or Go Home?

zkat commented 3 months ago

So I've been working on the kdl-rs update for v2, and something that jumped out at me is that even though we're doing multiple unicode = signs for property delimiters, we're still using "regular" characters for quotation marks and curly braces.

After thinking about it, I'm wondering if the equals sign thing is just silly? I do like the idea, at the security level, that foo＝bar can't pass itself off as an argument with a string value of "foo＝bar", as opposed to a property. That seems important!

But folks can still do:

foo｛ bar ｝

(those are the fullwidth versions of the curly braces)

or:

foo “bar”

(fancy quotes).

I think the only thing that would leave is the node termination semicolon, which also has various unicode variants.

So the question is: Do we expand the treatment we gave = to all the other punctuation, thus making it so "a kdl document means basically what it looks like, no surprises", or do we roll back the equals change and surrender to our unicode overlords altogether? I think we should go big or go home on this one. Doing it halfway doesn't feel right.

I do think it would be cool to include all the various unicode variants, though. That's not a thing you see very often...

zkat commented 3 months ago

oh this is an even deeper can of worms.

Number signs for keywords? fullwidth characters for keyword names? « » and 「」 quotation marks? type annotation parens?

otoh, if we do this and we nail it, it means people can write kdl documents in their native language and input modes, and it'll Just Work, and there won't be any surprises?

This also affects reserved KQL characters in identifiers, (and KQL itself)

SnoopJ commented 3 months ago

Summarizing some of the thoughts I had from discussing this with you on fedi, hopefully not adding too much noise here:

Doing this for = is feasible, but I'm not sure if it's all that meaningful to human beings (who are probably ~always going to type =. I would expect ＝ to be a copy/paste artifact more often than I'd expect it to be deliberate use). OTOH, implementing it is straightforward if the feature rests on normalization (this issue taught me that ＝ normalizes to = under NFKC :smile_cat:!)
For keyword names or anything else that counts as an 'identifier', the guidance of UAX #31 probably suffices (i.e. equivalence defined by normalization, with the possibility for a custom profile)
Doing this for quotes and other punctuation sounds like it could be really hard to get correct. For puncutation that has one obvious pair partner («», 「」, 【】, etc.) I can see how it would work (you figure out which pair you're in and look for the terminator after you've seen the opener), but there are non-symmetrical cases that seem tricky.
- Pathological example: suppose a user from a region where the 'natural' quoting style is „…“ and a user from a region where „…” is used are collaborating on a document written in KDL. The first one writes out a string like „This is some „example text”“ and to them this makes sense: the string isn't finished until “, but to the second user, I can see how that looks like a syntax error, because the string terminated at ” and there's a trailing mark.
- I am not even sure if these hypothetical users need to be from different regions for this construction to work. Wikipedia tells me that Ukraine uses both styles as alternative punctuation, although one is marked as "rarely used"
- Unicode does define start/end (Ps, Pe) and initial/final (Pi, Pf) categories of punctuation that seem to mostly encircle human usage, but the above pathological example defies that approach because „…“ starts with a character in Ps and ends with a character in Pi, so a rule that tried to look for a Ps, Pe or Pi, Pf pair is going to be confused. I don't know if this is an example so pathological that it can be addressed with an ad-hoc rule, but my gut tells me that maybe there are others.

So…the sum there is that I find it a commendable goal to support human language, and the context-free stuff (=, keywords/identifiers) seems like it can rely on existing guidance from the Unicode Consortium, but punctuation that relies on context seems much harder to get correct.

zkat commented 3 months ago

So it turns out, after I looked into it, too, that UAX #44 defines character classes for opening/closing brace pairs, using the Ps and Pe tags. It also provides the same thing for quotes, with Pi and Pf.

This gets us 90% of the way there, I think. The rest is picking out the exceptions we want for one meaning or another, and reserve those for, e.g. "curly" semantics vs "type annotation" semantics. We can also exclude some as needed. Pi and Pf can cover all the quotes, as well, and we can choose to manually exclude a couple of them (single quotes, basically).

There's also things like the EU's guidance on quotation marks, but I think unicode's tagging would be enough for us tbh.

zkat commented 3 months ago

Sorry, it took me a sec to take a break from stuff so I could sit and read this properly: I understand the pathological issue, but I'm wondering: what if we just let "any Ps matches any Pe", for the sake of simplicity, and "Any Pi matches any Pf"?

It does mean that things can look weird, but they won't be _dangerously wrong. It just means you might have a document that looks like foo «string 」 and you know, I don't really care about full correctness there. What matters is that foo « here「and here 」 is invalid, since there's two Pis in a row. What do you think about that?

As far as the = sign situation goes: it's actually a lot easier than you'd think! I actually thought about it while writing one of the examples for KDL that showed it supports unicode. I was in a CJK entry mode, and typed = but got ＝ because that's just what you get when you're in that mode! It seemed only right that I wouldn't have to switch back just to be "syntactically correct", when it seemed perfectly reasonable to want to do that.

SnoopJ commented 3 months ago

what if we just let "any Ps matches any Pe", for the sake of simplicity, and "Any Pi matches any Pf"?

I think the worst-case scenario is that KDL would cover a fairly wide swath of natural language with maybe some edge cases. Maybe that's not a very big deal, especially as I'm parachuting into this issue without a proper understanding of the project's goals, attracted by the interesting problem.

I was thinking of cases like «string 」 as well when considering it, but as you say, maybe this isn't such a big deal.

What matters is that foo « here「and here 」 is invalid, since there's two Pis in a row. What do you think about that?

I think it boils down to whether or not anybody would ever want to write a string whose contains include other punctuation that could have opened the string as KDL's parser sees it. If that's not something you want to support, then I think doing it the way you describe where you always open with Pi and close with Pf seems reasonable (modulo Ps, Pe being also in the mix here, but maybe that edge can be rounded off as not important enough to fuss about?)

zkat commented 3 months ago

One of the goals of kdl, for me, is to have a human oriented configuration languages. Humans are messy, they type things in funny ways, and it's important for kdl to support that and protect them from the worst cases (like, we disallow some bidi stuff because it can be malicious and has little actual usage)

If they want to write it literally into a string, they can use raw strings (#"«string 」"#), or escape them with \, so there's a reasonable, imo, escape hatch.

(which reminds me, I guess we need NFKC for \ as well, huh)

scott-wilson commented 3 months ago

I have to admit, this does scare me. My main issues are that this does add a lot of complexity that may require a lot of complex rules to manage, and means that you have to keep up with updates to Unicode and understanding of what characters mean.

I personally vote to keep the document syntax as simple as possible.

This reminds me of another file format called Collada. It is an XML 3d asset/scene format that basically allowed the exporter to dump the data in any way it wanted (Z up vs Y up, calling things blend shapes or shape keys, etc). This made it incredibly hard for importers to be able to consume the data, and basically why you don't see much support for that format. The reasons why FBX and GLTF are more successful is because they expect the data to be formatted in a certain way/make the exporters do the hard work, so that it is easier for importers to behave. The reason why I bring this up is it is an example of what could happen if this extra complexity gets introduced.

But, that's my two cents. I do agree with the idea of making the document format really great for humans, but I'm concerned with how this will become a maintenance burden/bad for machines.

zkat commented 3 months ago

@scott-wilson I’m heading something in the direction of “look up the current table of Unicode characters under the specific tags we want, and name every pair in the spec, in specific tables (just like we do with newlines and whitespace and equals right now), and possibly, MAYBE, add a clause that says “implementations MAY extend this table if future revisions of Unicode introduce new pairs”, but that last clause is… not very likely to fire, and if it does fire, we have the tools to specify consistent semantics.

In the end, I see this working the same way as our other Unicode support stuff (with the tables), not some mysterious vague mention that leaves questions in the air.

Does that help you?

scott-wilson commented 3 months ago

That does help, but I'm still worried about some of the rules with stuff like opening and closing characters. For example, if we had something like {blah}, but { is a standard curly brace, and } is not, then would that be considered okay, since everything will be mapped down to basically curly braces? Or would we say "Yes, these are effectively curly braces, but since they do not match in a human understandable way, we won't accept them"?

Also, I admit that my worries could be 100% unfounded. Right now alarm bells are ringing in my brain, and I'm still trying to understand what it is that is triggering my reaction to this idea.

zkat commented 3 months ago

@scott-wilson from what we were talking about above, non-"matching" openers and closers would be valid, as long as they're the same "class", but that's not necessarily how we have to do it. We could require they be matched by their paired opener.

The idea is that yes, your example would ideally "just work", because, honestly, it looks like it should. I don't see much of a reason to say it shouldn't.

tabatkins commented 3 months ago

It does mean that things can look weird, but they won't be _dangerously wrong.

I do think it's dangerously wrong that I would be incapable of writing "the french use «quotes» like that" without remembering to escape the ».

If we're allowing a bunch of quote styles, this suggests we should also allow the ASCII ', and the above rule would be even worse then, as every string like "hey, don't do that" would be broken.

Unless (if I'm reading between the lines correctly on a later comment of yours) you're requiring that paired quoting characters come in pairs? So my first example using french quotes is fine, because the ending french quote associates with the opening french quote, and thus doesn't close the string?

(ASCII apostrophe would still be a problem.)

tabatkins commented 3 months ago

Or hm maybe I misinterpreted your sentence

What matters is that foo « here「and here 」 is invalid, since there's two Pis in a row.

and you're suggesting not that there's pair tracking, but that using an opening quote character in a string at all is invalid, since it would attempt to pair with the quote that actually opened the string, and that's invalid?

If so I really don't like that. It feels like a big footgun that I'd have to escape or raw-string all quote characters, ever.

zkat commented 3 months ago

Thinking about The Way Things Are Done, it does make sense that we would do pair tracking, which is what every programming language that lets you use both ' and " for strings does. So in that case you WOULD be able to write your example string as is, and only escape things if you were using guillaumes as your opener/closer

how does that sound, @tabatkins ?

tabatkins commented 3 months ago

If we're tracking the "appropriate" closer for the given opener, and you don't need to escape anything inside the string except for the appropriate closer, then I'm a lot happier, yeah. (There might be multiple valid closers for a given opener, per @SnoopJ's example of „…“ and „…” and a few more obvious examples in the Wikipedia table, but I think those are all pretty reasonable.)

(I'm neutral to weakly positive on the overall change; your reasoning makes sense, but it's relatively unorthodox in this space.)

alightgoesout commented 3 months ago

I am both terrified and very excited by this.

For what it's worth, the most common French keyboard layouts make it very difficult to type « guillemets ». I use a regular French PC keyboard and I have to rely on my OS to enter them (I use compose key on Linux, on Windows I have to install a third party tool). 99.9% of French people incorrectly use " instead.

Also, typographically correct guillemets must include a narrow non-breaking space after the opening one and before the closing one. I am sure that a lot of other languages have little quirks like this. I don't think we can cover all the cases, but we are going to set expectations.

tabatkins commented 2 months ago

I highly doubt people using guillemets in their documents actually insert nnbsps into their source as a general rule, anyway. That sounds like something done by a typesetter. ^_^

zkat commented 2 months ago

More stuff in favor of quotes and how to implement them: CSS has a quotation system that automatically changes based on language: https://www.w3.org/TR/css-content-3/

tabatkins commented 2 months ago

Tho that's purely a display artifact, and doesn't auto-match anything - you have to provide the opening and closing quotes yourself. It just lets you write a <q>quoted</q> word in your source and have the quotes localized (using several :lang(...){quotes:...;} rules). Or automatically use double-quotes on the top-level, and single quotes when nested inside of another quote, that kind of thing.

alightgoesout commented 2 months ago

I highly doubt people using guillemets in their documents actually insert nnbsps into their source as a general rule, anyway. That sounds like something done by a typesetter. ^_^

While I don't use narrow non-breaking spaces, I do use regular non-breaking spaces. Most French people use straight quotes (because that is what is on their keyboard). Those who use guillemets use regular spaces, but word processors will replace them with non-breaking spaces (or even insert the space if it was not entered). I know nobody will write KDL in Word, but if the goal is to allow people to write KDL as they would write their natural language, people will add spaces and they will expect those spaces to be part of the punctuation, not the quoted string.

tryoxiss commented 2 months ago

I do like this idea, but I worry about its affect on performance. It would drastically incresae the amount of cases needed to be handled by parsers, which seems like it would make a decent diffrence. I don't know how KDL benchmarks now, its defintely not focused on performance from what I have seen, but I can't imagine its irrelevent either, since you mention its meant to also be used as a serialization language which have the singular goal of performance for thier neiche.

zkat commented 2 months ago

I'd be very surprised if a modern parser really slowed down from having to check a ~dozen cases for a certain token. It's really not a very big number.

tryoxiss commented 2 months ago

I'd be very surprised if a modern parser really slowed down from having to check a ~dozen cases for a certain token. It's really not a very big number.

Fair enough. Partially its not knowing how many punctuations there are (maybe theres like 40 types of quotes, I don't know!), and with what is mentioned above they are paired rather than just "find an open, find a close" so the state would need to be stored. Both scale badly with a lot more options, but a few dozen is likely more than fine.

zkat commented 4 weeks ago

to update on this: I've recently started writing a lot of kdl by hand on my phone because I'm using (v1) in Iron Vault, and one thing that stood out is that I have to constantly remember to long-press on the double quotes to pick "programmer quotes".

So, I think it's a great idea to specify this, at least for some of the usual suspects.

niluxv commented 2 weeks ago

Terrifying. I'm sorry for being a bit sceptical; I don't want to spoil anyone's excitement, but please let me share some reasons why I think "go home" would be the better/safer choice here:

This does add a significant amount of complexity to the language. Complexity is usually bad. For specifications (harder to read and understand correctly), implementations (both harder to write, and more likely to contain bugs) and users (more cognitive overhead, and unexpected results).
I think Snoopj and alightgoesout already established that getting 100% there is impossible, or at the very least unfeasible (e.g. what to do with spacing around guillemets, different quotation styles conflicting with one another). I would say that mostly just works is worse than completely unsupported. Mostly works means that there are rough edges, unexpected behaviour. Just requiring the use of ASCII =, " and \ is clear and obvious; unicode support with some edge cases means cognitive overhead and confusion.
Unicode is a moving thing. There are two options: (i) move along with unicode (implementations need to use the latest unicode information tables), or (ii) fix unicode version/recognised delimiter pairs. (i) means that the meaning of a KDL file would change through time, which is probably the worst property a configuration/serialisation/document format can have. (ii) is IMHO the lesser bad, but it means that inconsistencies will accumulate (until the next major version bump for KDL).

Thus, it seems to me, "go big" conflicts with most of the design principles of KDL: Maintainability, Cognitive simplicity and Learnability, Ease of de/serialization, Ease of implementation. Only flexibility might be served by it.

kdl-org / kdl

[v2] Unicode punctuation: Go Big Or Go Home? #386