toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Not all emojis work as bare keys #954

Open arp242 opened 1 year ago

arp242 commented 1 year ago

I was writing test cases for this, and using a pirate flag (🏴‍☠️) doesn't work; this is:

     CPoint  Dec    UTF8        HTML       Name (Cat)
'🏴' U+1F3F4 127988 f0 9f 8f b4 &#x1f3f4;  WAVING BLACK FLAG (Other_Symbol)
'‍'  U+200D  8205   e2 80 8d    &zwj;      ZERO WIDTH JOINER (Format)
'☠'  U+2620  9760   e2 98 a0    &#x2620;   SKULL AND CROSSBONES (Other_Symbol)

The flag and ZWJ are fine, but the skull and crossbones isn't in the currently allowed range.

This seems confusing, since most emojis work. It took me quite a bit of time to figure out when modifying my parser to support this, because I just assumed I had missed something, but it turns out it's just not in the allowed range:

unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces
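
To make the holes between those ranges explicit, here is a minimal Go sketch that computes the gaps between the four ranges quoted in the two ABNF lines above. The names `span`, `quoted`, `gaps` and `gapFor` are made up for this illustration, and the table covers only those four ranges, not the whole bare-key grammar.

```go
package main

import "fmt"

// span is an inclusive range of code points (hypothetical helper type).
type span struct{ lo, hi rune }

// quoted holds only the four ranges from the two ABNF lines quoted above,
// in ascending order. It is not the full bare-key grammar.
var quoted = []span{
	{0x2070, 0x218F},
	{0x2460, 0x24FF},
	{0x2C00, 0x2FEF},
	{0x3001, 0xD7FF},
}

// gaps returns the spans that lie between consecutive entries of a sorted,
// non-overlapping list of allowed ranges.
func gaps(allowed []span) []span {
	var out []span
	for i := 1; i < len(allowed); i++ {
		lo, hi := allowed[i-1].hi+1, allowed[i].lo-1
		if lo <= hi {
			out = append(out, span{lo, hi})
		}
	}
	return out
}

// gapFor reports which gap, if any, contains r.
func gapFor(r rune, gs []span) (span, bool) {
	for _, g := range gs {
		if r >= g.lo && r <= g.hi {
			return g, true
		}
	}
	return span{}, false
}

func main() {
	gs := gaps(quoted)
	for _, g := range gs {
		fmt.Printf("excluded: %U-%U\n", g.lo, g.hi)
	}
	// SKULL AND CROSSBONES and U+26A7, both mentioned in this issue.
	for _, r := range []rune{0x2620, 0x26A7} {
		if g, ok := gapFor(r, gs); ok {
			fmt.Printf("%U falls in the excluded span %U-%U\n", r, g.lo, g.hi)
		}
	}
}
```

Both code points land in the U+2500..U+2BFF gap, which is the span this issue is about.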

Looking at the U+2500..U+2bff range, I don't really see why we need to skip a lot of these things.


I know we discussed this before, but I still think we should either allow only letters+numbers or allow almost everything (with a few exceptions); the current behaviour is just confusing. The examples use an emoji and ZWJ is explicitly allowed, so you'd expect all emojis to work, but it turns out only some do. It just so happened that "pirate flag" was the first emoji I tried, but there are probably others as well, and with ZWJ combinations it'll be whack-a-mole.

Either way, IMHO we should support all emojis or none. Many other ZWJ combinations do work fine; 🏳️‍🌈 (U+1F3F3 ZWJ U+1F308) or 🏴󠁧󠁢󠁷󠁬󠁳󠁿 is okay, but 🏳️‍⚧️ isn't (as U+26A7 isn't in the allowed range). In a quick test it seems all flags work, except two.

Originally posted by @arp242 in https://github.com/toml-lang/toml/issues/891#issuecomment-1383980000

arp242 commented 1 year ago

I did a quick check, and 179 emojis currently fail (the other 1530 work); here's a list: https://gist.github.com/arp242/a3b99e52c9dea2b6e2d6217aab490ad3 (that's based on Unicode 14, not 15, so there may be a few more – I need to update my tool to 15).

Also, the variation selectors (U+FE0F in the example above) are a right pain; these are pretty much invisible in most editors. These should be excluded together with all the RTL stuff (which already is).
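
Since these code points are invisible in most editors, one way to see them is to dump a key as a list of code points. The following is only a small sketch; `codePoints` is a hypothetical helper name, not part of any TOML library.

```go
package main

import (
	"fmt"
	"strings"
)

// codePoints renders every rune of s in U+XXXX notation, so that invisible
// characters such as U+FE0F or U+200D become visible.
func codePoints(s string) string {
	parts := make([]string, 0, len(s))
	for _, r := range s {
		parts = append(parts, fmt.Sprintf("%U", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	// The same smiley without and with a trailing variation selector.
	fmt.Println(codePoints("\u263A"))
	fmt.Println(codePoints("\u263A\uFE0F"))
	// The pirate flag sequence from the top of this issue.
	fmt.Println(codePoints("\U0001F3F4\u200D\u2620\uFE0F"))
}
```

The two smileys usually look the same on screen, but they are different strings and therefore different keys.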

abelbraaksma commented 1 year ago

Either way, IMHO we should support all emojis or none.

You use ZWJ for creating the emoji. While this is fine, the overlaid code point itself is not in the proper range. We’ve looked at more complex ranges, but decided against it for the added complexity it brings. There will always be some ranges of code points people may feel are missing. Using ZWJ you can ‘invent’ emojis or other characters.

I agree it is somewhat unfortunate that certain combinations are currently not possible. But keep in mind that we’re talking about code point ranges, not about characters. And what you’re describing is allowing certain characters, which is an avenue we’re trying to avoid.

abelbraaksma commented 1 year ago

That said, it’s possibly an oversight, as I don’t see anything in 2600-26FF that needs to be illegal. We’d have to look a little more closely at the wider range you mention, and at the discussion or commit log, to find out whether we did this deliberately (and then reassess whether that conclusion is still valid) or whether it was an honest mistake in the added ranges.

I tried to be meticulous, but hey, we’re only human ;).

Keep in mind that there’s also the argument that we don’t want to over-complicate the ranges. We try to be inclusive, and mainly ban ‘unsuitable’ ranges, while including the rest.

arp242 commented 1 year ago

Keep in mind that there’s also the argument that we don’t want to over-complicate the ranges. We try to be inclusive, and mainly ban ‘unsuitable’ ranges, while including the rest.

Yes, the current check I need to do is:

func isBareKeyChar(r rune) bool {
    return (r >= 'A' && r <= 'Z') ||
        (r >= 'a' && r <= 'z') ||
        (r >= '0' && r <= '9') ||
        r == '_' || r == '-' ||
        r == 0xb2 || r == 0xb3 || r == 0xb9 || (r >= 0xbc && r <= 0xbe) ||
        (r >= 0xc0 && r <= 0xd6) || (r >= 0xd8 && r <= 0xf6) || (r >= 0xf8 && r <= 0x037d) ||
        (r >= 0x037f && r <= 0x1fff) ||
        (r >= 0x200c && r <= 0x200d) || (r >= 0x203f && r <= 0x2040) ||
        (r >= 0x2070 && r <= 0x218f) || (r >= 0x2460 && r <= 0x24ff) ||
        (r >= 0x2c00 && r <= 0x2fef) || (r >= 0x3001 && r <= 0xd7ff) ||
        (r >= 0xf900 && r <= 0xfdcf) || (r >= 0xfdf0 && r <= 0xfffd) ||
        (r >= 0x10000 && r <= 0xeffff)
}

Which doesn't exactly fill me with joy. But I'd rather have one somewhat ugly "wtf?!" function than silly stuff like "😗 works but ☺️ doesn't". These are quite distinct codepoints, but they are grouped next to each other in "emoji ordering". The way emojis work is a bit of a mess.

abelbraaksma commented 1 year ago

I found out what the original motivation was:

https://github.com/toml-lang/toml/issues/687#issuecomment-567490186

Basically, we accept letter-like code points. Dingbats, mathematical symbols and box drawing code points aren’t ‘letter-like’. Neither are emojis, of course. But the ranges of emojis that are allowed have been added to later versions of Unicode and belong to “other languages that weren’t previously assigned”. As such, they belong to “be liberal in what to accept from future versions of Unicode”.

Perhaps the right course of action would’ve been to exclude other non-letter-like ranges from later versions. However, that brought about another downside: ID tokens in HTML and XML would no longer be valid unquoted names. Several RFCs overlap with the current definition. While this is not necessarily a goal for TOML, it has its benefits.

With the arguments in the mentioned thread, I still think we’re on the right track here, using the ‘letter-like’ definition of the most widely implemented and used Unicode version (I believe that’s 5 or 6, at least .NET Framework and Windows prior to v11 (or v10?) use 5.x.).

Of course, we could allow more tokens that aren’t allowed elsewhere. Or disallow more tokens that aren’t disallowed elsewhere. This would take us further away from widely established identifier definitions, but we may choose to go down that path.

arp242 commented 1 year ago

This would take us further away from widely established identifier definitions

TOML already allows almost everything as quoted keys, so I think this doesn't matter at all. Directly using TOML keys in HTML, XML, or pretty much anywhere else without processing is already something you can't do.

Looking at some other environments, there doesn't seem that much consensus in the first place:

Not sure what other languages/formats support Unicode identifiers off the top of my head.


Going back to basics, the goals I'd set would be:

In that sense, "support emojis" is out of scope IMO; I don't think it would be horrible to lose support for it especially since you can still use them inside quoted keys. BUT having ~90% of the emojis work fine and ~10% not work is a bug IMO, especially since an emoji is explicitly included as an example.

It's probably better to allow too little and then expand on that later if there's a demand for it. Once we allow something we can never take it back because that would break compatibility. And there's also #941; we need something for that to address "minimize potential for confusion", and tweaking (i.e. limiting) the set of allowed codepoints is one possible way to address that.

pradyunsg commented 1 year ago

TBH, it looks like we should align with Unicode TR31 syntax, rather than trying to come up with something else.

It's what Go, Rust and Python seem to be doing (IIUC), and I think that might just be a more "obvious" way to achieve what we want to achieve here.

ChristianSi commented 1 year ago

The only possible problems I see in this range are the Eight Trigrams (☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷) and various symbols related to yin and yang (⚊ ⚋ ⚌ ⚍ ⚎ ⚏). Some of these, especially ⚌, look very much like the equals sign (=), so it might be a good idea to avoid them in unquoted keys to prevent possible confusion.

One idea: shorten the forbidden range so that it runs from U+2630 (☰) to U+268F (⚏). Unfortunately, this shorter range still contains some very popular symbols (e.g. ☺ ♀ ♂) whose exclusion could remain confusing.

Another idea: actually the Eight Trigrams are probably OK, since they all have three lines rather than the two of the equals sign. Two of the yin and yang symbols (⚊ ⚋) should be fine too, since they look similar to the underscore, which is already allowed. So we could just forbid U+268C to U+268F (⚌ ⚍ ⚎ ⚏), allowing everything else in that range.

abelbraaksma commented 1 year ago

TBH, it looks like we should align with Unicode TR31 syntax

We previously decided against it, because it’s complex and, iirc, relies on categories. It’s likely (but I’d have to check) that it isn’t compatible with what we currently have (quite apart from the fact that we already allow keys to start with a digit).

Edit: the TR31 set is very disjoint to what we have:

ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.

The two biggest issues: it uses categories, and letter-like only. Miscellaneous Symbols, which we explicitly include, are forbidden. Also, categories are dependent on Unicode version, which we try to avoid.
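
To give a feel for what a category-based definition looks like in practice, here is a rough Go approximation of the two TR31 definitions quoted above, using the general-category tables from the standard `unicode` package. It is deliberately incomplete: it ignores Other_ID_Start, Other_ID_Continue, Pattern_Syntax and Pattern_White_Space, and its results depend on the Unicode version bundled with the Go release. The function names are made up for this sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// isIDStartApprox approximates ID_Start: letters (Lu, Ll, Lt, Lm, Lo) plus
// letter numbers (Nl). Other_ID_Start and the Pattern_* exclusions are
// left out, so this is not a conformant implementation.
func isIDStartApprox(r rune) bool {
	return unicode.In(r, unicode.L, unicode.Nl)
}

// isIDContinueApprox approximates ID_Continue: everything in ID_Start plus
// nonspacing marks, spacing combining marks, decimal numbers and connector
// punctuation. Other_ID_Continue is left out.
func isIDContinueApprox(r rune) bool {
	return isIDStartApprox(r) ||
		unicode.In(r, unicode.Mn, unicode.Mc, unicode.Nd, unicode.Pc)
}

func main() {
	for _, r := range []rune{'a', 0x00E9, '1', '_', 0x2620, 0x200D} {
		fmt.Printf("%U start=%t continue=%t\n",
			r, isIDStartApprox(r), isIDContinueApprox(r))
	}
}
```

Under this approximation U+2620 is rejected outright because its category is Other_Symbol, which illustrates the "letter-like only" point.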

The full range, for any supported Unicode version, is rather complex. In the previous thread, there’s a comment that shows how complex, and we all kinda sighed with relief that in the end it wasn’t necessary to go that route.

abelbraaksma commented 1 year ago

Some of these, especially ⚌, look very much like the equals sign (=),

If we really want to include this range, we can do the same as we did for the Greek Question Mark (which looks like a semicolon) and forbid only the Yin & Yang sign that looks like the equals sign.

TOML already allows almost everything as quoted keys

That’s a good point.

Allow people to use their script/alphabet of choice (Chinese, Tamil, Icelandic, Cuneiform, whatnot).

Agreed. Which is what we support. Miscellaneous Symbols do not fit that description, but I see your other point that it’s a little confusing that some ranges are currently excluded.

I’m not against including it (the range in the OP). However, I’m a little afraid that every few months we’re going to open this up again because some person’s favourite symbol isn’t allowed. I may be wrong about this, of course; perhaps this is the ‘last missing range’.

We’ve spent many months coming to the current range, at some point we’d just have to settle and call it a day ;).

ChristianSi commented 1 year ago

@abelbraaksma: Yeah, just forbidding ⚌ (U+268C) and allowing everything else in that block would be fine with me as well.

eksortso commented 1 year ago

@abelbraaksma @ChristianSi I would much prefer that we include the Miscellaneous Symbols block but exclude the two-line yin and yang symbols U+268C to U+268F, as previously suggested, due to their resemblance to the equals sign.

unquoted-key-char =/ %x2600-268B / %x2690-26FF          ; include Miscellaneous Symbols, but exclude symbols resembling an equals sign

arp242 commented 1 year ago

There are some other syntax-like homographs too:

'＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
'＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
'﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
'﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
'﹐' U+FE50     SMALL COMMA (Other_Punctuation)
'︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
'˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
'՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
'܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
'₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
'⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
'࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

That's from a quick visual inspection; not a full list. There's some more in the "Halfwidth and Fullwidth Forms" and "Small Form Variants" blocks in particular.

abelbraaksma commented 1 year ago

That’s an interesting list, but I don’t think we should try to be exhaustive here. There’ll always be certain glyphs that look confusing. Put in ZWJ and you can create any glyph from smaller components.

arp242 commented 1 year ago

Maybe we shouldn't allow ZWJ?

I've been going back-and-forth on what to do about all of this. While the original issue is "Not all emojis work as bare keys", this ties in to other issues as well and there are knock-on effects.

We already allow almost everything as quoted keys. In hindsight, I think this was a mistake, but we can't change that now, and people don't use quoted keys that much since it's annoying to type (many TOML users probably don't even know you can use it) so it's less of an issue in the real world. With bare keys, people will actually start using all the stuff that's allowed.

I'm worried that allowing too much will lead to confusion. Homoglyphs are actually not something I'm very worried about, since no reasonable person would use "＃ trollolol = 1" (U+FF03, not a "real" hash) other than maybe as a practical joke on their coworkers. No one really enters these things by accident. Other things that are explicitly excluded now, like the multiplication sign (×), aren't that much of an issue either; it's very similar to the letter "x", but no one enters "×" by accident when they intended to write "x".

I think it's fine to allow TOML users to do "stupid things", and it's okay to rely on TOML users being reasonably sane.

What I am worried about are "invisible" characters such as ZWJ, variation selectors, combining characters, and things like that. All of this is very non-obvious, and easy to get confused by, even for people well versed in how all of this works (i.e. you and me).

So while "＃ trollolol = 1" is certainly confusing, it's not really an issue that crops up in the real world. Same with U+268C-U+268F. I think this is almost a philosophical issue: "if a tree in a forest is confusing but no one sees the tree being confusing, then is it really confusing?"

So, back to ZWJ: if we disallow ZWJ lots of emojis won't work, and to be consistent we'd have to disallow at least the commonly used emojis like 😂 and whatnot, which would make the codepoint range a bit more complex.

However, in general, I'd say:

So I'd say we probably shouldn't allow ZWJ, and variation selectors, and combining characters, and perhaps a few other things. Those are things that will lead to confusion, unlike ⚌, ＃, and whatnot. I don't actually care all that much about those because I don't expect anyone will be confused by it in real-world scenarios.

abelbraaksma commented 1 year ago

ZWJ is used in many scripts to create valid characters, glyphs and words. It’s not exclusive to emojis. I don’t think having it is an issue. More the opposite. The whole idea here is to be inclusive wrt languages and scripts. The side effect of this approach is that some emojis also work, because they are in codepoint ranges not explicitly excluded, mainly because these ranges weren’t assigned to in older versions of Unicode.

By and large this should be fine. Identifiers will typically be expressed in someone’s native language, script, or a common language like English, Arabic or Spanish. The need for dingbats or emojis is likely comparatively small.

I’ve no problem keeping the status quo, or adding other ranges, but whatever we do, there’ll always be new codepoints assigned and they may or may not contain non letter-like characters. These will always be in the already allowed ranges and therefore we cannot exclude pre-emptively.

ChristianSi commented 1 year ago

I'm fine with including this range except for U+268C to U+268F (⚌ ⚍ ⚎ ⚏), as @eksortso favors (https://github.com/toml-lang/toml/issues/954#issuecomment-1398590999). That'll be an easy and convenient solution.

Also I urge not to reopen the rest of the discussion about the allowed ranges. We have found a solution that allows unquoted keys in essentially any script, without burdening implementors with too much complexity. That's good, so we should just keep it that way!

arp242 commented 1 year ago

ZWJ is used in many scripts to create valid characters, glyphs and words. It’s not exclusive to emojis. I don’t think having it is an issue. More the opposite. The whole idea here is to be inclusive wrt languages and scripts. The side effect of this approach is that some emojis also work, because they are in codepoint ranges not explicitly excluded, mainly because these ranges weren’t assigned to in older versions of Unicode.

You're right, I should have addressed that. TR31 has quite a bit of special handling for it, and the way I read it even allows excluding it. Go and C# outright disallow using it.

I can see how allowing ZWJ makes sense. It's not entirely clear to me if it's needed to correctly write these languages though, or if it's optional. My thinking is "better to include too little and correct that if need be".

Variation selectors are still an issue though. I don't think there's any good reason to include them, they are commonly inserted, and very invisible.

And combining characters introduce a lot of ambiguity in string equivalence, as brought up in #941. The more I think about it, the more I feel we should do our best to reduce the potential for ambiguity, as this would at least reduce the potential for confusion, and the need for NFC normalisation and a Unicode library (like ICU). Perhaps we can't eliminate it entirely, but just covering the common cases would already go a long way.
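
A small Go illustration of that ambiguity: the precomposed "é" (U+00E9) and "e" followed by a combining acute accent (U+0301) usually render identically but are different strings, so a parser that stores keys without normalising them ends up with two entries. `storeKeys` is a hypothetical stand-in for a parser's key table.

```go
package main

import "fmt"

// storeKeys mimics a parser that stores keys exactly as written, without
// any Unicode normalisation. The value is the position of the key in the
// input.
func storeKeys(keys ...string) map[string]int {
	table := make(map[string]int)
	for i, k := range keys {
		table[k] = i
	}
	return table
}

func main() {
	precomposed := "caf\u00E9" // é as a single code point
	decomposed := "cafe\u0301" // e + COMBINING ACUTE ACCENT

	fmt.Println(precomposed == decomposed)
	fmt.Println(len(storeKeys(precomposed, decomposed)))
	fmt.Printf("%+q\n%+q\n", precomposed, decomposed)
}
```

Making the two compare equal would need NFC normalisation, which is not in Go's standard library (the `golang.org/x/text/unicode/norm` package provides it); that is the extra dependency referred to above.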

Also I urge not to reopen the rest of the discussion about the allowed ranges. We have found a solution that allows unquoted keys in essentially any script, without burdening implementors with too much complexity. That's good, so we should just keep it that way!

None of these specific issues were brought up before, as far as I've seen. You can disagree it's an issue, and that is of course fine, but I'd never dismiss anyone like that.

ChristianSi commented 1 year ago

@arp242: Unicode normalization issues are irrelevant here, since they apply to quoted and unquoted keys in exactly the same way. In quoted keys, arbitrary Unicode is allowed and, of course, that's not going away – in fact, it can't go away since that would break backward compatibility.

And let's not roll back on our promise that "you can use unquoted keys representing words from arbitrary languages", which we have realized in the current state. In this regard, I found "Some trivial knowledge about Unicode" a good read. My take from there: variation selectors are needed, at least, to write Mongolian correctly.

In that article, the usage of ZWJ is mostly limited to emojis, but from Wikipedia I get that it is needed to render text in various scripts (e.g. Arabic or Indic) correctly. You're right that the text will likely still be readable without this information (I guess?) but it'll look "broken" to people. Also relevant is that text editors will likely auto-insert these ZWJs where needed. So when we tell people "you can use bare keys in Arabic script, but only without ZWJs", this might well cause all kinds of parsing errors, since people will have a hard time writing keys in these scripts without this character appearing.

So yes, while you're right that it makes sense to discuss whether this character and the variation selector code block should remain included, I'd still tend to say that yes, they should.

abelbraaksma commented 1 year ago

I agree, they should. Better to be liberal in what you accept, esp when it comes to scripts in Unicode.

Go and C# outright disallow using it.

Wrt C#, this is only partially true. Identifiers in Common IL can be any codepoint, except a small handful, like NULL and FFEF, I believe. In F#, this rule is applied very liberally, and you can create identifiers in the full range Common IL allows. In C#, calling such identifiers requires a little extra work, but is still possible.

Let's not start limiting more. Either expand the ranges, or leave it as is. From the discussion above, I think the conclusion would lean towards inclusion of the extra range, as mentioned here: https://github.com/toml-lang/toml/issues/954#issuecomment-1399252339

arp242 commented 1 year ago

Unicode normalization issues are irrelevant here, since they apply to quoted and unquoted keys in exactly the same way

That is correct, but as I mentioned before quoted keys aren't used all that much, so practically it's much less of an issue with quoted keys. That we need to be a bit more careful with bare keys is not controversial, otherwise we would just allow everything except [].="'.

Also relevant is that text editors will likely auto-insert these ZWJs where needed. So when we tell people "you can use bare keys in Arabic script, but only without ZWJs", this might well cause all kind of parsing errors, since people will have a hard time writing keys in these scripts without this character appearing.

Yeah, maybe; it's really hard for me to judge to what degree it's "needed" and "commonly used" and to what degree it's "a feature offered, but not commonly used".

I loaded the Arabic Wikipedia article on Mars (just a random featured/long article), and it seems to contain only a single ZWJ (in the title of the Simple English version). I checked a few other random articles and some pages with a lot of text on https://www.my.gov.sa after switching the language to Arabic, and can't find ZWJ there either. Go certainly doesn't allow ZWJ (I tested that), and I can't find any complaints on the issue tracker, mailing list, or anywhere else.
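
For anyone who wants to repeat this kind of spot check on their own text, counting occurrences of a code point is a few lines of Go. This is only a sketch of the idea, not the method actually used for the numbers above; `countRune` is a hypothetical helper and the sample string is made up.

```go
package main

import "fmt"

// countRune returns how many times target occurs in text.
func countRune(text string, target rune) int {
	n := 0
	for _, r := range text {
		if r == target {
			n++
		}
	}
	return n
}

func main() {
	const zwj = '\u200D'
	const vs16 = '\uFE0F'

	// Made-up sample: two Arabic letters joined by an explicit ZWJ,
	// followed by the pirate flag sequence.
	sample := "\u0644\u200D\u0627 \U0001F3F4\u200D\u2620\uFE0F"

	fmt.Println("ZWJ:", countRune(sample, zwj))
	fmt.Println("VS16:", countRune(sample, vs16))
}
```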

That said, Go != TOML and the context for both is different, and TR-31 contains special rules for handling ZWJ, and I suppose this is an argument for both sides here: "ZWJ is needed in some contexts, so it must be allowed" as well as "ZWJ can be confusing, so we need to restrict where it can appear".

In conclusion: further research needed if it's decided to spend effort on this in the first place. The same applies to variation selectors.

Better to be liberal in what you accept, esp when it comes to scripts in Unicode.

Yeah, I don't really agree with that. "Postel's law" has been widely criticized over the years and I'm hardly the first/only to disagree with it; I'd say it's fair to state that it's pretty controversial overall. It was framed in a very different context, and in a very different world; historically it made a bit more sense due to standards often being written up after the implementations, unclear/underspecified standards, standards being harder to actually read so many didn't, "cowboy coding" being the norm, etc. Much of that applies a lot less today, and IMHO it doesn't apply to TOML or Unicode.

But my main issue with this is: it doesn't really engage with my concern, which can be summarized as "I feel this has the potential to cause a great deal of confusion, so I think it's better to be conservative initially, and perhaps correct it later if need be". If you want to say "I don't think people will end up being confused" or "I think it's an okay trade-off that people will get confused" then fair enough, as that engages with the stated concerns. But this doesn't really.

From the discussion above, I think the conclusion would lean towards inclusion of the extra range

To be honest, I'd really like some other views on this as well; thus far only three people commented on this.

I realize you might think I'm stubborn and difficult here, but I promise you I'm really not trying to be. I spent a lot of time looking at this over the last few days (which also included considering "is it really worth everyone's time and energy banging on about this?"), and I think this has a huge potential to bite us and people using TOML in the ass. If it was only a matter of "I think doing it like this is nicer" or "I don't like it" I wouldn't have cared too much; I don't like to bikeshed over details and generally "whatever works as long as it's not completely atrocious" is fine with me.

Either way, I probably said everything I wanted to say, so I'll leave it at that for a while, giving other people a chance to catch up, reply, vote, etc.

arp242 commented 1 year ago

Posting as a separate comment for votes, I think the core questions are essentially:

  1. Do we want to spend effort to reduce the potential for confusion?
  2. If so, how?

Point "2" has a lot of subpoints, but if the answer to "1" is a "no" then it's pointless to even discuss it.

People can vote on this comment (not using thumbs to avoid ambiguity):

abelbraaksma commented 1 year ago

I feel this has the potential to cause a great deal of confusion, so I think it's better to be conservative initially, and perhaps correct it later if need be

Yeah, but that sword has two edges: it’s similarly confusing if certain names cannot be expressed. If someone wants to use ZWJ, they will typically know what they’re doing. The majority of people will stay away from it, simply because it’s never come up with naming identifiers.

Quote:

The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. Sometimes the Roman script is to be counted as complex, e.g. when using a Fraktur typeface. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

epage commented 5 months ago

If this is one of the remaining blockers for 1.1.0-rc0, what if we instead defer bare keys to 1.2?

eksortso commented 5 months ago

At this point, I'm inclined to agree, and to make extending the allowable bare keys a primary objective for the future TOML 1.2.0. No offense to all the hard work put forward to make this viable, but while @pradyunsg is still MIA, we should slim down for now to get him back here for a while, work out how to proceed with the standards project, and get 1.1.0-rc1 out the door.

So let's save this issue, and all the other open issues regarding the extension of bare keys, for after 1.1.0 is released, then hit it full-bore with the best solution we can devise, with a scheduled date for release and a dedicated core team surrounding the standard and dealing with day-to-day matters.

Sorry @ChristianSi, I know this is a bitter pill to swallow, but we've waited too long, and we know what else needs done right now.

ChristianSi commented 5 months ago

I don't really see any serious blockers, neither this nor anything else. But, as @eksortso has already mentioned, there hasn't been a working maintainer for the last few months (at least), so the project is effectively stuck.

@eksortso If you have an idea on how to solve this, I'd be interested to hear it!

abelbraaksma commented 2 months ago

@ChristianSi wasn't there an issue not so long ago to assign a new maintainer? @pradyunsg, would you allow another maintainer to the team?