Closed marzer closed 2 years ago
I wholeheartedly support this. Identifiers in non-English languages should not be discriminated against in TOML (and having to put them in quotes is a form of discrimination). [Note: the following refers to the original form of the proposal, which since then has been considerably extended.] Admittedly, with this proposal this would still only be the case for languages using the Latin script, but not for Russian, Arabic, Chinese etc. But it would still be a step in the right direction – and I see that there might be issues with allowing, say, Cyrillic letters, because they might be used to spoof a key that looks like ASCII but actually isn't. With Latin diacritics, this risk is much lower.
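To make the spoofing concern concrete, here is a small sketch (not part of the proposal) that uses Python's standard `unicodedata` module to show that a Cyrillic letter can look like an ASCII one while being a different code point. The helper name `describe` is purely illustrative.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


latin_key = "path"
spoofed_key = "p\u0430th"  # second letter is CYRILLIC SMALL LETTER A

# The two keys render almost identically in most fonts but compare unequal.
print(latin_key == spoofed_key)
print(describe(latin_key[1]))
print(describe(spoofed_key[1]))
```

A parser would treat these as two distinct keys, which is why the thread considers Latin diacritics lower risk than look-alike letters from other scripts.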
For completeness, I'd suggest also supporting Latin Extended-B (Pan-Nigerian alphabet, Pinyin, Romanian), Latin Extended Additional (Vietnamese), and Latin Extended-C (Shona, a Bantu language).
So there would be no simple rule for a human writer to decide if quotes are necessary.
Even good old C++ supports unicode characters in identifiers on modern compilers.
Do "modern C++ compilers" support most of Unicode, or only chosen subset of Latin Extended?
@ChristianSi Certainly there's lots of additional characters we could add. My list of suggestions was in no way meant to be exhaustive and I'm hoping that a more useful set of ranges is borne out of discussion. Being an Australian who only speaks English does limit my perspective a bit here!
@lmna
So there would be no simple rule for a human writer to decide if quotes are necessary.
Technically-speaking the rule could be: quotes if you need whitespace, an escape code, or a TOML-reserved character, otherwise anything goes. My feeling is that requiring users to think about this at all is getting away from the design goals of TOML and drifting too far into Think-Like-A-Programmer territory, which risks defeating the purpose of a simple config file format that intends to "just work" the way people would expect in the layman case.
Do "modern C++ compilers" support most of Unicode, or only chosen subset of Latin Extended?
Absolutely no idea. Here's a live demo of some Unicode on Clang, GCC and MSVC; I encourage you to experiment. I'm certain if we needed a more definitive answer we could read the compiler source code (Clang or GCC) or ask the relevant developers (MSVC).
I'm on board.
I like the way Python handles alphabetical characters:
Alphabetic characters are those characters defined in the Unicode character database as “Letter”
That should be doable for us, but I wonder how much it could complicate TOML implementations which are currently just doing a simple ASCII-value-range check. This also means we'd have to add a lot to TOML's ABNF.
@pradyunsg: I too would be fine with saying "Bare keys may contain arbitrary Unicode letters as well as ASCII digits, underscores, and dashes". But in effect this would likely mean that implementations would have to depend on some kind of Unicode library – easy in Python, where the isalpha() check is built in, probably not so easy in some other languages.
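As a rough illustration of how little code the Python route would need, here is a minimal sketch of the rule quoted above ("arbitrary Unicode letters as well as ASCII digits, underscores, and dashes"). The function names are hypothetical and this is not taken from any existing TOML library.

```python
def is_bare_key_char(ch: str) -> bool:
    """Check one character against the rule quoted in the comment above."""
    # str.isalpha() is True for the Unicode "Letter" categories
    # (Lu, Ll, Lt, Lm, Lo), so no hand-written range table is needed.
    if ch.isalpha():
        return True
    # Digits are restricted to ASCII in this variant of the rule.
    if ch.isascii() and ch.isdigit():
        return True
    return ch in ("_", "-")


def is_bare_key(key: str) -> bool:
    """A bare key must be non-empty and consist only of allowed characters."""
    return bool(key) and all(is_bare_key_char(ch) for ch in key)
```

For example, `is_bare_key("straße")` and `is_bare_key("ключ")` hold under this rule, while a key containing a space or a dot would still need quotes.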
An advantage of @marzer 's original proposal, or my somewhat modified one, is that it would be easy to enumerate the affected ranges manually, in the ABNF or in code. With arbitrary Unicode letters this likely becomes effectively impossible – especially since the ranges would have to be extended with each new version of the Unicode standard.
But, of course, we might also decide to allow letters in arbitrary languages and scripts, not just Latin-based ones, and accept the Unicode dependency.
Here's a listing of all the Unicode Character Categories and all the characters that belong to each one. Unsurprisingly, it's quite long!
@ChristianSi Given that TOML is supposed to be UTF-8, I'm inclined to think that requiring implementations to use Unicode machinery, hand-rolled or otherwise, isn't really a big deal, regardless of the direction this proposal takes.
@pradyunsg I too like the python approach, and I don't think it would be too hard to implement. I'm currently writing a TOML library of my own and I'd be happy to build it into my utf-8 decoder as a proof-of-concept, if that's useful.
@ChristianSi I wrote a script to scrape letter characters from the website you linked, sort them and list them as ranges. If you omit the letter categories Lm and Lo the set of character ranges seems totally manageable:
removed, see below
@pradyunsg I don't know much about ABNF, but if characters can be expressed as ranges this wouldn't be much work.
@marzer Interesting. But the problem is that nearly all Unicode letters are in the "Letter, other" (Lo) category – 97 percent according to Wikipedia. Ignoring them, you only get the letters in alphabets that distinguish between upper case and lower case forms – Latin, Cyrillic, Greek, and a few others. But most writing systems don't – e.g. those used to write Chinese, Arabic, Hebrew, Korean, and certain Indian languages such as Tamil and Telugu know no such distinction. Hence their letters go into the "other" category.
What happens when you consider all letters? I suppose ranges become a bit unwieldy?
@marzer Ooo-kay. But it seems that website is outdated or incomplete. It only lists 16249 Other Letters, while Wikipedia says there should be 121414. This list seems complete.
Moreover, you should also add the Lm (Letter, modifier) category – there are only 259 of them.
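The category sizes discussed here can be checked locally. The following sketch counts code points per Letter category using Python's standard `unicodedata` module. The exact numbers depend on the Unicode version bundled with the interpreter (reported by `unicodedata.unidata_version`), so they may differ from the figures quoted above.

```python
import sys
import unicodedata
from collections import Counter

LETTER_CATEGORIES = ("Lu", "Ll", "Lt", "Lm", "Lo")


def count_letters() -> Counter:
    """Count code points in each Unicode Letter category."""
    counts: Counter = Counter()
    for cp in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(cp))
        if category in LETTER_CATEGORIES:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    counts = count_letters()
    total = sum(counts.values())
    print(f"Unicode {unicodedata.unidata_version}: {total} letters")
    for category in LETTER_CATEGORIES:
        share = counts[category] / total
        print(f"  {category}: {counts[category]} ({share:.1%})")
```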
@ChristianSi Ah, good find. I'll update the script later tonight and see how it looks.
Thanks for exploring this @ChristianSi and @marzer! ^>^
Marking this as a post-1.0 change, since I imagine this relaxation would not make any valid documents invalid -- thus, we can augment this in a non-major version bump.
@ChristianSi Ok, I updated the script to scrape directly from the unicode consortium's character database and amended it to include all of the letter characters, and it looks like this:
Added 125634 codepoints from 5 categories.
Ranges:
0x41 => 0x5A
0x61 => 0x7A
0xAA
0xB5
0xBA
0xC0 => 0xD6
0xD8 => 0xF6
0xF8 => 0x2C1
0x2C6 => 0x2D1
0x2E0 => 0x2E4
0x2EC
0x2EE
0x370 => 0x374
0x376 => 0x377
0x37A => 0x37D
0x37F
0x386
0x388 => 0x38A
0x38C
0x38E => 0x3A1
0x3A3 => 0x3F5
0x3F7 => 0x481
0x48A => 0x52F
0x531 => 0x556
0x559
0x560 => 0x588
0x5D0 => 0x5EA
0x5EF => 0x5F2
0x620 => 0x64A
0x66E => 0x66F
0x671 => 0x6D3
0x6D5
0x6E5 => 0x6E6
0x6EE => 0x6EF
0x6FA => 0x6FC
0x6FF
0x710
0x712 => 0x72F
0x74D => 0x7A5
0x7B1
0x7CA => 0x7EA
0x7F4 => 0x7F5
0x7FA
0x800 => 0x815
0x81A
0x824
0x828
0x840 => 0x858
0x860 => 0x86A
0x8A0 => 0x8B4
0x8B6 => 0x8BD
0x904 => 0x939
0x93D
0x950
0x958 => 0x961
0x971 => 0x980
0x985 => 0x98C
0x98F => 0x990
0x993 => 0x9A8
0x9AA => 0x9B0
0x9B2
0x9B6 => 0x9B9
0x9BD
0x9CE
0x9DC => 0x9DD
0x9DF => 0x9E1
0x9F0 => 0x9F1
0x9FC
0xA05 => 0xA0A
0xA0F => 0xA10
0xA13 => 0xA28
0xA2A => 0xA30
0xA32 => 0xA33
0xA35 => 0xA36
0xA38 => 0xA39
0xA59 => 0xA5C
0xA5E
0xA72 => 0xA74
0xA85 => 0xA8D
0xA8F => 0xA91
0xA93 => 0xAA8
0xAAA => 0xAB0
0xAB2 => 0xAB3
0xAB5 => 0xAB9
0xABD
0xAD0
0xAE0 => 0xAE1
0xAF9
0xB05 => 0xB0C
0xB0F => 0xB10
0xB13 => 0xB28
0xB2A => 0xB30
0xB32 => 0xB33
0xB35 => 0xB39
0xB3D
0xB5C => 0xB5D
0xB5F => 0xB61
0xB71
0xB83
0xB85 => 0xB8A
0xB8E => 0xB90
0xB92 => 0xB95
0xB99 => 0xB9A
0xB9C
0xB9E => 0xB9F
0xBA3 => 0xBA4
0xBA8 => 0xBAA
0xBAE => 0xBB9
0xBD0
0xC05 => 0xC0C
0xC0E => 0xC10
0xC12 => 0xC28
0xC2A => 0xC39
0xC3D
0xC58 => 0xC5A
0xC60 => 0xC61
0xC80
0xC85 => 0xC8C
0xC8E => 0xC90
0xC92 => 0xCA8
0xCAA => 0xCB3
0xCB5 => 0xCB9
0xCBD
0xCDE
0xCE0 => 0xCE1
0xCF1 => 0xCF2
0xD05 => 0xD0C
0xD0E => 0xD10
0xD12 => 0xD3A
0xD3D
0xD4E
0xD54 => 0xD56
0xD5F => 0xD61
0xD7A => 0xD7F
0xD85 => 0xD96
0xD9A => 0xDB1
0xDB3 => 0xDBB
0xDBD
0xDC0 => 0xDC6
0xE01 => 0xE30
0xE32 => 0xE33
0xE40 => 0xE46
0xE81 => 0xE82
0xE84
0xE86 => 0xE8A
0xE8C => 0xEA3
0xEA5
0xEA7 => 0xEB0
0xEB2 => 0xEB3
0xEBD
0xEC0 => 0xEC4
0xEC6
0xEDC => 0xEDF
0xF00
0xF40 => 0xF47
0xF49 => 0xF6C
0xF88 => 0xF8C
0x1000 => 0x102A
0x103F
0x1050 => 0x1055
0x105A => 0x105D
0x1061
0x1065 => 0x1066
0x106E => 0x1070
0x1075 => 0x1081
0x108E
0x10A0 => 0x10C5
0x10C7
0x10CD
0x10D0 => 0x10FA
0x10FC => 0x1248
0x124A => 0x124D
0x1250 => 0x1256
0x1258
0x125A => 0x125D
0x1260 => 0x1288
0x128A => 0x128D
0x1290 => 0x12B0
0x12B2 => 0x12B5
0x12B8 => 0x12BE
0x12C0
0x12C2 => 0x12C5
0x12C8 => 0x12D6
0x12D8 => 0x1310
0x1312 => 0x1315
0x1318 => 0x135A
0x1380 => 0x138F
0x13A0 => 0x13F5
0x13F8 => 0x13FD
0x1401 => 0x166C
0x166F => 0x167F
0x1681 => 0x169A
0x16A0 => 0x16EA
0x16F1 => 0x16F8
0x1700 => 0x170C
0x170E => 0x1711
0x1720 => 0x1731
0x1740 => 0x1751
0x1760 => 0x176C
0x176E => 0x1770
0x1780 => 0x17B3
0x17D7
0x17DC
0x1820 => 0x1878
0x1880 => 0x1884
0x1887 => 0x18A8
0x18AA
0x18B0 => 0x18F5
0x1900 => 0x191E
0x1950 => 0x196D
0x1970 => 0x1974
0x1980 => 0x19AB
0x19B0 => 0x19C9
0x1A00 => 0x1A16
0x1A20 => 0x1A54
0x1AA7
0x1B05 => 0x1B33
0x1B45 => 0x1B4B
0x1B83 => 0x1BA0
0x1BAE => 0x1BAF
0x1BBA => 0x1BE5
0x1C00 => 0x1C23
0x1C4D => 0x1C4F
0x1C5A => 0x1C7D
0x1C80 => 0x1C88
0x1C90 => 0x1CBA
0x1CBD => 0x1CBF
0x1CE9 => 0x1CEC
0x1CEE => 0x1CF3
0x1CF5 => 0x1CF6
0x1CFA
0x1D00 => 0x1DBF
0x1E00 => 0x1F15
0x1F18 => 0x1F1D
0x1F20 => 0x1F45
0x1F48 => 0x1F4D
0x1F50 => 0x1F57
0x1F59
0x1F5B
0x1F5D
0x1F5F => 0x1F7D
0x1F80 => 0x1FB4
0x1FB6 => 0x1FBC
0x1FBE
0x1FC2 => 0x1FC4
0x1FC6 => 0x1FCC
0x1FD0 => 0x1FD3
0x1FD6 => 0x1FDB
0x1FE0 => 0x1FEC
0x1FF2 => 0x1FF4
0x1FF6 => 0x1FFC
0x2071
0x207F
0x2090 => 0x209C
0x2102
0x2107
0x210A => 0x2113
0x2115
0x2119 => 0x211D
0x2124
0x2126
0x2128
0x212A => 0x212D
0x212F => 0x2139
0x213C => 0x213F
0x2145 => 0x2149
0x214E
0x2183 => 0x2184
0x2C00 => 0x2C2E
0x2C30 => 0x2C5E
0x2C60 => 0x2CE4
0x2CEB => 0x2CEE
0x2CF2 => 0x2CF3
0x2D00 => 0x2D25
0x2D27
0x2D2D
0x2D30 => 0x2D67
0x2D6F
0x2D80 => 0x2D96
0x2DA0 => 0x2DA6
0x2DA8 => 0x2DAE
0x2DB0 => 0x2DB6
0x2DB8 => 0x2DBE
0x2DC0 => 0x2DC6
0x2DC8 => 0x2DCE
0x2DD0 => 0x2DD6
0x2DD8 => 0x2DDE
0x2E2F
0x3005 => 0x3006
0x3031 => 0x3035
0x303B => 0x303C
0x3041 => 0x3096
0x309D => 0x309F
0x30A1 => 0x30FA
0x30FC => 0x30FF
0x3105 => 0x312F
0x3131 => 0x318E
0x31A0 => 0x31BA
0x31F0 => 0x31FF
0x3400 => 0x4DB4
0x4E00 => 0x9FEE
0xA000 => 0xA48C
0xA4D0 => 0xA4FD
0xA500 => 0xA60C
0xA610 => 0xA61F
0xA62A => 0xA62B
0xA640 => 0xA66E
0xA67F => 0xA69D
0xA6A0 => 0xA6E5
0xA717 => 0xA71F
0xA722 => 0xA788
0xA78B => 0xA7BF
0xA7C2 => 0xA7C6
0xA7F7 => 0xA801
0xA803 => 0xA805
0xA807 => 0xA80A
0xA80C => 0xA822
0xA840 => 0xA873
0xA882 => 0xA8B3
0xA8F2 => 0xA8F7
0xA8FB
0xA8FD => 0xA8FE
0xA90A => 0xA925
0xA930 => 0xA946
0xA960 => 0xA97C
0xA984 => 0xA9B2
0xA9CF
0xA9E0 => 0xA9E4
0xA9E6 => 0xA9EF
0xA9FA => 0xA9FE
0xAA00 => 0xAA28
0xAA40 => 0xAA42
0xAA44 => 0xAA4B
0xAA60 => 0xAA76
0xAA7A
0xAA7E => 0xAAAF
0xAAB1
0xAAB5 => 0xAAB6
0xAAB9 => 0xAABD
0xAAC0
0xAAC2
0xAADB => 0xAADD
0xAAE0 => 0xAAEA
0xAAF2 => 0xAAF4
0xAB01 => 0xAB06
0xAB09 => 0xAB0E
0xAB11 => 0xAB16
0xAB20 => 0xAB26
0xAB28 => 0xAB2E
0xAB30 => 0xAB5A
0xAB5C => 0xAB67
0xAB70 => 0xABE2
0xAC00 => 0xD7A2
0xD7B0 => 0xD7C6
0xD7CB => 0xD7FB
0xF900 => 0xFA6D
0xFA70 => 0xFAD9
0xFB00 => 0xFB06
0xFB13 => 0xFB17
0xFB1D
0xFB1F => 0xFB28
0xFB2A => 0xFB36
0xFB38 => 0xFB3C
0xFB3E
0xFB40 => 0xFB41
0xFB43 => 0xFB44
0xFB46 => 0xFBB1
0xFBD3 => 0xFD3D
0xFD50 => 0xFD8F
0xFD92 => 0xFDC7
0xFDF0 => 0xFDFB
0xFE70 => 0xFE74
0xFE76 => 0xFEFC
0xFF21 => 0xFF3A
0xFF41 => 0xFF5A
0xFF66 => 0xFFBE
0xFFC2 => 0xFFC7
0xFFCA => 0xFFCF
0xFFD2 => 0xFFD7
0xFFDA => 0xFFDC
0x10000 => 0x1000B
0x1000D => 0x10026
0x10028 => 0x1003A
0x1003C => 0x1003D
0x1003F => 0x1004D
0x10050 => 0x1005D
0x10080 => 0x100FA
0x10280 => 0x1029C
0x102A0 => 0x102D0
0x10300 => 0x1031F
0x1032D => 0x10340
0x10342 => 0x10349
0x10350 => 0x10375
0x10380 => 0x1039D
0x103A0 => 0x103C3
0x103C8 => 0x103CF
0x10400 => 0x1049D
0x104B0 => 0x104D3
0x104D8 => 0x104FB
0x10500 => 0x10527
0x10530 => 0x10563
0x10600 => 0x10736
0x10740 => 0x10755
0x10760 => 0x10767
0x10800 => 0x10805
0x10808
0x1080A => 0x10835
0x10837 => 0x10838
0x1083C
0x1083F => 0x10855
0x10860 => 0x10876
0x10880 => 0x1089E
0x108E0 => 0x108F2
0x108F4 => 0x108F5
0x10900 => 0x10915
0x10920 => 0x10939
0x10980 => 0x109B7
0x109BE => 0x109BF
0x10A00
0x10A10 => 0x10A13
0x10A15 => 0x10A17
0x10A19 => 0x10A35
0x10A60 => 0x10A7C
0x10A80 => 0x10A9C
0x10AC0 => 0x10AC7
0x10AC9 => 0x10AE4
0x10B00 => 0x10B35
0x10B40 => 0x10B55
0x10B60 => 0x10B72
0x10B80 => 0x10B91
0x10C00 => 0x10C48
0x10C80 => 0x10CB2
0x10CC0 => 0x10CF2
0x10D00 => 0x10D23
0x10F00 => 0x10F1C
0x10F27
0x10F30 => 0x10F45
0x10FE0 => 0x10FF6
0x11003 => 0x11037
0x11083 => 0x110AF
0x110D0 => 0x110E8
0x11103 => 0x11126
0x11144
0x11150 => 0x11172
0x11176
0x11183 => 0x111B2
0x111C1 => 0x111C4
0x111DA
0x111DC
0x11200 => 0x11211
0x11213 => 0x1122B
0x11280 => 0x11286
0x11288
0x1128A => 0x1128D
0x1128F => 0x1129D
0x1129F => 0x112A8
0x112B0 => 0x112DE
0x11305 => 0x1130C
0x1130F => 0x11310
0x11313 => 0x11328
0x1132A => 0x11330
0x11332 => 0x11333
0x11335 => 0x11339
0x1133D
0x11350
0x1135D => 0x11361
0x11400 => 0x11434
0x11447 => 0x1144A
0x1145F
0x11480 => 0x114AF
0x114C4 => 0x114C5
0x114C7
0x11580 => 0x115AE
0x115D8 => 0x115DB
0x11600 => 0x1162F
0x11644
0x11680 => 0x116AA
0x116B8
0x11700 => 0x1171A
0x11800 => 0x1182B
0x118A0 => 0x118DF
0x118FF
0x119A0 => 0x119A7
0x119AA => 0x119D0
0x119E1
0x119E3
0x11A00
0x11A0B => 0x11A32
0x11A3A
0x11A50
0x11A5C => 0x11A89
0x11A9D
0x11AC0 => 0x11AF8
0x11C00 => 0x11C08
0x11C0A => 0x11C2E
0x11C40
0x11C72 => 0x11C8F
0x11D00 => 0x11D06
0x11D08 => 0x11D09
0x11D0B => 0x11D30
0x11D46
0x11D60 => 0x11D65
0x11D67 => 0x11D68
0x11D6A => 0x11D89
0x11D98
0x11EE0 => 0x11EF2
0x12000 => 0x12399
0x12480 => 0x12543
0x13000 => 0x1342E
0x14400 => 0x14646
0x16800 => 0x16A38
0x16A40 => 0x16A5E
0x16AD0 => 0x16AED
0x16B00 => 0x16B2F
0x16B40 => 0x16B43
0x16B63 => 0x16B77
0x16B7D => 0x16B8F
0x16E40 => 0x16E7F
0x16F00 => 0x16F4A
0x16F50
0x16F93 => 0x16F9F
0x16FE0 => 0x16FE1
0x16FE3
0x17000 => 0x187F6
0x18800 => 0x18AF2
0x1B000 => 0x1B11E
0x1B150 => 0x1B152
0x1B164 => 0x1B167
0x1B170 => 0x1B2FB
0x1BC00 => 0x1BC6A
0x1BC70 => 0x1BC7C
0x1BC80 => 0x1BC88
0x1BC90 => 0x1BC99
0x1D400 => 0x1D454
0x1D456 => 0x1D49C
0x1D49E => 0x1D49F
0x1D4A2
0x1D4A5 => 0x1D4A6
0x1D4A9 => 0x1D4AC
0x1D4AE => 0x1D4B9
0x1D4BB
0x1D4BD => 0x1D4C3
0x1D4C5 => 0x1D505
0x1D507 => 0x1D50A
0x1D50D => 0x1D514
0x1D516 => 0x1D51C
0x1D51E => 0x1D539
0x1D53B => 0x1D53E
0x1D540 => 0x1D544
0x1D546
0x1D54A => 0x1D550
0x1D552 => 0x1D6A5
0x1D6A8 => 0x1D6C0
0x1D6C2 => 0x1D6DA
0x1D6DC => 0x1D6FA
0x1D6FC => 0x1D714
0x1D716 => 0x1D734
0x1D736 => 0x1D74E
0x1D750 => 0x1D76E
0x1D770 => 0x1D788
0x1D78A => 0x1D7A8
0x1D7AA => 0x1D7C2
0x1D7C4 => 0x1D7CB
0x1E100 => 0x1E12C
0x1E137 => 0x1E13D
0x1E14E
0x1E2C0 => 0x1E2EB
0x1E800 => 0x1E8C4
0x1E900 => 0x1E943
0x1E94B
0x1EE00 => 0x1EE03
0x1EE05 => 0x1EE1F
0x1EE21 => 0x1EE22
0x1EE24
0x1EE27
0x1EE29 => 0x1EE32
0x1EE34 => 0x1EE37
0x1EE39
0x1EE3B
0x1EE42
0x1EE47
0x1EE49
0x1EE4B
0x1EE4D => 0x1EE4F
0x1EE51 => 0x1EE52
0x1EE54
0x1EE57
0x1EE59
0x1EE5B
0x1EE5D
0x1EE5F
0x1EE61 => 0x1EE62
0x1EE64
0x1EE67 => 0x1EE6A
0x1EE6C => 0x1EE72
0x1EE74 => 0x1EE77
0x1EE79 => 0x1EE7C
0x1EE7E
0x1EE80 => 0x1EE89
0x1EE8B => 0x1EE9B
0x1EEA1 => 0x1EEA3
0x1EEA5 => 0x1EEA9
0x1EEAB => 0x1EEBB
0x20000 => 0x2A6D5
0x2A700 => 0x2B733
0x2B740 => 0x2B81C
0x2B820 => 0x2CEA0
0x2CEB0 => 0x2EBDF
0x2F800 => 0x2FA1D
Not really any worse than before, even considering it's 125634 characters.
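The script itself is not shown in the thread. As a rough sketch of the same idea, the following collapses all code points from the five Letter categories into inclusive ranges and prints them in the format used above. It reads from Python's bundled `unicodedata` module rather than scraping the character database, so the result reflects whichever Unicode version the interpreter ships with. All function names are illustrative.

```python
import sys
import unicodedata
from typing import Iterable, List, Tuple

Range = Tuple[int, int]


def to_ranges(codepoints: Iterable[int]) -> List[Range]:
    """Collapse code points into sorted, inclusive (first, last) runs."""
    ranges: List[Range] = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current run by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def codepoints_in(categories: Iterable[str]) -> List[int]:
    """List every code point whose general category is in `categories`."""
    wanted = set(categories)
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) in wanted]


def format_range(first: int, last: int) -> str:
    """Render a run the way the listing above does."""
    return f"0x{first:X}" if first == last else f"0x{first:X} => 0x{last:X}"


if __name__ == "__main__":
    letters = codepoints_in(["Ll", "Lm", "Lo", "Lt", "Lu"])
    print(f"Added {len(letters)} codepoints from 5 categories.")
    for first, last in to_ranges(letters):
        print(format_range(first, last))
```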
@marzer With what you provided, a PR could be prepared fairly quickly. Could you write that list similarly to how ucschar is written in RFC 3987? You don't need to wrap it; we can do that. But instead of e.g. 0x2F800 => 0x2FA1D, can you write %x2F800-2FA1D instead?
For reference, the part of RFC 3987 I'm referring to looks like this:
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
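The requested conversion is mechanical. Here is a small sketch, assuming the input lines look exactly like the listing earlier in the thread (either `0xAA` or `0x41 => 0x5A`). The function names and the wrapping width are illustrative choices, with continuation lines led by `/` as in the RFC 3987 excerpt.

```python
from typing import List


def to_abnf_term(line: str) -> str:
    """Convert '0x2F800 => 0x2FA1D' (or a single '0xAA') to an ABNF term."""
    # int(..., 16) accepts the '0x' prefix and surrounding whitespace.
    bounds = [int(part, 16) for part in line.split("=>")]
    term = f"%x{bounds[0]:X}"
    if len(bounds) == 2:
        term += f"-{bounds[1]:X}"
    return term


def abnf_rule(name: str, lines: List[str], per_line: int = 3) -> str:
    """Join terms into one ABNF alternation, wrapped RFC 3987 style."""
    terms = [to_abnf_term(line) for line in lines]
    rows = [" / ".join(terms[i:i + per_line])
            for i in range(0, len(terms), per_line)]
    # Continuation lines are indented so that '/' sits under the '='.
    continuation = "\n" + " " * (len(name) + 1) + "/ "
    return f"{name} = " + continuation.join(rows)


print(abnf_rule("letters", ["0x41 => 0x5A", "0x61 => 0x7A", "0xAA", "0xB5"]))
```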
Thanks a lot for your efforts, @marzer ! That looks good so far and indeed manageable, but there are a few complications we have missed so far. I looked at what Python 3 and JavaScript allow in identifiers.
In addition to the five Letter categories we have already, they allow "Letter Number" (Nl) anywhere in an identifier and "Decimal Number" (Nd) anywhere, except at the start. TOML already allows all-numeric keys (they never occur in a position where they can be confused with actual numbers) so I'd consider it reasonable to allow both these categories anywhere in a key – people using Bengali letters in keys might, for example, reasonably expect to be able to use Bengali digits as well. Together they comprise less than 900 characters, so adding them should be quite manageable.
The final Number category (Other Number – No) is not allowed in identifiers in either language.
Moreover, both languages allow anywhere, except at the start, "Nonspacing Mark" (Mn) and "Spacing Mark" (Mc). Now it's important to understand that in Unicode, Marks (Mx categories) are always combining characters – they become logically attached to the preceding character and modify it. For example, Mn contains the "combining grave accent" which goes over the preceding letter and modifies it; Mc contains various Bengali vowel signs which likewise modify the preceding (supposedly Bengali) letter.
Hence it seems indeed important that we support these two categories too, since they are necessary to write certain words in certain languages – without them, support for multilingual bare keys would be incomplete and people might get odd error messages. It's also important that we do NOT allow them at the start of a bare key, since otherwise they would try to modify the preceding non-key character (likely a newline, space, or [ or . in table names or dotted keys) which would be nonsensical and blur the boundary at the start of a key. Together these categories have about 2250 entries, which likewise is manageable.
Finally, both JS and Python allow, except at the start, Connector Punctuation (Pc). That's a very short category with just 10 entries, including the underscore, which we allow already. I don't have strong feelings regarding this category, but would rather tend NOT to allow it in bare keys – we already have underscores and dashes as connectors, and, for example, the Centreline Low Line (﹎) with a tiny dot in the middle could theoretically be confused with the dots that actually separate key elements in hierarchical table names.
So, to summarize, I'd propose to additionally allow Nl and Nd anywhere in a bare key, and Mn and Mc anywhere except as first character (or code point, to be more exact).
In the README, we could then say:
Bare keys may contain arbitrary Unicode letters and digits as well as ASCII underscores (_) and dashes (-). (Technically, code points belonging to the Unicode categories Ll, Lm, Lo, Lt, Lu, Nd and Nl are allowed anywhere in a bare key, and those belonging to the categories Mc and Mn are allowed anywhere except as first code point.)
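The proposed README wording translates fairly directly into a category check. The following is a minimal sketch built on Python's `unicodedata` module, not a reference implementation. The names `ANYWHERE`, `NOT_FIRST` and `is_valid_bare_key` are hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Categories allowed at any position in a bare key.
ANYWHERE = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl"}
# Combining marks: allowed anywhere except as the first code point.
NOT_FIRST = {"Mc", "Mn"}


def is_valid_bare_key(key: str) -> bool:
    """Check a key against the proposed bare-key rule."""
    if not key:
        return False
    for index, ch in enumerate(key):
        if ch in ("_", "-"):
            continue  # ASCII underscore and dash are always allowed
        category = unicodedata.category(ch)
        if category in ANYWHERE:
            continue
        if category in NOT_FIRST and index > 0:
            continue
        return False
    return True
```

Under this sketch, a key written in Bengali letters, vowel signs and digits is accepted, while a key that starts with a combining accent, or contains a space or a Connector Punctuation character other than the underscore, is rejected.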
@ChristianSi LGTM. This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.
Can the Mx codepoints appear consecutively? If not, we'd also need to clarify that codepoints from Mx categories cannot appear at the beginning of a key or immediately following another Mx codepoint.
@marzer:
This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.
Indeed!
Yes, consecutive Mx codepoints are allowed – they all modify the preceding letter, e.g. by placing an acute accent above and an ogonek below it. That's necessary for some languages, such as Navajo.
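A short sketch of the case described here: one base letter followed by two combining marks in a row, an ogonek below and an acute accent above. It only inspects the general categories with Python's `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

# "a" + COMBINING OGONEK (U+0328) + COMBINING ACUTE ACCENT (U+0301)
letter = "a\u0328\u0301"

categories = [unicodedata.category(ch) for ch in letter]

for ch, category in zip(letter, categories):
    print(f"U+{ord(ch):04X} {category} {unicodedata.name(ch)}")
```

Both marks are in category Mn and both attach to the same preceding letter, so a rule that forbade one mark directly after another would reject this sequence.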
Alright I've updated the issue text to better reflect the current state of the discussion, as well as including links to my proof-of-concept implementation. I've also updated the script to generate the ABNF notation for the three relevant 'super-categories' of codepoints, which generates this:
; unicode codepoints from categories Ll, Lm, Lo, Lt, Lu
letters = %x41-5A / %x61-7A / %xAA / %xB5 /
%xBA / %xC0-D6 / %xD8-F6 / %xF8-2C1 /
%x2C6-2D1 / %x2E0-2E4 / %x2EC / %x2EE /
%x370-374 / %x376-377 / %x37A-37D / %x37F /
%x386 / %x388-38A / %x38C / %x38E-3A1 /
%x3A3-3F5 / %x3F7-481 / %x48A-52F / %x531-556 /
%x559 / %x560-588 / %x5D0-5EA / %x5EF-5F2 /
%x620-64A / %x66E-66F / %x671-6D3 / %x6D5 /
%x6E5-6E6 / %x6EE-6EF / %x6FA-6FC / %x6FF /
%x710 / %x712-72F / %x74D-7A5 / %x7B1 /
%x7CA-7EA / %x7F4-7F5 / %x7FA / %x800-815 /
%x81A / %x824 / %x828 / %x840-858 /
%x860-86A / %x8A0-8B4 / %x8B6-8C7 / %x904-939 /
%x93D / %x950 / %x958-961 / %x971-980 /
%x985-98C / %x98F-990 / %x993-9A8 / %x9AA-9B0 /
%x9B2 / %x9B6-9B9 / %x9BD / %x9CE /
%x9DC-9DD / %x9DF-9E1 / %x9F0-9F1 / %x9FC /
%xA05-A0A / %xA0F-A10 / %xA13-A28 / %xA2A-A30 /
%xA32-A33 / %xA35-A36 / %xA38-A39 / %xA59-A5C /
%xA5E / %xA72-A74 / %xA85-A8D / %xA8F-A91 /
%xA93-AA8 / %xAAA-AB0 / %xAB2-AB3 / %xAB5-AB9 /
%xABD / %xAD0 / %xAE0-AE1 / %xAF9 /
%xB05-B0C / %xB0F-B10 / %xB13-B28 / %xB2A-B30 /
%xB32-B33 / %xB35-B39 / %xB3D / %xB5C-B5D /
%xB5F-B61 / %xB71 / %xB83 / %xB85-B8A /
%xB8E-B90 / %xB92-B95 / %xB99-B9A / %xB9C /
%xB9E-B9F / %xBA3-BA4 / %xBA8-BAA / %xBAE-BB9 /
%xBD0 / %xC05-C0C / %xC0E-C10 / %xC12-C28 /
%xC2A-C39 / %xC3D / %xC58-C5A / %xC60-C61 /
%xC80 / %xC85-C8C / %xC8E-C90 / %xC92-CA8 /
%xCAA-CB3 / %xCB5-CB9 / %xCBD / %xCDE /
%xCE0-CE1 / %xCF1-CF2 / %xD04-D0C / %xD0E-D10 /
%xD12-D3A / %xD3D / %xD4E / %xD54-D56 /
%xD5F-D61 / %xD7A-D7F / %xD85-D96 / %xD9A-DB1 /
%xDB3-DBB / %xDBD / %xDC0-DC6 / %xE01-E30 /
%xE32-E33 / %xE40-E46 / %xE81-E82 / %xE84 /
%xE86-E8A / %xE8C-EA3 / %xEA5 / %xEA7-EB0 /
%xEB2-EB3 / %xEBD / %xEC0-EC4 / %xEC6 /
%xEDC-EDF / %xF00 / %xF40-F47 / %xF49-F6C /
%xF88-F8C / %x1000-102A / %x103F / %x1050-1055 /
%x105A-105D / %x1061 / %x1065-1066 / %x106E-1070 /
%x1075-1081 / %x108E / %x10A0-10C5 / %x10C7 /
%x10CD / %x10D0-10FA / %x10FC-1248 / %x124A-124D /
%x1250-1256 / %x1258 / %x125A-125D / %x1260-1288 /
%x128A-128D / %x1290-12B0 / %x12B2-12B5 / %x12B8-12BE /
%x12C0 / %x12C2-12C5 / %x12C8-12D6 / %x12D8-1310 /
%x1312-1315 / %x1318-135A / %x1380-138F / %x13A0-13F5 /
%x13F8-13FD / %x1401-166C / %x166F-167F / %x1681-169A /
%x16A0-16EA / %x16F1-16F8 / %x1700-170C / %x170E-1711 /
%x1720-1731 / %x1740-1751 / %x1760-176C / %x176E-1770 /
%x1780-17B3 / %x17D7 / %x17DC / %x1820-1878 /
%x1880-1884 / %x1887-18A8 / %x18AA / %x18B0-18F5 /
%x1900-191E / %x1950-196D / %x1970-1974 / %x1980-19AB /
%x19B0-19C9 / %x1A00-1A16 / %x1A20-1A54 / %x1AA7 /
%x1B05-1B33 / %x1B45-1B4B / %x1B83-1BA0 / %x1BAE-1BAF /
%x1BBA-1BE5 / %x1C00-1C23 / %x1C4D-1C4F / %x1C5A-1C7D /
%x1C80-1C88 / %x1C90-1CBA / %x1CBD-1CBF / %x1CE9-1CEC /
%x1CEE-1CF3 / %x1CF5-1CF6 / %x1CFA / %x1D00-1DBF /
%x1E00-1F15 / %x1F18-1F1D / %x1F20-1F45 / %x1F48-1F4D /
%x1F50-1F57 / %x1F59 / %x1F5B / %x1F5D /
%x1F5F-1F7D / %x1F80-1FB4 / %x1FB6-1FBC / %x1FBE /
%x1FC2-1FC4 / %x1FC6-1FCC / %x1FD0-1FD3 / %x1FD6-1FDB /
%x1FE0-1FEC / %x1FF2-1FF4 / %x1FF6-1FFC / %x2071 /
%x207F / %x2090-209C / %x2102 / %x2107 /
%x210A-2113 / %x2115 / %x2119-211D / %x2124 /
%x2126 / %x2128 / %x212A-212D / %x212F-2139 /
%x213C-213F / %x2145-2149 / %x214E / %x2183-2184 /
%x2C00-2C2E / %x2C30-2C5E / %x2C60-2CE4 / %x2CEB-2CEE /
%x2CF2-2CF3 / %x2D00-2D25 / %x2D27 / %x2D2D /
%x2D30-2D67 / %x2D6F / %x2D80-2D96 / %x2DA0-2DA6 /
%x2DA8-2DAE / %x2DB0-2DB6 / %x2DB8-2DBE / %x2DC0-2DC6 /
%x2DC8-2DCE / %x2DD0-2DD6 / %x2DD8-2DDE / %x2E2F /
%x3005-3006 / %x3031-3035 / %x303B-303C / %x3041-3096 /
%x309D-309F / %x30A1-30FA / %x30FC-30FF / %x3105-312F /
%x3131-318E / %x31A0-31BF / %x31F0-31FF / %x3400-4DBF /
%x4E00-9FFC / %xA000-A48C / %xA4D0-A4FD / %xA500-A60C /
%xA610-A61F / %xA62A-A62B / %xA640-A66E / %xA67F-A69D /
%xA6A0-A6E5 / %xA717-A71F / %xA722-A788 / %xA78B-A7BF /
%xA7C2-A7CA / %xA7F5-A801 / %xA803-A805 / %xA807-A80A /
%xA80C-A822 / %xA840-A873 / %xA882-A8B3 / %xA8F2-A8F7 /
%xA8FB / %xA8FD-A8FE / %xA90A-A925 / %xA930-A946 /
%xA960-A97C / %xA984-A9B2 / %xA9CF / %xA9E0-A9E4 /
%xA9E6-A9EF / %xA9FA-A9FE / %xAA00-AA28 / %xAA40-AA42 /
%xAA44-AA4B / %xAA60-AA76 / %xAA7A / %xAA7E-AAAF /
%xAAB1 / %xAAB5-AAB6 / %xAAB9-AABD / %xAAC0 /
%xAAC2 / %xAADB-AADD / %xAAE0-AAEA / %xAAF2-AAF4 /
%xAB01-AB06 / %xAB09-AB0E / %xAB11-AB16 / %xAB20-AB26 /
%xAB28-AB2E / %xAB30-AB5A / %xAB5C-AB69 / %xAB70-ABE2 /
%xAC00-D7A3 / %xD7B0-D7C6 / %xD7CB-D7FB / %xF900-FA6D /
%xFA70-FAD9 / %xFB00-FB06 / %xFB13-FB17 / %xFB1D /
%xFB1F-FB28 / %xFB2A-FB36 / %xFB38-FB3C / %xFB3E /
%xFB40-FB41 / %xFB43-FB44 / %xFB46-FBB1 / %xFBD3-FD3D /
%xFD50-FD8F / %xFD92-FDC7 / %xFDF0-FDFB / %xFE70-FE74 /
%xFE76-FEFC / %xFF21-FF3A / %xFF41-FF5A / %xFF66-FFBE /
%xFFC2-FFC7 / %xFFCA-FFCF / %xFFD2-FFD7 / %xFFDA-FFDC /
%x10000-1000B / %x1000D-10026 / %x10028-1003A / %x1003C-1003D /
%x1003F-1004D / %x10050-1005D / %x10080-100FA / %x10280-1029C /
%x102A0-102D0 / %x10300-1031F / %x1032D-10340 / %x10342-10349 /
%x10350-10375 / %x10380-1039D / %x103A0-103C3 / %x103C8-103CF /
%x10400-1049D / %x104B0-104D3 / %x104D8-104FB / %x10500-10527 /
%x10530-10563 / %x10600-10736 / %x10740-10755 / %x10760-10767 /
%x10800-10805 / %x10808 / %x1080A-10835 / %x10837-10838 /
%x1083C / %x1083F-10855 / %x10860-10876 / %x10880-1089E /
%x108E0-108F2 / %x108F4-108F5 / %x10900-10915 / %x10920-10939 /
%x10980-109B7 / %x109BE-109BF / %x10A00 / %x10A10-10A13 /
%x10A15-10A17 / %x10A19-10A35 / %x10A60-10A7C / %x10A80-10A9C /
%x10AC0-10AC7 / %x10AC9-10AE4 / %x10B00-10B35 / %x10B40-10B55 /
%x10B60-10B72 / %x10B80-10B91 / %x10C00-10C48 / %x10C80-10CB2 /
%x10CC0-10CF2 / %x10D00-10D23 / %x10E80-10EA9 / %x10EB0-10EB1 /
%x10F00-10F1C / %x10F27 / %x10F30-10F45 / %x10FB0-10FC4 /
%x10FE0-10FF6 / %x11003-11037 / %x11083-110AF / %x110D0-110E8 /
%x11103-11126 / %x11144 / %x11147 / %x11150-11172 /
%x11176 / %x11183-111B2 / %x111C1-111C4 / %x111DA /
%x111DC / %x11200-11211 / %x11213-1122B / %x11280-11286 /
%x11288 / %x1128A-1128D / %x1128F-1129D / %x1129F-112A8 /
%x112B0-112DE / %x11305-1130C / %x1130F-11310 / %x11313-11328 /
%x1132A-11330 / %x11332-11333 / %x11335-11339 / %x1133D /
%x11350 / %x1135D-11361 / %x11400-11434 / %x11447-1144A /
%x1145F-11461 / %x11480-114AF / %x114C4-114C5 / %x114C7 /
%x11580-115AE / %x115D8-115DB / %x11600-1162F / %x11644 /
%x11680-116AA / %x116B8 / %x11700-1171A / %x11800-1182B /
%x118A0-118DF / %x118FF-11906 / %x11909 / %x1190C-11913 /
%x11915-11916 / %x11918-1192F / %x1193F / %x11941 /
%x119A0-119A7 / %x119AA-119D0 / %x119E1 / %x119E3 /
%x11A00 / %x11A0B-11A32 / %x11A3A / %x11A50 /
%x11A5C-11A89 / %x11A9D / %x11AC0-11AF8 / %x11C00-11C08 /
%x11C0A-11C2E / %x11C40 / %x11C72-11C8F / %x11D00-11D06 /
%x11D08-11D09 / %x11D0B-11D30 / %x11D46 / %x11D60-11D65 /
%x11D67-11D68 / %x11D6A-11D89 / %x11D98 / %x11EE0-11EF2 /
%x11FB0 / %x12000-12399 / %x12480-12543 / %x13000-1342E /
%x14400-14646 / %x16800-16A38 / %x16A40-16A5E / %x16AD0-16AED /
%x16B00-16B2F / %x16B40-16B43 / %x16B63-16B77 / %x16B7D-16B8F /
%x16E40-16E7F / %x16F00-16F4A / %x16F50 / %x16F93-16F9F /
%x16FE0-16FE1 / %x16FE3 / %x17000-187F7 / %x18800-18CD5 /
%x18D00-18D08 / %x1B000-1B11E / %x1B150-1B152 / %x1B164-1B167 /
%x1B170-1B2FB / %x1BC00-1BC6A / %x1BC70-1BC7C / %x1BC80-1BC88 /
%x1BC90-1BC99 / %x1D400-1D454 / %x1D456-1D49C / %x1D49E-1D49F /
%x1D4A2 / %x1D4A5-1D4A6 / %x1D4A9-1D4AC / %x1D4AE-1D4B9 /
%x1D4BB / %x1D4BD-1D4C3 / %x1D4C5-1D505 / %x1D507-1D50A /
%x1D50D-1D514 / %x1D516-1D51C / %x1D51E-1D539 / %x1D53B-1D53E /
%x1D540-1D544 / %x1D546 / %x1D54A-1D550 / %x1D552-1D6A5 /
%x1D6A8-1D6C0 / %x1D6C2-1D6DA / %x1D6DC-1D6FA / %x1D6FC-1D714 /
%x1D716-1D734 / %x1D736-1D74E / %x1D750-1D76E / %x1D770-1D788 /
%x1D78A-1D7A8 / %x1D7AA-1D7C2 / %x1D7C4-1D7CB / %x1E100-1E12C /
%x1E137-1E13D / %x1E14E / %x1E2C0-1E2EB / %x1E800-1E8C4 /
%x1E900-1E943 / %x1E94B / %x1EE00-1EE03 / %x1EE05-1EE1F /
%x1EE21-1EE22 / %x1EE24 / %x1EE27 / %x1EE29-1EE32 /
%x1EE34-1EE37 / %x1EE39 / %x1EE3B / %x1EE42 /
%x1EE47 / %x1EE49 / %x1EE4B / %x1EE4D-1EE4F /
%x1EE51-1EE52 / %x1EE54 / %x1EE57 / %x1EE59 /
%x1EE5B / %x1EE5D / %x1EE5F / %x1EE61-1EE62 /
%x1EE64 / %x1EE67-1EE6A / %x1EE6C-1EE72 / %x1EE74-1EE77 /
%x1EE79-1EE7C / %x1EE7E / %x1EE80-1EE89 / %x1EE8B-1EE9B /
%x1EEA1-1EEA3 / %x1EEA5-1EEA9 / %x1EEAB-1EEBB / %x20000-2A6DD /
%x2A700-2B734 / %x2B740-2B81D / %x2B820-2CEA1 / %x2CEB0-2EBE0 /
%x2F800-2FA1D / %x30000-3134A
; 131241 codepoints in total
; unicode codepoints from categories Nd, Nl
numbers = %x30-39 / %x660-669 / %x6F0-6F9 / %x7C0-7C9 /
%x966-96F / %x9E6-9EF / %xA66-A6F / %xAE6-AEF /
%xB66-B6F / %xBE6-BEF / %xC66-C6F / %xCE6-CEF /
%xD66-D6F / %xDE6-DEF / %xE50-E59 / %xED0-ED9 /
%xF20-F29 / %x1040-1049 / %x1090-1099 / %x16EE-16F0 /
%x17E0-17E9 / %x1810-1819 / %x1946-194F / %x19D0-19D9 /
%x1A80-1A89 / %x1A90-1A99 / %x1B50-1B59 / %x1BB0-1BB9 /
%x1C40-1C49 / %x1C50-1C59 / %x2160-2182 / %x2185-2188 /
%x3007 / %x3021-3029 / %x3038-303A / %xA620-A629 /
%xA6E6-A6EF / %xA8D0-A8D9 / %xA900-A909 / %xA9D0-A9D9 /
%xA9F0-A9F9 / %xAA50-AA59 / %xABF0-ABF9 / %xFF10-FF19 /
%x10140-10174 / %x10341 / %x1034A / %x103D1-103D5 /
%x104A0-104A9 / %x10D30-10D39 / %x11066-1106F / %x110F0-110F9 /
%x11136-1113F / %x111D0-111D9 / %x112F0-112F9 / %x11450-11459 /
%x114D0-114D9 / %x11650-11659 / %x116C0-116C9 / %x11730-11739 /
%x118E0-118E9 / %x11950-11959 / %x11C50-11C59 / %x11D50-11D59 /
%x11DA0-11DA9 / %x12400-1246E / %x16A60-16A69 / %x16B50-16B59 /
%x1D7CE-1D7FF / %x1E140-1E149 / %x1E2F0-1E2F9 / %x1E950-1E959 /
%x1FBF0-1FBF9
; 886 codepoints in total
; unicode codepoints from categories Mn, Mc
combining_marks = %x300-36F / %x483-487 / %x591-5BD / %x5BF /
%x5C1-5C2 / %x5C4-5C5 / %x5C7 / %x610-61A /
%x64B-65F / %x670 / %x6D6-6DC / %x6DF-6E4 /
%x6E7-6E8 / %x6EA-6ED / %x711 / %x730-74A /
%x7A6-7B0 / %x7EB-7F3 / %x7FD / %x816-819 /
%x81B-823 / %x825-827 / %x829-82D / %x859-85B /
%x8D3-8E1 / %x8E3-903 / %x93A-93C / %x93E-94F /
%x951-957 / %x962-963 / %x981-983 / %x9BC /
%x9BE-9C4 / %x9C7-9C8 / %x9CB-9CD / %x9D7 /
%x9E2-9E3 / %x9FE / %xA01-A03 / %xA3C /
%xA3E-A42 / %xA47-A48 / %xA4B-A4D / %xA51 /
%xA70-A71 / %xA75 / %xA81-A83 / %xABC /
%xABE-AC5 / %xAC7-AC9 / %xACB-ACD / %xAE2-AE3 /
%xAFA-AFF / %xB01-B03 / %xB3C / %xB3E-B44 /
%xB47-B48 / %xB4B-B4D / %xB55-B57 / %xB62-B63 /
%xB82 / %xBBE-BC2 / %xBC6-BC8 / %xBCA-BCD /
%xBD7 / %xC00-C04 / %xC3E-C44 / %xC46-C48 /
%xC4A-C4D / %xC55-C56 / %xC62-C63 / %xC81-C83 /
%xCBC / %xCBE-CC4 / %xCC6-CC8 / %xCCA-CCD /
%xCD5-CD6 / %xCE2-CE3 / %xD00-D03 / %xD3B-D3C /
%xD3E-D44 / %xD46-D48 / %xD4A-D4D / %xD57 /
%xD62-D63 / %xD81-D83 / %xDCA / %xDCF-DD4 /
%xDD6 / %xDD8-DDF / %xDF2-DF3 / %xE31 /
%xE34-E3A / %xE47-E4E / %xEB1 / %xEB4-EBC /
%xEC8-ECD / %xF18-F19 / %xF35 / %xF37 /
%xF39 / %xF3E-F3F / %xF71-F84 / %xF86-F87 /
%xF8D-F97 / %xF99-FBC / %xFC6 / %x102B-103E /
%x1056-1059 / %x105E-1060 / %x1062-1064 / %x1067-106D /
%x1071-1074 / %x1082-108D / %x108F / %x109A-109D /
%x135D-135F / %x1712-1714 / %x1732-1734 / %x1752-1753 /
%x1772-1773 / %x17B4-17D3 / %x17DD / %x180B-180D /
%x1885-1886 / %x18A9 / %x1920-192B / %x1930-193B /
%x1A17-1A1B / %x1A55-1A5E / %x1A60-1A7C / %x1A7F /
%x1AB0-1ABD / %x1ABF-1AC0 / %x1B00-1B04 / %x1B34-1B44 /
%x1B6B-1B73 / %x1B80-1B82 / %x1BA1-1BAD / %x1BE6-1BF3 /
%x1C24-1C37 / %x1CD0-1CD2 / %x1CD4-1CE8 / %x1CED /
%x1CF4 / %x1CF7-1CF9 / %x1DC0-1DF9 / %x1DFB-1DFF /
%x20D0-20DC / %x20E1 / %x20E5-20F0 / %x2CEF-2CF1 /
%x2D7F / %x2DE0-2DFF / %x302A-302F / %x3099-309A /
%xA66F / %xA674-A67D / %xA69E-A69F / %xA6F0-A6F1 /
%xA802 / %xA806 / %xA80B / %xA823-A827 /
%xA82C / %xA880-A881 / %xA8B4-A8C5 / %xA8E0-A8F1 /
%xA8FF / %xA926-A92D / %xA947-A953 / %xA980-A983 /
%xA9B3-A9C0 / %xA9E5 / %xAA29-AA36 / %xAA43 /
%xAA4C-AA4D / %xAA7B-AA7D / %xAAB0 / %xAAB2-AAB4 /
%xAAB7-AAB8 / %xAABE-AABF / %xAAC1 / %xAAEB-AAEF /
%xAAF5-AAF6 / %xABE3-ABEA / %xABEC-ABED / %xFB1E /
%xFE00-FE0F / %xFE20-FE2F / %x101FD / %x102E0 /
%x10376-1037A / %x10A01-10A03 / %x10A05-10A06 / %x10A0C-10A0F /
%x10A38-10A3A / %x10A3F / %x10AE5-10AE6 / %x10D24-10D27 /
%x10EAB-10EAC / %x10F46-10F50 / %x11000-11002 / %x11038-11046 /
%x1107F-11082 / %x110B0-110BA / %x11100-11102 / %x11127-11134 /
%x11145-11146 / %x11173 / %x11180-11182 / %x111B3-111C0 /
%x111C9-111CC / %x111CE-111CF / %x1122C-11237 / %x1123E /
%x112DF-112EA / %x11300-11303 / %x1133B-1133C / %x1133E-11344 /
%x11347-11348 / %x1134B-1134D / %x11357 / %x11362-11363 /
%x11366-1136C / %x11370-11374 / %x11435-11446 / %x1145E /
%x114B0-114C3 / %x115AF-115B5 / %x115B8-115C0 / %x115DC-115DD /
%x11630-11640 / %x116AB-116B7 / %x1171D-1172B / %x1182C-1183A /
%x11930-11935 / %x11937-11938 / %x1193B-1193E / %x11940 /
%x11942-11943 / %x119D1-119D7 / %x119DA-119E0 / %x119E4 /
%x11A01-11A0A / %x11A33-11A39 / %x11A3B-11A3E / %x11A47 /
%x11A51-11A5B / %x11A8A-11A99 / %x11C2F-11C36 / %x11C38-11C3F /
%x11C92-11CA7 / %x11CA9-11CB6 / %x11D31-11D36 / %x11D3A /
%x11D3C-11D3D / %x11D3F-11D45 / %x11D47 / %x11D8A-11D8E /
%x11D90-11D91 / %x11D93-11D97 / %x11EF3-11EF6 / %x16AF0-16AF4 /
%x16B30-16B36 / %x16F4F / %x16F51-16F87 / %x16F8F-16F92 /
%x16FE4 / %x16FF0-16FF1 / %x1BC9D-1BC9E / %x1D165-1D169 /
%x1D16D-1D172 / %x1D17B-1D182 / %x1D185-1D18B / %x1D1AA-1D1AD /
%x1D242-1D244 / %x1DA00-1DA36 / %x1DA3B-1DA6C / %x1DA75 /
%x1DA84 / %x1DA9B-1DA9F / %x1DAA1-1DAAF / %x1E000-1E006 /
%x1E008-1E018 / %x1E01B-1E021 / %x1E023-1E024 / %x1E026-1E02A /
%x1E130-1E136 / %x1E2EC-1E2EF / %x1E8D0-1E8D6 / %x1E944-1E94A /
%xE0100-E01EF
; 2282 codepoints in total
@marzer Great!
@pradyunsg Assuming that one of us prepares a PR, is there any chance that this would be merged relatively quickly? Or does it have to wait until 1.0 is released in any case?
This feels like a significant change to TOML's interpretation of being "minimal". Maybe we should ask Tom himself to bless this change?
Is it though? The language itself will be just as minimal as before, since this change will be backwards-compatible. In fact it would actually increase the simplicity of TOML files since keys should work in a WYSIWYG way for more people, and only require quotes in very specific circumstances.
It will complicate it for implementers, sure, but not all that much.
@marzer, I have no objections to the Unicode version of this proposal. However, I do feel that the change is too significant to apply without asking Tom.
I can't help but think that there must be a simpler way of expressing all this. Sorry to be a wet blanket, especially after the great work that everyone here's put into making this happen.
ABNF wasn't defined for use with Unicode. The Unicode folks offer guidelines for adding Unicode support to regex engines, including character classes, but ABNF as it stands doesn't provide a way to use character classes like this.
What would it take to augment ABNF with general Unicode categories? Something like \p{Lu}, or \p{Uppercase Letter}, would make toml.abnf readable, not to mention compatible with newer Unicode versions as they are released.
Or we could just cop-out and do what Scala did, but in ABNF.
upper ::= ‘A’ | … | ‘Z’ | ‘$’ | ‘_’ // and Unicode category Lu
lower ::= ‘a’ | … | ‘z’ // and Unicode category Ll
letter ::= upper | lower // and Unicode categories Lo, Lt, Nl
I'm fussing over this because I'm not keen on suddenly doubling the size of our ABNF file with a collection of character sets that are better defined somewhere else.
@eksortso: That's an interesting idea – I agree that it would be much nicer to somehow reference the Unicode categories in the ABNF file. Assuming the regex syntax were adopted as a kind of unofficial ABNF extension, the proposed new syntax for unquoted keys could be expressed very succinctly:
unquoted-key = ukey-start *ukey-continued
ukey-start = \p{L} / \p{Nd} / \p{Nl} / "_" / "-"
ukey-continued = ukey-start / \p{Mc} / \p{Mn}
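As a rough illustration of how little code the category-based rule needs, here is a minimal Python sketch built on the standard `unicodedata` module. The function names are hypothetical and not taken from any TOML parser. The result depends on the Unicode version bundled with the interpreter, which is the same versioning concern raised above.

```python
import unicodedata

# General categories allowed at the start of a bare key: every letter
# category (L*), decimal digits (Nd) and letter numbers (Nl).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl"}
# Continuation characters additionally allow spacing and non-spacing marks.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mc", "Mn"}


def is_ukey_start(ch: str) -> bool:
    """True if the single character `ch` may start an unquoted key."""
    return ch in ("_", "-") or unicodedata.category(ch) in START_CATEGORIES


def is_ukey_continued(ch: str) -> bool:
    """True if the single character `ch` may follow the first character."""
    return ch in ("_", "-") or unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_unquoted_key(key: str) -> bool:
    """Hypothetical helper: does `key` match the unquoted-key rule sketched above?"""
    return (
        len(key) > 0
        and is_ukey_start(key[0])
        and all(is_ukey_continued(ch) for ch in key[1:])
    )
```

With this sketch, a key such as größe or ключ is accepted, while a key containing a space or a dot, or one that begins with a combining mark, is rejected.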
An alternative, more in line with how ABNF generally works, is to assume a series of predefined UNICODE_CATEGORY_X and UNICODE_CATEGORY_XX rules, complementing the ASCII-only core rules that are already defined by the standard (such as ALPHA, DIGIT, HEXDIG). This would allow us to write
unquoted-key = ukey-start *ukey-continued
ukey-start = UNICODE_CATEGORY_L / UNICODE_CATEGORY_ND / UNICODE_CATEGORY_NL / "_" / "-"
ukey-continued = ukey-start / UNICODE_CATEGORY_MC / UNICODE_CATEGORY_MN
This would be a syntactically valid ABNF file, but, of course, the parser would complain about undefined rules.
To possibly address this, we could add a second ABNF, unicode_categories.abnf, autogenerated by a variant of @marzer's script and re-generated whenever a new version of Unicode has been released (we should add the script to the TOML repository as well). In theory, it could list all categories, but for practical purposes, it's probably better to just generate and list the handful of rules that are actually used in the toml.abnf.
Anyone who wants to actually use the ABNF for some purpose can then concatenate the two files before handing them over to the ABNF parser. (I haven't seen any kind of "include" command in ABNF.) Those who just want to understand the details of TOML's syntax model can read the toml.abnf without having to wade through all the Unicode stuff.
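To make the autogeneration idea concrete, here is a hedged Python sketch of what such a generator could look like. It is not @marzer's script. The function names and the rule name in the usage note are hypothetical, and the output depends on the Unicode version shipped with the Python interpreter. Note that RFC 5234 rule names may only contain letters, digits and hyphens, so the usage note uses a hyphenated spelling.

```python
import sys
import unicodedata


def category_ranges(categories):
    """Collect inclusive (first, last) code point ranges whose general
    category is one of `categories`."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        inside = unicodedata.category(chr(cp)) in categories
        if inside and start is None:
            start = cp                      # a new range begins here
        elif not inside and start is not None:
            ranges.append((start, cp - 1))  # the range ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def format_abnf_rule(name, ranges, per_line=4):
    """Render ranges as ABNF alternatives, e.g. `%x41-5A / %xB5`."""
    terms = [
        f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}"
        for lo, hi in ranges
    ]
    lines = [
        " / ".join(terms[i:i + per_line])
        for i in range(0, len(terms), per_line)
    ]
    indent = " " * (len(name) + 3)
    return f"{name} = " + (" /\n" + indent).join(lines)
```

Calling `format_abnf_rule("unicode-category-lu", category_ranges({"Lu"}))` would produce one rule in the same `%x...` notation as the long listings earlier in this thread. Writing a handful of such rules to a separate file keeps toml.abnf itself short.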
After reading this discussion, it seems to start to boil down to what looks fairly close to what is allowed in XML names, which in turn is what is allowed in HTML tag-names and @id/@name targets, which are fairly well understood.
I would, therefore, like to propose to simply take the definition of NmToken from the XML specification. The difference from Name is that it allows starting with digits, which is allowed in TOML.
Here's the EBNF (not ABNF, but it is easy to translate/understand), which is pretty close to earlier suggestions above (the long lists in the posts), but with the addition of unassigned ranges. See the argument below for that.
/* interpreted from NameStartChar and NameChar */
NameChar := "-" | "." | [0-9] | ":" ; we should remove colon and dot
| [A-Z] | "_" | [a-z] | #xB7 ; xB7 is MIDDLE DOT
| [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF]
| [#x300-#x37D] | [#x37F-#x1FFF] ; this excludes GREEK QUESTION MARK
| [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F]
| [#x2C00-#x2FEF] | [#x3001-#xD7FF]
| [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
| [#x10000-#xEFFFF]
NmToken := (NameChar)+
We should probably remove the "." and the colon ":" from the definition.
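To show how compact a range-based check is in practice, here is a small Python sketch that transcribes the NameChar ranges from the production above into a single regular expression, with "." and ":" already left out as suggested. The names `_NAME_CHAR_CLASS` and `is_nmtoken_key` are made up for this illustration.

```python
import re

# Character class transcribed from the NameChar production quoted above,
# with "." and ":" removed as suggested in this comment.
_NAME_CHAR_CLASS = (
    r"\-0-9A-Z_a-z\u00B7"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0300-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u203F-\u2040\u2070-\u218F"
    r"\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)
_NM_TOKEN = re.compile(f"[{_NAME_CHAR_CLASS}]+")


def is_nmtoken_key(key: str) -> bool:
    """Hypothetical helper: True if every character of `key` is in the class above."""
    return _NM_TOKEN.fullmatch(key) is not None
```

Because the ranges are fixed, nothing here depends on the Unicode version of the runtime, which is the main trade-off against a category-based check.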
I can see a bunch of advantages for copying (most of) this definition:
#name
identifiers, they all obey to these rules and it helps adoption.XName
).
x7B - xBF (allowed in comments above), because these ASCII ranges are used as control characters on many systems (also, there's little in there that really warrants inclusion anyway)
I personally find the ease of implementation and the non-ambiguous future- and history-proof fixed ranges to be a large advantage.
Copying here so you don't have to go to the XML spec, but I found these arguments pretty conclusive and clear:
Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.
Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.
(we should remove COLON and FULL STOP, obviously)
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.
@abelbraaksma That's an interesting idea. It would certainly simplify ABNF handling a lot. Nevertheless, conceptually I find
"arbitrary Unicode letters and digits as well as ASCII underscores (_) and dashes (-)"
much clearer than
"almost an XMLName, except that leading digits are allowed and a few characters have been excluded"
Everybody knows what letters and digits are, even though nobody will know all the letters that exist in various scripts. But who knows what exactly is and is not allowed in an XMLName? I suspect not even the creators of XML (without looking it up).
Also, since they don't know which Unicode characters are marks (combining characters) and which are not (see discussion above), it seems that they allow an XMLName to start with a mark, i.e. a character that graphically fuses with and modifies the character that precedes it – in front of the XMLName. Of course, well-behaving authors won't use such names, but it's still a bit odd.
Also, while I didn't check exact ranges, I suppose that most of Close Punctuation – some of which looks quite similar to the square brackets which terminate a table name – and Other Punctuation – some of which looks quite similar to the dots separating key parts in TOML – is allowed. Again, people obviously don't have to use such dubious characters, but I still would prefer not to allow them in bare keys.
A final minor note: If we go down that route (I'm not strictly against it, though I have my reservations as noted), we should certainly exclude the middle dot as well – since it's a singleton in the listed ranges, that's trivial to do.
@ChristianSi, thanks for even considering it :). Allow me to try to address your concerns from my viewpoint.
much clearer than
It is a matter of communication. Your first statement is indeed clearer. A common way to refer to the XML name production is to say "Any Unicode letter, letterlike character or digit, except dot".
The precise production is not relevant; it is permissive by design, and follows the "principle of least surprise" for users. However you write a word in your language or writing system, it is allowed as a key.
nobody will know all the letters that exist in various scripts
Indeed, but that is true for any proposal in this thread so far. But is that relevant? All you need to know is a subset, and the rule is that you can use whatever subset you are comfortable with in your preferred language.
I suspect not even the creators of XML (without looking it up).
I am part of the group that maintained XML, and we all have to look it up from time to time. But, when writing code, we never have to bother to look it up: what we feel as correct, usually is syntactically correct as well.
since they don't know which Unicode characters are marks (combining characters) and which are not it seems that they allow an XMLName to start with a mark
A bit yes and no. I accidentally copied the main diacritical marks section [#x300-#x37D] to the production above, but they are not allowed as start character in XML and we should probably disallow it as well.
Since we cannot possibly predict the natural language used by programmers — be it from China, Taiwan, Swaziland or even abugida languages like Thai or Bengali — we shouldn't limit the ability to write in their native tongue. If you are Tamil and you have to write without diacritics (and other combining marks) it becomes illegible. But none of these languages will start with a combining mark, so that is safe to exclude.
a character that graphically fuses with and modifies the character that precedes it – in front of the XMLName.
While true, we should not consider what "graphically fuses", as that is a subject about rendering Unicode fonts. How it "looks" is irrelevant for how it parses. A parser operates on codepoints, not graphemes, clusters or byte sequences.
We should, however, make a note on normalization (regardless of what Unicode subset we're going to allow). Consider ñaña, a perfectly valid word in Spanish. Its ñ can be written 0x6E 0x303 (two codepoints, n followed by a combining mark) or 0xF1 (ñ, one codepoint). These two render slightly differently, but should mean the same. Another example is in Dutch (my language), where ij (two codepoints i and j) means the same as ĳ (0x133, one codepoint). Whether or not we should say so in the spec, I would suggest that parsers follow NFC (or NFD), so that the chosen editor will not affect behavior.
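The effect of normalization on key comparison can be seen in a few lines of Python using the standard `unicodedata` module. This is only a sketch: `normalize_key` is a hypothetical helper, and nothing here is mandated by the TOML spec. One detail worth knowing is that the Dutch ligature at 0x133 only has a compatibility decomposition, so NFC and NFD leave it alone; only the compatibility forms (NFKC or NFKD) map it to the two letters i and j.

```python
import unicodedata

# The same Spanish word, spelled with different code point sequences.
decomposed = "n\u0303a\u00f1a"   # first letter as n + COMBINING TILDE
precomposed = "\u00f1a\u00f1a"   # both n-tilde letters as single code points


def normalize_key(key: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize a key before comparing or storing it."""
    return unicodedata.normalize(form, key)


# Compared code point by code point, the two spellings differ ...
raw_equal = decomposed == precomposed
# ... but after NFC normalization they are identical.
nfc_equal = normalize_key(decomposed) == normalize_key(precomposed)

# The ligature U+0133 survives NFC unchanged; NFKC folds it to "ij".
ligature_nfc = normalize_key("\u0133")
ligature_nfkc = normalize_key("\u0133", "NFKC")
```

A parser that compares keys without normalizing would treat the two spellings of the word as different keys, which is the behavior the paragraph above warns about.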
An alternative is to make normalization not a mandatory thing, but just to make a note in the docs that says that it may affect identifiers. This is, btw, true for any programming (and non-programming) language out there and JSON is also not immune to normalization issues.
I suppose that most of Close Punctuation – some of which looks quite similar to the square brackets which terminate a table name – and Other Punctuation
Some are included and some aren't. The ones you find on your keyboard are definitely excluded. The ones that most resemble often-used closing/opening punctuation are also excluded (like 0x5D, or ], and 0x27E9, or ⟩).
I checked further, and indeed, with this rule in place, this is valid: footnote₍₁₎, and this too: Yig-༒-FurSat-༆. But is that problematic? If authors really want it, they can (and these ranges contain valid characters in some languages), but generally, I think they'll stay away from it anyway. It follows my personal favorite: be liberal in what you accept.
Again, people obviously don't have to use such dubious characters, but I still would prefer not to allow them in bare keys.
That's almost a philosophical question and much debate has preceded the current one (I've been a member of the XML working group at W3.org for 8 years). In .NET, the CLR accepts any string as identifier. But C# chose a more restrictive set. F#, in contrast, allows (almost) any character.
The more restrictive you are, the harder it is to learn what is allowed and why. Personally, the ability to write something that looks like, but not quite is, a separating token in TOML is a strong argument in favor of allowing this. If you don't want to complicate things in your own scripts, that's good, but should you disallow it for others? They'll always try to find a way to express clearly what they want.
For instance, I like to use the @-sign in keys and identifiers, because it has special meaning in my context (specifies that it applies to attributes). However, the normal @-sign is disallowed in my language of choice, F#. So I use ＠ (0xFF20), which makes my code more readable than writing at-Something.
If the F# designers had the same idea as you: if it looks like something we currently disallow, then we should disallow it altogether, then they'd have to disallow anything that looks like reserved characters in their view. And then we haven't even discussed ligatures: we can always make something look like something forbidden.
we should certainly exclude the middle dot as well
I have no strong opinion one way or the other.
In conclusion: sorry for the long post ;). There's three more things I'd like to add, though:
@abelbraaksma: You make a strong case – consider me semi-convinced. For me, the important thing is that people should be able to use words in arbitrary languages as a kind of "bareword", without having to quote them. Whether we realize that goal by allowing "arbitrary Unicode letters and digits" or XMLNames is less important – both would work. Hence it might more or less boil down to a question of what's easier for implementers. I wonder what others here think?
Regarding ranges full of marks, you mention that "they are not allowed as start character in XML and we should probably disallow it as well." Maybe you could investigate that further? It would be interesting to see which ranges are prohibited in NameStartChar, but allowed in NameChar – and why. If they are full of non-ASCII digits, we might want to allow them even at the start, but if they are full of marks, we certainly don't.
If you do choose to adopt the XMLNames syntax (or a variant), then you get as a bonus that TOML will be roundtrippable to XML, where identifiers can be used as element names.
Not all, since quoted keys (which allow arbitrary characters) are legal as well and will not go away. But in any case I think we agree that that's not the main issue.
Normalization is an altogether different issue, since it also concerns quoted keys, so its relevance to the TOML spec is unrelated to the question of whether or not additional characters are allowed in the unquoted variety. Maybe you could open a new issue to discuss it?
@ChristianSi
Hence it might more or less boil down to a question of what's easier for implementers. I wonder what others here think?
Having just implemented the unicode group-based approach in my parser I can attest that the more permissive set of character ranges suggested by @abelbraaksma would have been easier, though now that the work is done I don't need to revisit it (except to incorporate updates to the Unicode standard, but that amounts to running a script once).
@abelbraaksma's point about varying levels of correctness among other unicode implementations is a good one, though.
I'm coming to this as someone who is incorporating TOML into a project with keys that will often contain symbols/punctuation. I've read through this thread and I have not seen anyone propose that keys allow any valid unicode except the symbols needed by the TOML parser itself. That would:
I'm not currently arguing this is the best approach but it seemed worth adding to the set of options in the discussion space.
If anyone is interested in playing around with a parser that supports this tentative feature (as specified in the OP, anyways), my C++ TOML library is now in a publishable state: https://marzer.github.io/tomlplusplus/
@thoughtafter It seems as though your suggestion is very much in-line with @abelbraaksma's (which from my reading, advocates including everything except syntactically-relevant/ambiguous characters).
In my opinion, we don't program in any natural language, including English. What we write in code is symbols. ASCII in programming is a set of safe symbols, not English.
In high-level languages, identifiers could be allowed to contain any character, because there are IDEs and syntax highlighting.
But TOML is designed for ini-style files, which are usually edited without any extra tooling. That's also why TOML is better than YAML (and the main reason TOML exists): without tooling we can't easily indent and de-indent nested values or multiline strings. In any other language an indentation-based design is better than a non-indented one, as we all know.
So I think it's really dangerous to allow bare keys to include non-ASCII characters.
But I think it would be good for the spec to allow implementations to support bare keys in user-specified languages. You use the languages you are familiar with. For example, /[\u4e00-\u9fa5]/ is Chinese, so it can be easily supported, which makes bare keys easy to write, and safe. Who else would know or care? But as a user of a specific language, I know, and I can pass a range argument as an option to the parser, keeping things highly controlled at the same time.
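The opt-in approach described here could look roughly like the following Python sketch. It is purely illustrative: `make_bare_key_checker` and its `extra_ranges` option are hypothetical and do not correspond to any existing TOML parser API. The default accepts only the ASCII bare-key characters; a caller can widen it with explicit code point ranges.

```python
import string

# The characters TOML already allows in bare keys: A-Z, a-z, 0-9, "_" and "-".
ASCII_BARE_CHARS = set(string.ascii_letters + string.digits + "_-")


def make_bare_key_checker(extra_ranges=()):
    """Build a key checker; `extra_ranges` is a hypothetical parser option
    holding inclusive (first, last) code point pairs the user opted into."""
    ranges = tuple(extra_ranges)

    def is_bare_key(key: str) -> bool:
        return len(key) > 0 and all(
            ch in ASCII_BARE_CHARS
            or any(lo <= ord(ch) <= hi for lo, hi in ranges)
            for ch in key
        )

    return is_bare_key


# ASCII only, and ASCII plus the range from the regex quoted above.
strict = make_bare_key_checker()
with_cjk = make_bare_key_checker([(0x4E00, 0x9FA5)])
```

Here `strict` rejects a key such as 名前, while `with_cjk` accepts it and still rejects letters from scripts the user did not opt into.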
I think simplicity and support for national languages need not conflict; it doesn't have to be one or the other. Otherwise, absolute "fairness" will lead to widespread inefficiency.
@LongTengDao it's a config file format, not a database specification or real-time streaming format; I don't think 'efficiency' is all that relevant (if you mean the computational complexity of parsing, that is).
Unless you mean the efficiency of the actual implementing of the new functionality? As in, it will be a bit complex for implementers and maintainers to get this working in their parsers, thus being inefficient for them? If so, that's not even true. It's pretty easy to implement. I've done it myself, and provide relevant information in the original post.
I'm not sure what other sorts of efficiencies you could mean. It wouldn't make TOML any less efficient to write (if anything it would get simpler and easier to use as a result of this proposal).
@marzer I've never considered the difficulty of writing a parser to be a hindrance, and it's not worth considering when the task is designing a well-formed file format. If anyone objects on those grounds, I will be on your side.
I only mean the efficiency of writing and checking. Introducing special characters too broadly will make the process of reading and writing a file stressful again. Remember, Unicode doesn't just include characters in common languages like the ones you and I use (1en or 1em width). Instead, there are many combinations of display and invisible even right-to-left characters that are in character category, rather than punctuation or whitespace category. It's a nightmare, as you'll know if you've ever developed typesetting software like Office Word. After that, I have been frightened by words like "all valid Unicode".
But anyway, it doesn't affect my use. If spec said ASCII only, I will support user options to support any Unicode character range. If spec said any Unicode character is valid as your suggestion, I will support user options to limit ASCII only. I think this right belongs to user.
If spec said any Unicode character is valid as your suggestion
@LongTengDao to be clear, my proposal isn't to support "any Unicode character", as you seem to think. It's to support a subset (letters, numbers, and some combining marks).
Yeah there might be characters in those categories that are effectively garbage for our purposes but they can probably just be ignored; if it's not a character on a keyboard then someone has gone to effort to put it in their config, and if that breaks stuff then that's the life they chose. Parser library users can trivially add additional sanity-checking if they feel the need.
Instead, there are many combinations of display and invisible even right-to-left characters that are in character category, rather than punctuation or whitespace category.
@LongTengDao I believe the opposite to be true. Limiting users that are accustomed to right-to-left writing means limiting over 1.7 billion people worldwide to a system that is not native to them. What is perhaps perceived by you as "a nightmare" is perceived by others as a nightmare if it isn't allowed. Not everyone speaks English or can write in their native tongue using only ASCII characters (in fact, it is a relatively small share of the world population).
Inclusion of other cultures, languages and writing systems is a good thing, and although TOML is not a programming language, many well-known programming languages embrace inclusion more than exclusion: C#, VB, F# (allows any character), Java (they allow a broader set than defined here), Ruby, Perl, XML/HTML tag names, CSS classes/id and there are many more.
Unicode even has a specific TR that describes the recommended way for allowing Unicode characters in identifiers: https://unicode.org/reports/tr31/.
Differences between languages will always exist, but the closer a language (or a spec like TOML) gets to TR31, the better it is for the worldwide community of thousands of languages that can write in their native tongue.
If any company or individual wishes to limit the allowed set of characters in identifiers, or in coding in general, they are of course free to do so, just like coding styles exist for many programming languages, you could limit your style to "only ASCII" or whatever you prefer.
And as has already been said, the proposal here covers a safe subset of Unicode.
Regarding ranges full of marks, you mention that "they are not allowed as start character in XML and we should probably disallow it as well." Maybe you could investigate that further? It would be interesting to see which ranges are prohibited in NameStartChar, but allowed in NameChar – and why. If they are full of non-ASCII digits, we might want to allow them even at the start, but if they are full of marks, we certainly don't.
@ChristianSi, apologies for the wait, I forgot about your question here.
The precise definition of NameChar is that of NameStartChar with a few additions. These additions are therefore not allowed as a starting character:
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Let's split that up. According to this, the following are not allowed as a starting character (and I believe this mostly follows current practice in TOML as well):
The hyphen - and the period .
The digits 0-9
xB7, or · (middle dot)
x300 - x36F, these are the combining diacritical marks, which means you cannot start with a combining acute or circumflex, for instance; see https://en.wikipedia.org/wiki/Combining_Diacritical_Marks. I think this makes sense.
x203F and x2040, or ‿ (under tie) and ⁀ (character tie), https://en.wikipedia.org/wiki/Tie_(typography)
I'm not sure why the "tie" is forbidden as starting character in XML names (it is not a combining tie, it is spaced), but the other ones seem sensible.
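The continuation-only set listed above is small enough to write down as a table. The following Python sketch does just that; the names `NAME_CHAR_ONLY` and `is_continue_only` are made up for illustration, and the period is kept in the table only to mirror the XML production (a TOML variant would drop it, since the dot separates key parts).

```python
# Ranges that NameChar adds on top of NameStartChar, taken from the
# production quoted above. A character in this table may appear inside
# a name but may not start one.
NAME_CHAR_ONLY = [
    (0x2D, 0x2D),      # "-" HYPHEN-MINUS
    (0x2E, 0x2E),      # "." FULL STOP (would be dropped for TOML)
    (0x30, 0x39),      # ASCII digits 0-9
    (0xB7, 0xB7),      # MIDDLE DOT
    (0x0300, 0x036F),  # combining diacritical marks
    (0x203F, 0x2040),  # UNDERTIE and CHARACTER TIE
]


def is_continue_only(ch: str) -> bool:
    """True if the single character `ch` is allowed in a name only after its start."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_CHAR_ONLY)
```

For TOML the digit and hyphen entries would presumably move to the start set as well, since bare keys may already begin with either; that is an assumption about the eventual rule, not something settled in this comment.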
I can write up a TOML spec proposal for this set, and/or extend it to TR31 if that somehow makes sense, but I think it is easier for people in general to use the XML specification (without reference to XML, of course, as it is otherwise unrelated), since they already did the necessary research, it's concise, and it's trivial to implement. TR31 is quite hard to read and probably raises new questions again.
This looks like we're missing a PR for doing this. If someone wants to pick this up, and file a PR expanding the allowed bare keys syntax to include letters from the broader unicode spec, that'd be welcome!
@pradyunsg, done, I've created a PR in #891. I tried to be both as inclusive as possible, while maintaining simplicity for parsers. Basically the rule is now: "Any Unicode letter, letterlike character or digit, except dot", as discussed above.
This has now been merged. Thanks everyone for their support and insights!
Issue
TOML's "bare key" syntax is too restrictive. People who regularly use characters from languages other than English should be able to do so in TOML keys without additional gymnastics.
I know there's already been a lot of discussion about this but much of it was from when TOML was less established and I think it warrants revisiting.
Proposed change
Expand the set of accepted characters allowed in bare keys to include letters and numbers from the entire Unicode space, similar to how identifiers are handled in other Unicode-compliant contexts (e.g. Python, JavaScript, etc.). Specifically:
Rationale
After reading much of the existing discussion on the issue, I've identified the points below as being the main objections. I've written a counterpoint for each.
"ASCII-only is easy to understand"
Allowing Unicode letters and numbers wouldn't change the understandability of the written word in "mostly-ASCII" contexts, excepting maybe people from English-centric countries encountering characters they otherwise rarely see and being unsure how to pronounce them. I'm one of those people and my brain seems to consume them just fine. And it's almost certainly going to improve the understandability of bare keys to people for whom an ASCII environment is not their regular one.
It also wouldn't change the semantic/syntactic understandability of the language; I'm only advocating relaxing the spec to allow letter and number characters, not anything that might be confused for a language construct (no math symbols, for instance).
"Guides users to choose simple key names"
See above. I'd argue that the keys would be no less simple with this change. I live and work in a European country and a number of my friends and colleagues have non-ASCII letters in their name (e.g. ä). I doubt they consider their names to be complex; I certainly don't. If anything, by forcing people to jump through hoops just to type in their language, we're actually making the key names more complex w.r.t. cognitive load.
"Eliminate any weirdness that could come from having to deal with undelimited Unicode"
The TOML spec dictates UTF-8, not UTF-8-ish. UTF-8 is a solved problem at this point. If a parser doesn't correctly detect and handle malformed UTF-8, I'd argue that the parser needs fixing, not that we should bend over to accommodate users who are using crap tools and libraries. It's such a solved problem that you can even portably consume it using a state machine and validate it using vector intrinsics.
"Keys should be identifier-like"
Despite the fact that the concept of an "identifier" isn't a thing in TOML, I'll concede that in some situations this might be a concern. A reasonable example is using TOML in code generation contexts; if you used TOML keys to inform variable names historically you'd run into issues in many languages with non-ASCII characters, though this is no longer true. Even good old C++ supports unicode characters in identifiers on modern compilers.
...all of which is rendered moot by the fact that TOML supports hyphens in bare keys which are often invalid in identifier contexts, so this objection is a non-starter anyway.
"It complicates implementation"
It really doesn't. Many implementations will be able to leverage built-in helper functions or libraries for working with Unicode. For those that can't, I've put my money where my mouth is and implemented this as a proof-of-concept in my own TOML parser and I'm happy for my code to be used as a starting point:
is_unicode_XXXXX() codepoint identity functions (generated by a script)
Of course you might argue that simply accepting UTF-8 bytes from a TOML implementation is not an option for everyone, and you'd be right; there will always be situations where only ASCII makes sense (e.g. legacy codebases). I'd respond by pointing out that detecting non-ASCII characters in a character stream is laughably trivial. Applications requiring ASCII-only can easily enforce this themselves.
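For what it's worth, the ASCII-only check mentioned above fits in a few lines. This Python sketch uses a hypothetical helper name, `first_non_ascii`, and relies only on the fact that every byte of a non-ASCII character in UTF-8 has its high bit set.

```python
def first_non_ascii(data: bytes):
    """Return the byte offset of the first non-ASCII byte in `data`,
    or None if the input is pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:  # high bit set: part of a multi-byte UTF-8 sequence
            return offset
    return None
```

An application that must stay ASCII-only could run this over the raw document before parsing and reject the file with the reported offset.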