no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Add ignoreCase flag #122

Open tjvr opened 5 years ago

tjvr commented 5 years ago

I agree; as I implied in the PR description, it feels weird leaving implicit how strings are handled.

I slightly prefer ignoreCase: true, because we can compile string literals into RegExps like /[Bb][Aa][Rr]/.

I could be persuaded to add an options dict, though.
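A minimal sketch of the compilation step mentioned above, assuming ASCII letters only (Unicode is not handled). The helper name literalToInsensitiveRegExp is hypothetical and is not part of moo:

```javascript
// Hypothetical helper (not part of moo): turn a string literal into a
// case-insensitive RegExp by expanding each ASCII letter into a class.
function literalToInsensitiveRegExp(literal) {
  const source = Array.from(literal, ch => {
    if (/[a-z]/i.test(ch)) {
      return `[${ch.toUpperCase()}${ch.toLowerCase()}]`
    }
    // Escape RegExp metacharacters so they match literally.
    return ch.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&')
  }).join('')
  return new RegExp(source)
}

const bar = literalToInsensitiveRegExp('bar')
console.log(bar.source)      // [Bb][Aa][Rr]
console.log(bar.test('BaR')) // true
```

Because the result carries no i flag, it could in principle sit next to case-sensitive rules in one combined pattern, which is presumably the appeal of this approach.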


nathan commented 5 years ago

we can compile string literals into RegExps like /[Bb][Aa][Rr]/

We can't do this properly for Unicode unless we include the entire case-folding map (which admittedly isn't terribly large). But I can see how having case-insensitive tokens in a case-sensitive language could be useful.

I think I keep imagining a user writing a long list of string literals like this:

moo.compile({
  if: 'if',
  else: 'else',
  then: 'then',
  ...
})

to make keywords (and having to write {ignoreCase: true, match: …} for every single one to make them case-insensitive), when really she'd just use metaprogramming / a type transform. So I think just a per-string ignoreCase is fine, as long as we don't require ignoreCase on string literals for which case is irrelevant. I want a user to be able to write this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: ',',
  semi: ';',
  lparen: '(',
  rparen: ')',
  lbrace: '{',
  rbrace: '}',
  lbracket: '[',
  rbracket: ']',
  and: '&&',
  or: '||',
  bitand: '&',
  bitor: '|',
  ...
})

instead of this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: {match: ',', ignoreCase: true},
  semi: {match: ';', ignoreCase: true},
  lparen: {match: '(', ignoreCase: true},
  rparen: {match: ')', ignoreCase: true},
  lbrace: {match: '{', ignoreCase: true},
  rbrace: {match: '}', ignoreCase: true},
  lbracket: {match: '[', ignoreCase: true},
  rbracket: {match: ']', ignoreCase: true},
  and: {match: '&&', ignoreCase: true},
  or: {match: '||', ignoreCase: true},
  bitand: {match: '&', ignoreCase: true},
  bitor: {match: '|', ignoreCase: true},
  ...
})

(Arguably, this would be better than either:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  op: {match: /[,;(){}[\]]|\|\|?|&&?/i, type: v => v},
})

but there are stylistic/interoperability reasons to prefer alphanumeric token types.)
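The "metaprogramming" route for keywords could look roughly like the sketch below. keywordRules is a hypothetical helper, and ignoreCase is the option being proposed in this thread, not a settled API:

```javascript
// Hypothetical helper: build one rule per keyword, each marked with the
// proposed (not yet settled) ignoreCase option.
function keywordRules(words) {
  const rules = {}
  for (const word of words) {
    rules[word] = {match: word, ignoreCase: true}
  }
  return rules
}

const rules = Object.assign(
  keywordRules(['if', 'else', 'then']),
  {comma: ',', semi: ';'} // punctuation stays a plain literal
)
console.log(Object.keys(rules)) // [ 'if', 'else', 'then', 'comma', 'semi' ]
```

The resulting object has the same shape as the hand-written rule lists above, so only the keyword entries carry the flag.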

tjvr commented 5 years ago

Do we need the case-folding map, or can we use the .toUpperCase() and .toLowerCase() built-ins?

Sounds like we're agreed about making ignoreCase per-string. I agree it shouldn't be required when it's irrelevant.


nathan commented 5 years ago

can we use the .toUpperCase() and .toLowerCase() built-ins?

Strictly speaking, toUpperCase and toLowerCase are insufficient; e.g., s should map to [Ssſ] (including U+017F LATIN SMALL LETTER LONG S, which neither built-in produces from s). See, e.g., http://unicode.org/faq/casemap_charprop.html#2

EDIT: a less esoteric example would be Greek: σ must match [σςΣ] (U+03A3 GREEK CAPITAL LETTER SIGMA, U+03C3 GREEK SMALL LETTER SIGMA, and U+03C2 GREEK SMALL LETTER FINAL SIGMA).
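The gap can be seen with a short check. The reachable helper is hypothetical and only collects what the two built-ins produce from a single character:

```javascript
// Which characters do the built-ins reach from a given character?
const reachable = c => new Set([c, c.toUpperCase(), c.toLowerCase()])

console.log(reachable('s').has('\u017F')) // false: ſ is never produced from s
console.log('\u017F'.toUpperCase())       // S: the mapping only goes one way
console.log(reachable('σ').has('ς'))      // false: final sigma is not produced
console.log('ς'.toUpperCase() === 'σ'.toUpperCase()) // true: both give Σ
```

So building a character class from c.toUpperCase() and c.toLowerCase() alone would miss ſ and ς.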

tjvr commented 5 years ago

Is that for /s/i, or only /s/ui?

If the latter, then I think it would be reasonable to only support {match: "s", ignoreCase: true} when the unicode flag is not used.


nathan commented 5 years ago

Is that for /s/i, or only /s/ui?

Only /s/ui matches ſ (/s/i does not). However, /σ/i and /σ/ui must both match σ, ς, and Σ. See the definition of Canonicalize in the spec (and Note 4 below it): with the u flag, /.../ui uses Unicode simple case folding; without it, /.../i canonicalizes through toUpperCase instead, and refuses to map a character outside the Basic Latin range (U+0000 through U+007F) into it.
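These cases can be checked directly in any engine that supports the u flag; the snippet below only exercises built-in RegExp behaviour:

```javascript
const longS = '\u017F' // ſ, LATIN SMALL LETTER LONG S

console.log(/s/iu.test(longS)) // true
console.log(/s/i.test(longS))  // false

console.log(/σ/i.test('ς'), /σ/i.test('Σ'))   // true true
console.log(/σ/iu.test('ς'), /σ/iu.test('Σ')) // true true
```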

If you're worried about the size of the map, it's large but not horribly so. Here are the simple and common mappings in CaseFolding.txt (all that we need to implement /i properly) in a fairly compact notation:

itt("AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZzµμÀàÁáÂâÃãÄäÅåÆæÇçÈèÉéÊêËëÌìÍíÎîÏïÐðÑñÒòÓóÔôÕõÖöØøÙùÚúÛûÜüÝýÞþĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸÿŹźŻżŽžſsƁɓƂƃƄƅƆɔƇƈƉɖƊɗƋƌƎǝƏəƐɛƑƒƓɠƔɣƖɩƗɨƘƙƜɯƝɲƟɵƠơƢƣƤƥƦʀƧƨƩʃƬƭƮʈƯưƱʊƲʋƳƴƵƶƷʒƸƹƼƽDŽdžDždžLJljLjljNJnjNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯDZdzDzdzǴǵǶƕǷƿǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠƞȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȺⱥȻȼȽƚȾⱦɁɂɃƀɄʉɅʌɆɇɈɉɊɋɌɍɎɏͅιͰͱͲͳͶͷͿϳΆάΈέΉήΊίΌόΎύΏώΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσΤτΥυΦφΧχΨψΩωΪϊΫϋςσϏϗϐβϑθϕφϖπϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰκϱρϴθϵεϷϸϹϲϺϻϽͻϾͼϿͽЀѐЁёЂђЃѓЄєЅѕІіЇїЈјЉљЊњЋћЌќЍѝЎўЏџАаБбВвГгДдЕеЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯяѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӏӁӂӃӄӅӆӇӈӉӊӋӌӍӎӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯԱաԲբԳգԴդԵեԶզԷէԸըԹթԺժԻիԼլԽխԾծԿկՀհՁձՂղՃճՄմՅյՆնՇշՈոՉչՊպՋջՌռՍսՎվՏտՐրՑցՒւՓփՔքՕօՖֆႠⴀႡⴁႢⴂႣⴃႤⴄႥⴅႦⴆႧⴇႨⴈႩⴉႪⴊႫⴋႬⴌႭⴍႮⴎႯⴏႰⴐႱⴑႲⴒႳⴓႴⴔႵⴕႶⴖႷⴗႸⴘႹⴙႺⴚႻⴛႼⴜႽⴝႾⴞႿⴟჀⴠჁⴡჂⴢჃⴣჄⴤჅⴥჇⴧჍⴭᏸᏰᏹᏱᏺᏲᏻᏳᏼᏴᏽᏵᲀвᲁдᲂоᲃсᲄтᲅтᲆъᲇѣᲈꙋᲐაᲑბᲒგᲓდᲔეᲕვᲖზᲗთᲘიᲙკᲚლᲛმᲜნᲝოᲞპᲟჟᲠრᲡსᲢტᲣუᲤფᲥქᲦღᲧყᲨშᲩჩᲪცᲫძᲬწᲭჭᲮხᲯჯᲰჰᲱჱᲲჲᲳჳᲴჴᲵჵᲶჶᲷჷᲸჸᲹჹᲺჺᲽჽᲾჾᲿჿḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕẛṡẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹỺỻỼỽỾỿἈἀἉἁἊἂἋἃἌἄἍἅἎἆἏἇἘἐἙἑἚἒἛἓἜἔἝἕἨἠἩἡἪἢἫἣἬἤἭἥἮἦἯἧἸἰἹἱἺἲἻἳἼἴἽἵἾἶἿἷὈὀὉὁὊὂὋὃὌὄὍὅὙὑὛὓὝὕὟὗὨὠὩὡὪὢὫὣὬὤὭὥὮὦὯὧᾸᾰᾹᾱᾺὰΆάιιῈὲΈέῊὴΉήῘῐῙῑῚὶΊίῨῠῩῡῪὺΎύῬῥῸὸΌόῺὼΏώΩωKkÅåℲⅎⅠⅰⅡⅱⅢⅲⅣⅳⅤⅴⅥⅵⅦⅶⅧⅷⅨⅸⅩⅹⅪⅺⅫⅻⅬⅼⅭⅽⅮⅾⅯⅿↃↄⒶⓐⒷⓑⒸⓒⒹⓓⒺⓔⒻⓕⒼⓖⒽⓗⒾⓘⒿⓙⓀⓚⓁⓛⓂⓜⓃⓝⓄⓞⓅⓟⓆⓠⓇⓡⓈⓢⓉⓣⓊⓤⓋⓥⓌⓦⓍⓧⓎⓨⓏⓩⰀⰰⰁⰱⰂⰲⰃⰳⰄⰴⰅⰵⰆⰶⰇⰷⰈⰸⰉⰹⰊⰺⰋⰻⰌⰼⰍⰽⰎⰾⰏⰿⰐⱀⰑⱁⰒⱂⰓⱃⰔⱄⰕⱅⰖⱆⰗⱇⰘⱈⰙⱉⰚⱊⰛⱋⰜⱌⰝⱍⰞⱎⰟⱏⰠⱐⰡⱑⰢⱒⰣⱓⰤⱔⰥⱕⰦⱖⰧⱗⰨⱘⰩⱙⰪⱚⰫⱛⰬⱜⰭⱝⰮⱞⱠⱡⱢɫⱣᵽⱤɽⱧⱨⱩⱪⱫⱬⱭɑⱮɱⱯɐⱰɒⱲⱳⱵⱶⱾȿⱿɀⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔⳕⳖⳗⳘⳙⳚⳛⳜⳝⳞⳟⳠⳡⳢⳣⳫⳬⳭⳮⳲⳳꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏꚐꚑꚒꚓꚔꚕꚖꚗꚘꚙꚚꚛꜢꜣꜤꜥꜦꜧꜨꜩꜪꜫꜬꜭꜮꜯꜲꜳꜴꜵꜶꜷꜸꜹꜺꜻꜼꜽꜾꜿꝀꝁꝂꝃꝄ
ꝅꝆꝇꝈꝉꝊꝋꝌꝍꝎꝏꝐꝑꝒꝓꝔꝕꝖꝗꝘꝙꝚꝛꝜꝝꝞꝟꝠꝡꝢꝣꝤꝥꝦꝧꝨꝩꝪꝫꝬꝭꝮꝯꝹꝺꝻꝼꝽᵹꝾꝿꞀꞁꞂꞃꞄꞅꞆꞇꞋꞌꞍɥꞐꞑꞒꞓꞖꞗꞘꞙꞚꞛꞜꞝꞞꞟꞠꞡꞢꞣꞤꞥꞦꞧꞨꞩꞪɦꞫɜꞬɡꞭɬꞮɪꞰʞꞱʇꞲʝꞳꭓꞴꞵꞶꞷꞸꞹꭰᎠꭱᎡꭲᎢꭳᎣꭴᎤꭵᎥꭶᎦꭷᎧꭸᎨꭹᎩꭺᎪꭻᎫꭼᎬꭽᎭꭾᎮꭿᎯꮀᎰꮁᎱꮂᎲꮃᎳꮄᎴꮅᎵꮆᎶꮇᎷꮈᎸꮉᎹꮊᎺꮋᎻꮌᎼꮍᎽꮎᎾꮏᎿꮐᏀꮑᏁꮒᏂꮓᏃꮔᏄꮕᏅꮖᏆꮗᏇꮘᏈꮙᏉꮚᏊꮛᏋꮜᏌꮝᏍꮞᏎꮟᏏꮠᏐꮡᏑꮢᏒꮣᏓꮤᏔꮥᏕꮦᏖꮧᏗꮨᏘꮩᏙꮪᏚꮫᏛꮬᏜꮭᏝꮮᏞꮯᏟꮰᏠꮱᏡꮲᏢꮳᏣꮴᏤꮵᏥꮶᏦꮷᏧꮸᏨꮹᏩꮺᏪꮻᏫꮼᏬꮽᏭꮾᏮꮿᏯAaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz𐐀𐐨𐐁𐐩𐐂𐐪𐐃𐐫𐐄𐐬𐐅𐐭𐐆𐐮𐐇𐐯𐐈𐐰𐐉𐐱𐐊𐐲𐐋𐐳𐐌𐐴𐐍𐐵𐐎𐐶𐐏𐐷𐐐𐐸𐐑𐐹𐐒𐐺𐐓𐐻𐐔𐐼𐐕𐐽𐐖𐐾𐐗𐐿𐐘𐑀𐐙𐑁𐐚𐑂𐐛𐑃𐐜𐑄𐐝𐑅𐐞𐑆𐐟𐑇𐐠𐑈𐐡𐑉𐐢𐑊𐐣𐑋𐐤𐑌𐐥𐑍𐐦𐑎𐐧𐑏𐒰𐓘𐒱𐓙𐒲𐓚𐒳𐓛𐒴𐓜𐒵𐓝𐒶𐓞𐒷𐓟𐒸𐓠𐒹𐓡𐒺𐓢𐒻𐓣𐒼𐓤𐒽𐓥𐒾𐓦𐒿𐓧𐓀𐓨𐓁𐓩𐓂𐓪𐓃𐓫𐓄𐓬𐓅𐓭𐓆𐓮𐓇𐓯𐓈𐓰𐓉𐓱𐓊𐓲𐓋𐓳𐓌𐓴𐓍𐓵𐓎𐓶𐓏𐓷𐓐𐓸𐓑𐓹𐓒𐓺𐓓𐓻𐲀𐳀𐲁𐳁𐲂𐳂𐲃𐳃𐲄𐳄𐲅𐳅𐲆𐳆𐲇𐳇𐲈𐳈𐲉𐳉𐲊𐳊𐲋𐳋𐲌𐳌𐲍𐳍𐲎𐳎𐲏𐳏𐲐𐳐𐲑𐳑𐲒𐳒𐲓𐳓𐲔𐳔𐲕𐳕𐲖𐳖𐲗𐳗𐲘𐳘𐲙𐳙𐲚𐳚𐲛𐳛𐲜𐳜𐲝𐳝𐲞𐳞𐲟𐳟𐲠𐳠𐲡𐳡𐲢𐳢𐲣𐳣𐲤𐳤𐲥𐳥𐲦𐳦𐲧𐳧𐲨𐳨𐲩𐳩𐲪𐳪𐲫𐳫𐲬𐳬𐲭𐳭𐲮𐳮𐲯𐳯𐲰𐳰𐲱𐳱𐲲𐳲𑢠𑣀𑢡𑣁𑢢𑣂𑢣𑣃𑢤𑣄𑢥𑣅𑢦𑣆𑢧𑣇𑢨𑣈𑢩𑣉𑢪𑣊𑢫𑣋𑢬𑣌𑢭𑣍𑢮𑣎𑢯𑣏𑢰𑣐𑢱𑣑𑢲𑣒𑢳𑣓𑢴𑣔𑢵𑣕𑢶𑣖𑢷𑣗𑢸𑣘𑢹𑣙𑢺𑣚𑢻𑣛𑢼𑣜𑢽𑣝𑢾𑣞𑢿𑣟𖹀𖹠𖹁𖹡𖹂𖹢𖹃𖹣𖹄𖹤𖹅𖹥𖹆𖹦𖹇𖹧𖹈𖹨𖹉𖹩𖹊𖹪𖹋𖹫𖹌𖹬𖹍𖹭𖹎𖹮𖹏𖹯𖹐𖹰𖹑𖹱𖹒𖹲𖹓𖹳𖹔𖹴𖹕𖹵𖹖𖹶𖹗𖹷𖹘𖹸𖹙𖹹𖹚𖹺𖹛𖹻𖹜𖹼𖹝𖹽𖹞𖹾𖹟𖹿𞤀𞤢𞤁𞤣𞤂𞤤𞤃𞤥𞤄𞤦𞤅𞤧𞤆𞤨𞤇𞤩𞤈𞤪𞤉𞤫𞤊𞤬𞤋𞤭𞤌𞤮𞤍𞤯𞤎𞤰𞤏𞤱𞤐𞤲𞤑𞤳𞤒𞤴𞤓𞤵𞤔𞤶𞤕𞤷𞤖𞤸𞤗𞤹𞤘𞤺𞤙𞤻𞤚𞤼𞤛𞤽𞤜𞤾𞤝𞤿𞤞𞥀𞤟𞥁𞤠𞥂𞤡𞥃ẞßᾈᾀᾉᾁᾊᾂᾋᾃᾌᾄᾍᾅᾎᾆᾏᾇᾘᾐᾙᾑᾚᾒᾛᾓᾜᾔᾝᾕᾞᾖᾟᾗᾨᾠᾩᾡᾪᾢᾫᾣᾬᾤᾭᾥᾮᾦᾯᾧᾼᾳῌῃῼῳ").chunksOf(2).toObject()

That comes out to <6KB gzipped UTF-8.

EDIT: That can actually probably be made much smaller, because toUpperCase can recover the majority of these. I will see what I can do.

tjvr commented 5 years ago

I added an ignoreCase option for literals. We don't yet allow it to be used in isolation; that can come in a future PR. For now, you can write:

moo.compile({
  digits: /[0-9]+/,
  cow: {match: "cow", ignoreCase: true},
})

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.


nathan commented 5 years ago

Here are 809 bytes (gzipped) that generate the full map:

function d(r){for(var a=Array.from(r),o=[],i=0;i<a.length;){var t=a[i++],e=-1,f=a[i]&&a[i].charCodeAt(0);if(f<64){e=31&f;var n=a[++i]&&a[i].charCodeAt(0);n<64&&(e|=(31&n)<<5,++i)}if(o.push(t),-1<e)for(var d=0,A=t.codePointAt(0);d<=e>>1;++d)A+=1+(1&e),o.push(String.fromCodePoint(A))}return o}for(var CASE_FOLD={},i=(a=d('A0!À*!Ø*Ā-!IJ#Ĺ-Ŋ-!Ź#Ɓ Ƅ!Ƈ!Ɗ Ǝ$Ɠ Ɩ"Ɯ Ɵ Ƣ#Ƨ!Ƭ!Ư!Ʋ Ƶ!ƸƼDŽ LJ NJ Ǎ-Ǟ/DZ Ǵ!Ƿ Ǻ7!Ⱥ Ƚ Ɂ!Ʉ"Ɉ%Ͱ!ͶͿΆ!Ή Ό!Ώ!Β<Σ.ϏϘ5ϴϷ!ϺϽ"#Ѡ?Ҋ5!Ӂ+Ӑ="Ա("Ⴀ("ჇჍᲐ2"Ჽ"Ḁ3$ẞ?"Ἀ,Ἐ(Ἠ,Ἰ,Ὀ(Ὑ%Ὠ,ᾈ,ᾘ,ᾨ,Ᾰ&Ὲ&Ῐ$Ῠ&Ὸ&ΩK ℲⅠ<ↃⒶ0!Ⰰ:"Ⱡ!Ᵽ Ⱨ%Ɱ"ⱲⱵⱾ"Ⲃ?"Ⳬ!ⳲꙀ+!Ꚁ9Ꜣ+Ꜳ;!Ꝺ#Ꝿ\'Ꞌ!Ꞑ!Ꞗ3Ɜ$Ʞ&Ꞷ!A0!𐐀,"𐒰$"𐲀"#𑢠<!𖹀<!𞤀 "')).length;i--;)CASE_FOLD[a[i]]=a[i].toLowerCase();var a=d("µſͅςϐ ϕ ϰ ϵᏸ(ᲀ.ẛιꭰ<$"),b=d("μsισβθφπκρεᏰ(в!ос тъѣꙋṡιᎠ<$");for(i=a.length;i--;)CASE_FOLD[a[i]]=b[i];

And a gist with the code I used to generate them.

EDIT: This relies on Array.from(String), String.fromCodePoint, and String.prototype.codePointAt, all of which we will likely need for Unicode support in other places too, and all of which have fairly concise shims.
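A short illustration of why the code-point-aware built-ins matter here: several of the mapped letters (Deseret, for example) lie outside the Basic Multilingual Plane and occupy two UTF-16 code units each.

```javascript
const s = '\u{10400}a' // DESERET CAPITAL LETTER LONG I, then 'a'

console.log(s.length)                      // 3 (UTF-16 code units)
console.log(Array.from(s).length)          // 2 (code points)
console.log(s.charCodeAt(0).toString(16))  // d801 (only the high surrogate)
console.log(s.codePointAt(0).toString(16)) // 10400
console.log(String.fromCodePoint(0x10428) === '\u{10428}') // true
```

Indexing or charCodeAt alone would split such characters in half, so a decoder like the one above needs to iterate by code point.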

@tjvr

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Not sure what you mean by this.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.

I believe it actually is both necessary and sufficient. Since every character not in CaseFolding.txt maps to itself, we can just exhaustively check the ones in CaseFolding.txt:

> itt.entries(CASE_FOLD).flatten().every(c => c.toLowerCase() !== c.toUpperCase())
true

EDIT: even more exhaustive:

// CASE_FOLD_CPS is a set of every code point in CaseFolding.txt (including T and F mappings)
> itt.range(0x10FFFF).every(c =>
... CASE_FOLD_CPS.has(c) ||
... String.fromCodePoint(c).toUpperCase() === String.fromCodePoint(c).toLowerCase())
true
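For illustration, a check of this kind might look like the sketch below. It assumes the test compares the upper- and lower-cased forms of the literal, which is what the snippets above verify; the actual check in the PR is not shown in this thread, and caseMatters is a hypothetical name:

```javascript
// Hypothetical check, assumed here: case is relevant for a literal when
// its upper- and lower-cased forms differ.
function caseMatters(literal) {
  return literal.toLowerCase() !== literal.toUpperCase()
}

console.log(caseMatters('&&'))  // false
console.log(caseMatters('cow')) // true
console.log(caseMatters('ς'))   // true
```

With such a check, punctuation literals need no ignoreCase annotation at all.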

tjvr commented 5 years ago

Nice one! Would you like to PR that? (Perhaps unminified?)

I believe it actually is both necessary and sufficient.

Nice -- thanks for comprehensively confirming that.

Regarding my comment about keywords: I have two remaining concerns with this approach. One is that it feels a little bit "magic"; I'm not convinced it's easy to explain when ignoreCase should be used, given the behaviour in this PR. Perhaps it would be better to make things explicit, and go with the options dictionary you suggested originally.

And in particular, now that keywords() is a function by itself, combining it with ignoreCase produces potentially counter-intuitive behaviour:

const lexer = moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

lexer.reset("foo IF")
// word foo
// ws
// word IF

nathan commented 5 years ago

Those are good points. Perhaps we should separate the /i changes into their own PR, since /u doesn't have these issues?

Nice one! Would you like to PR that? (Perhaps unminified?)

I can work on a PR (definitely unminified) for unicode ignoreCase after we sort out the design we want for /i / ignoreCase.

tjvr commented 5 years ago

I was thinking the same thing. I opened #123, which adds only the unicode flag.

Once that's merged, I'll rebase this PR, to keep the conversation about ignoreCase in one place... although this is getting quite long :grimacing: In conclusion, do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

nathan commented 5 years ago

do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

I think they both make the keywords scenario pretty confusing and unintuitive. Neither this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

nor this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
}, {ignoreCase: true})

actually does what you'd expect (i.e., treats "If" and "iF" and "IF" as if tokens).
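One way to get the expected behaviour without either option is to normalise the matched text before the keyword lookup. The sketch below is not from this thread and does not use moo's keywords(); caseInsensitiveKeywords is a hypothetical stand-in, and it relies on toLowerCase, so it inherits the Unicode caveats discussed earlier:

```javascript
// Hypothetical stand-in for a keyword lookup; not moo's keywords().
function caseInsensitiveKeywords(defs) {
  const types = new Map()
  for (const type of Object.keys(defs)) {
    // Each entry may be a single keyword or an array of keywords.
    for (const word of [].concat(defs[type])) {
      types.set(word.toLowerCase(), type)
    }
  }
  // Returns the keyword type, or undefined when the text is not a keyword.
  return text => types.get(text.toLowerCase())
}

const type = caseInsensitiveKeywords({if: 'if', else: 'else'})
console.log(type('If'), type('iF'), type('IF')) // if if if
console.log(type('foo'))                        // undefined
```

A function like this could be supplied as the type of the word rule in place of moo.keywords(...), assuming the lexer falls back to the rule's own name when the function returns undefined; that fallback is an assumption here.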

TheKnarf commented 5 years ago

Any status on this? I'm trying to write a SQL parser and therefore need case-insensitive keyword parsing.

jdoklovic commented 4 years ago

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i

any word on this?

nathan commented 4 years ago

@jdoklovic

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

If you don't care about Unicode, you can use something like this to transform the RegExp:

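function insensitive(r) {
  // Expand each ASCII letter into both cases, optionally wrapped in a and b.
  const esc = (s, a = '', b = '') => s.replace(/[a-z]/gi, c =>
    `${a}${c.toUpperCase()}${c.toLowerCase()}${b}`)
  // Group 1: an escape sequence. Group 2: a whole character class.
  const PART = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)|(\[(?:\\.|[^\]])*\])/
  const ESCAPE = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)/
  // split() with two capture groups yields [text, escape, class, text, ...].
  // Caveat: a range such as [a-z] inside a class is not handled (it becomes [Aa-Zz]).
  return new RegExp(r.source.split(PART).map((s, i) =>
    i % 3 === 1 ? s : // escape sequence: keep as-is
    i % 3 ? s && s.split(ESCAPE).map((t, j) => // class: add both cases in place
      j % 2 ? t : esc(t)).join('') :
    esc(s, '[', ']')).join(''), r.flags.replace('i', '')) // plain text: one class per letter
}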
insensitive(/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i)
// => /[Ww][Aa][Ss]\s+[Nn][Oo][Tt]\s+[Ii][Nn]|[Ii][Ss]\s+[Nn][Oo][Tt]|[Nn][Oo][Tt]\s+[Ii][Nn]|[Ww][Aa][Ss]\s+[Nn][Oo][Tt]|[Ww][Aa][Ss]\s+[Ii][Nn]|[Ii][Ss]|[Ii][Nn]|[Ww][Aa][Ss]|[Cc][Hh][Aa][Nn][Gg][Ee][Dd]/