skvadrik / re2c

Lexer generator for C, C++, D, Go, Haskell, Java, JS, OCaml, Python, Rust, V and Zig.
https://re2c.org

UTF8 encoding #250

Closed dtp555-1212 closed 5 years ago

dtp555-1212 commented 5 years ago

It appears there is a bug in the UTF8 encoding (at least for some characters)...

utf8bug.zip

The attached file contains a 2-byte UTF-8 character that should be encoded as C3 A9 (if you copy and paste the character into a file by itself and run od -t x1, you will see that it is indeed C3 A9). The C3 in the generated parser is correct, but the second target byte is generated as 83. I am using -8 on the command line. (If I am doing something wrong, or if there is a workaround, please let me know.)

skvadrik commented 5 years ago

Eh, it's a duplicate of #237. The problem is that the re2c -8 option does not give you source-level Unicode support: if you write characters like é in regexp definitions, re2c interprets them as a plain byte sequence (each byte as a single character), not as one Unicode symbol. You have to use "\u00e9" instead.

I realize this is very ugly, difficult to use, confusing and needs fixing.

What exactly happens in case of é and how re2c ends up with C3 83 byte sequence is explained in great detail in #237 (let me know if you need more clarifications).
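
For illustration, here is a minimal Python sketch of the effect described above. It assumes a simplified model in which each source byte of é is read as its own code point and then encoded as UTF-8; it is not re2c code, and the authoritative explanation is in #237.

```python
# The two bytes of "é" as stored in a UTF-8 encoded source file.
source_bytes = "é".encode("utf-8")

# Assumed model of the reported behaviour: every source byte is taken as a
# separate code point (U+00C3 and U+00A9) ...
misread = "".join(chr(b) for b in source_bytes)

# ... and each of those code points is then encoded as UTF-8 on its own.
double_encoded = misread.encode("utf-8")

print(source_bytes.hex(" "))    # c3 a9
print(double_encoded.hex(" "))  # c3 83 c2 a9
```

Under that model the generated lexer would expect C3 83 C2 A9 rather than C3 A9, which matches the C3 followed by 83 seen in the report.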

dtp555-1212 commented 5 years ago

Thanks for your reply. Obviously there are many Unicode values, not just the one I provided in my example. Do I understand you correctly, that I cannot provide the escaped hex byte sequence? I must use a unicode equivalent (in this case \u00e9 for the two byte sequence C3 A9). Is this understanding correct? Will this work for the 3 & 4 byte unicode values as well (and not only match the character 'visually' but have the expected byte count for utf-8)? With this understanding, it sounds like I will have to preprocess the input strings to substitute the appropriate unicode encoding prior to processing with re2c. Do you have a suggested tool for that?

Thanks again P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?


skvadrik commented 5 years ago

Do I understand you correctly, that I cannot provide the escaped hex byte sequence. I must use a unicode equivalent.

Yes, it won't work. If you try the regular expression \xC3\xA9 in -8 mode, re2c will interpret it as "code point C3 followed by code point A9", both of which translate into 2-byte code unit sequences in UTF-8. The same happens when you write é instead of \xC3\xA9 (only re2c doesn't have to unescape the bytes).

Will this work for the 3 & 4 byte unicode values as well?

Escaped sequences will work for all Unicode code points (re2c supports escapes with 2, 4 and 8 hexadecimal digits: \xhh, \uhhhh and \Uhhhhhhhh).
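
As a quick check of the byte counts asked about, the Python sketch below prints the UTF-8 length of one sample code point per length class. The sample characters are arbitrary picks for illustration, and the sketch only shows standard UTF-8 encoding, not re2c output.

```python
# One sample character for each UTF-8 length (1 to 4 bytes).
samples = ["A", "é", "€", "😀"]

def utf8_info(ch):
    """Return (code point, UTF-8 bytes) for a single character."""
    return ord(ch), ch.encode("utf-8")

for ch in samples:
    cp, encoded = utf8_info(ch)
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```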

it sounds like I will have to preprocess the input strings to substitute the appropriate unicode encoding prior to processing with re2c. Do you have a suggested tool for that?

No, unfortunately I don't. In a similar issue #235 we ended up with a pre-defined set of Unicode categories, but it's not good enough for your case.
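
For what it's worth, such a preprocessing step could be sketched in a few lines of Python. This is only an assumption about what a tool might look like: the function name escape_for_re2c is hypothetical, and the sketch naively rewrites every non-ASCII character in the text it is given, without parsing re2c syntax (it should only be applied to the literal text, not to a whole .re file).

```python
def escape_for_re2c(text):
    """Replace every non-ASCII character with a \\uhhhh or \\Uhhhhhhhh escape.

    Hypothetical helper: ASCII characters, including quotes and
    backslashes, are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII stays as it is
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)     # 4 hex digits
        else:
            out.append("\\U%08x" % cp)     # 8 hex digits
    return "".join(out)

print(escape_for_re2c('"Abadía" | "Åberg"'))
```

The escapes use a fixed number of hex digits, so a following letter such as the final a in "Abad\u00eda" is left as an ordinary character.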

P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?

I might be able to fix this in a few days. I have a sketch of the fix already, but it requires some prerequisite work in order to make it more elegant. It's a matter of using -8 in re2c's own lexer (which is written in re2c) and switching between two different lexers (ASCII and UTF8). The new behavior will be guarded by an option, something like --input-encoding <ascii | utf8>.

skvadrik commented 5 years ago

Pushed a fix: https://github.com/skvadrik/re2c/commit/29a6d01984f158f50406fa8246a24ee1a7246efe.

Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with option --input-encoding utf8. By default re2c assumes --input-encoding ascii; in future it may be possible to flip default behaviour (if it keeps confusing people).

It was necessary to use a new option instead of reusing -8, because one may wish to generate multiple lexers with different output encoding from the same set of UTF-8 encoded rules. That is, one may need to combine --input-encoding utf8 with one of the options -u, -x, -w, etc., and not necessarily -8.
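
To illustrate why input and output encodings are separate concerns, the Python sketch below shows the code units that the same source character é turns into under different target encodings. The pairing of options with encodings (-8 with UTF-8, -x with UTF-16, -u with UTF-32) is my reading of the re2c options and should be treated as an assumption; the code itself only uses Python's standard codecs.

```python
def code_units(text, encoding, unit_size):
    """Split the encoded form of `text` into code units of `unit_size` bytes."""
    raw = text.encode(encoding)
    return [int.from_bytes(raw[i:i + unit_size], "big")
            for i in range(0, len(raw), unit_size)]

# (label, Python codec, bytes per code unit); labels are assumed pairings.
targets = [
    ("-8 (UTF-8)", "utf-8", 1),
    ("-x (UTF-16)", "utf-16-be", 2),
    ("-u (UTF-32)", "utf-32-be", 4),
]

for label, codec, size in targets:
    units = code_units("é", codec, size)
    print(label, [hex(u) for u in units])
```

The rule text stays the same in every case; only the sequence of code units the lexer has to match changes.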

I deliberately chose a broad name for the new option (as opposed to a more precise --utf-8-literals or some such) so that it can be extended in future, for example to support UTF-8 encoded variable names (I do not see any good in that so far though).

skvadrik commented 5 years ago

@dtp555-1212 If you can, please send me your real-world test. If it's closed-source, I only need the grammar rules (though a working self-contained example is always great).

dtp555-1212 commented 5 years ago

Attached are two files: a list of words that have UTF-8 characters in them, and the rule to insert into the test program previously provided.

Hope that helps

Thanks


Abadía Åberg Abián Adám Ádám Adenízia Áder Adrián Ágatha Agustín Ahouré Aída Aïda Ajeé Akgül Alagía Alarcón Aléman Álex Alizé Alizée Álvarez Álvaro Amélie Anaís Anaïs Anastasákis Andéol András André Andréanne Andrée Andrés Andújar Anél Ángel Ángela Angélil Aníbal Aníta Añor Antónia António Aoás Apolónia Araújo Arbeláez Arcón Arévalo Áron Ásdís Auböck Augé Áurea Aurélie Aurélien Ávila Baláz Balázs Ballivián Bárbara Bård Barnabé Barré Barták Barteková Baugé Bäumer Béatrice Bécaud Bédard Bédié Begoña Béla Bélanger Belascoarán Belén Bělohlávek Beltré Benavídez Bendegúz Benítez Benjámin Benoît Beresová Bermúdez Bernabéu Bernárdez Béryl Beyoncé Böckler Boczkó Boglárka Bolaños Bolívar Bolükbasi Borgström Borlée Böröcz Botín Briceño Brücken Brzobohatý Bubeník Bublé Bühler Búranová Büsra Büthe Büyükakcay Byström Cabrnochová Cáceres Calderón Cañadilla Cañas Cañavate Canelón Cánepa Cantú Capó Cárdenas Carlén Carré Casañas Cassarà Cássia Castellaños Cátia Cazaubón Cebrián Cécile Cécilia Cédric Célestin Céline Célio Čepický Cerén César Céspedes Cézanne Chacón Chaunté Chávez Chihuán Chloé Chrétien Cibrián Cintrón Cíosóig Cissé Clélia Clémence Clément Clévenot Colón Compaoré Conceição Concepción Condé Córdoba Cordón Córdova Cortés Crépeau Cristóbal Cubillán Cué Cuétara Cynné Czaková Czigány Daabousová Dallapé Dániel Danièle Danté Dávalos Dávid DawnCheré Débora Déborah Déby Décary Delía Dembélé Dénes Dépré DerlisRamón Dési Desirée Desrosières Díaz Diémé Dièye Dilmé Djá Djénébou Dolínek Domínguez Donté Dóra Dorjsürengiin Dostál Duchonová Ducó Dueñas Dukátová Durán Dvorák Echávarri Echevarría Éder Édgar Ekateríni Élodie Elphége Émane Émile Emilíana Émilie Épangue Erdélyi Ergüven Érica Érick Érika España Espíndola Étienne Eugénie Eurén Éva Éve Évora Fabián Fábio Fabíola Fagúndez Fältskog Fariña Felício Félix Ferencová Fernández Flávia Flesjå Flóra Florenç Flügel Flüggen Foldházi François Françoise Frédéric Frédérick Frisé Fürste Gábor Gádorfalvi Gagné Gáliková Gándara Garbiñe García 
Garrigós Gascón Gáspár Gastón Gaudí Gélineau Geneviève Gérard Germán Gerónimo Géroudet Gévrise Giménez Ginóbili Gnassingbé Gomà Gómez Gonçalves Göncz González Göran Grátz Grégory Grévy Grimké Grimsbö Grímsson Grönberg Grövdal Guillén Güldeniz Gülec Gulldén Gümbel Gündegmaa Günes Günther Gutiérrez Güvenc Guzmán György Gyurcsány Häfner Háido Håkan Hambüchen Hamchétou Hárai Härstedt Håvard Havlát Héléna Hélene Hendrychová Hernán Hernández Hernangómez Hervé Hidvégi Higuaín Hinriksdóttir Hjálmsdóttir Holingerová Holló Horváth Hosnyánszky Hosszú Hrasnová Hristóforos Hrivnák Hufnágel Hultén Hüseyin Hypólito Hyryläinen Ibañez Ibargüen Idéhn Ié Illés Inácio Iñárritu Inés István Iván Jackée Jágr Jakubský Jámison Jämsä Janatková János Járóka Jaurès Jeremiáš Jérémie Jérémy Jérent Jérome Jéssica Jesús Jhené Jiménez Jiří João Joaquín Joëlle Jóhannsson Jonatán JonBenét Jördis Jorén Josée Josué Jóźwiak Juhász Júlio Júnior Juppé Jürgen Jurinová Kaboré Kafétien Kaká Kalovský Kapás Karlström Karolína Kasó Katarína Kätlin Kévin Kemrová Késely Kévin Khloé Khüderbulga Kléber Kléberson Klobucník Klocová Klöden Kněžínková Köbrich Köhler Kohlová Koňařík Kořán Kovács Kovágó Kozák Krejčí Kristián Krisztián Krizsán Krüger Kühn Kühne Kylliäinen Laanmäe Labbé Laferrière Laprovíttola Larrañaga László Lázaro Léa Léandre Lefèvre Leitón Lemprière León Lepistö Lerú Lidström Lillána Listopadová Liván Lívia Lloréns Lluís Löke Longová López Lotiès Lövnes Lü Lübeck Lucía Lückenkemper Luís Lukás Lukáš Lúthersdóttir Madaí Madarász Mäe Magallán Mägi Mahé Maíla Majdán Mäkelä Mandátová Mané Mangué Marc-André's Márcio Maréchal Marí Mária María Marílson Marín Mariño Mário Márk Marozsán Márquez Martí Martín Martínek Martínez Márton Massó Mätas Máté Matías Matús Maurício Máximo Meité Mélanie Mélina Méline Méndez Meroúsis Micheál Michèle Mihaíl Mijaín Miklós Millán Miltiádis Moisés Mokosová Molnár Mónaco Monáe Mónica Mónika Montaño Morén Mörk Mörö Müller Muñiz Muñoz Murúa Nádia Naïm Natália Negrón Németh Néstor 
Niccolò Nicolás Niinistö Nóbrega Noélie Noémie Nordén Núbia Nuñez Ódorová Öhrström Ólafur Óleo Opatrný Orbán Ordóñez Oréane Ortíz Óscar Ozlü Ozyüksel Pääbo Pabón Padacké Pádraig Páez Pajón Pál Palát Panayióta Pär Paré Pärt Patiño Patrícia Patrocínio Pattantyús Pavón Péché Péchoux Pelikán Peña Peñate Pénélope Péni Pépin Pérez Perón Pétain Petchamé Péter Pétervári Philémon Phúc Piétrus Pinzón Pité Pitkämäki Poésy Pokorný Polívka Póta Préval Prokopová Puigcercós Pürevjargalyn Putálová Quiñones Quiñonez Quintillà Quvenzhané Rácz Ramírez Raúl Řebíček Récsei Rédli Réka Rémi Renáta Rendón René Renée Rénelle Rentería Repcík Reséndiz Rézola Ribéry Richárd Ríga Robenílson Róbert Róchez Rocío Rodríguez Rogério Rolfö Román Romová Rónald Rosário Rubén Rühr Ruíz Sá Saborío Sagardía Sallói Salomé Salvadó Samassékou Sánchez Sandé Sándor Sardá Sárosi Sátila Saúl Saunière Savón Scalamandré Schächter Schäfer Schäuble Schlögl Schmiedlová Schön Schröder Schüpbac Schüssel Schütze Séamus Seán Sebastián Sébastien Sebestyén Sélom Sène Senyürek Seppälä Sepúlveda Sérgio Shkëlzen Sicília Silfvén Siljamäki Sinéad Sjåstad Sjöberg Sjödin Sjöström Skantár Söderberg Söderling Sofía Solé Solís Söllner Somorácz Sörenstam Ståhl Ståle Stefanídi Stéphane Stéphanie Strålman Strömberg Stübe Studničková Suárez Šuláková Süle Süleyman Švácha Svärd Svennerstål Szabián Szabó Szász Szilágyi Szomolányi Szücs Szwarnóg Taaramäe Tabaré Tainá Takács Támara Tamás Tarragó Tazegül Tcheuméo Tchórz Téa Tentóglou Teófilo Teré Tévez Thaísa Théo Théophile Thérèse Théry Thiéry Tímea Tió Todenhöfer Tomáš Tomorkhüleg Tõnu Topolánek Tormé Tornéus Törnroos Török Tórrez Tórtola Tóth Touadéra Tramèr Traoré Träsch Trévor Tsinopoúlou Túñez Türk Tüvshinbat Tüvshinbayar Üitümen Ünal Ungvári Urán Úrsula Üstündag Václav Valdés Valentín Valérian Valériane Välimäki Vallée Vámos Vásquez Vázquez Velázquez Veldáková Venyercsán Veréb Verón Verrasztó Víctor Victória Viktória Vilató Villaécija Villafría Vinícius Viñolas Vitória Vladimír Wallén 
Wálter Wanyá Wé Wéverton Wikström Xénia Yáñez Younés Zagré Zalánki Zelená Zélia Zoltán Zságer Zsófia

utfExamples = ( "Abadía"| "Åberg"| "Abián"| "Adám"| "Ádám"| "Adenízia"| "Áder"| "Adrián"| "Ágatha"| "Agustín"| "Ahouré"| "Aída"| "Aïda"| "Ajeé"| "Akgül"| "Alagía"| "Alarcón"| "Aléman"| "Álex"| "Alizé"| "Alizée"| "Álvarez"| "Álvaro"| "Amélie"| "Anaís"| "Anaïs"| "Anastasákis"| "Andéol"| "András"| "André"| "Andréanne"| "Andrée"| "Andrés"| "Andújar"| "Anél"| "Ángel"| "Ángela"| "Angélil"| "Aníbal"| "Aníta"| "Añor"| "Antónia"| "António"| "Aoás"| "Apolónia"| "Araújo"| "Arbeláez"| "Arcón"| "Arévalo"| "Áron"| "Ásdís"| "Auböck"| "Augé"| "Áurea"| "Aurélie"| "Aurélien"| "Ávila"| "Baláz"| "Balázs"| "Ballivián"| "Bárbara"| "Bård"| "Barnabé"| "Barré"| "Barták"| "Barteková"| "Baugé"| "Bäumer"| "Béatrice"| "Bécaud"| "Bédard"| "Bédié"| "Begoña"| "Béla"| "Bélanger"| "Belascoarán"| "Belén"| "Bělohlávek"| "Beltré"| "Benavídez"| "Bendegúz"| "Benítez"| "Benjámin"| "Benoît"| "Beresová"| "Bermúdez"| "Bernabéu"| "Bernárdez"| "Béryl"| "Beyoncé"| "Böckler"| "Boczkó"| "Boglárka"| "Bolaños"| "Bolívar"| "Bolükbasi"| "Borgström"| "Borlée"| "Böröcz"| "Botín"| "Briceño"| "Brücken"| "Brzobohatý"| "Bubeník"| "Bublé"| "Bühler"| "Búranová"| "Büsra"| "Büthe"| "Büyükakcay"| "Byström"| "Cabrnochová"| "Cáceres"| "Calderón"| "Cañadilla"| "Cañas"| "Cañavate"| "Canelón"| "Cánepa"| "Cantú"| "Capó"| "Cárdenas"| "Carlén"| "Carré"| "Casañas"| "Cassarà"| "Cássia"| "Castellaños"| "Cátia"| "Cazaubón"| "Cebrián"| "Cécile"| "Cécilia"| "Cédric"| "Célestin"| "Céline"| "Célio"| "Čepický"| "Cerén"| "César"| "Céspedes"| "Cézanne"| "Chacón"| "Chaunté"| "Chávez"| "Chihuán"| "Chloé"| "Chrétien"| "Cibrián"| "Cintrón"| "Cíosóig"| "Cissé"| "Clélia"| "Clémence"| "Clément"| "Clévenot"| "Colón"| "Compaoré"| "Conceição"| "Concepción"| "Condé"| "Córdoba"| "Cordón"| "Córdova"| "Cortés"| "Crépeau"| "Cristóbal"| "Cubillán"| "Cué"| "Cuétara"| "Cynné"| "Czaková"| "Czigány"| "Daabousová"| "Dallapé"| "Dániel"| "Danièle"| "Danté"| "Dávalos"| "Dávid"| "DawnCheré"| "Débora"| "Déborah"| "Déby"| "Décary"| "Delía"| "Dembélé"| "Dénes"| "Dépré"| 
"DerlisRamón"| "Dési"| "Desirée"| "Desrosières"| "Díaz"| "Diémé"| "Dièye"| "Dilmé"| "Djá"| "Djénébou"| "Dolínek"| "Domínguez"| "Donté"| "Dóra"| "Dorjsürengiin"| "Dostál"| "Duchonová"| "Ducó"| "Dueñas"| "Dukátová"| "Durán"| "Dvorák"| "Echávarri"| "Echevarría"| "Éder"| "Édgar"| "Ekateríni"| "Élodie"| "Elphége"| "Émane"| "Émile"| "Emilíana"| "Émilie"| "Épangue"| "Erdélyi"| "Ergüven"| "Érica"| "Érick"| "Érika"| "España"| "Espíndola"| "Étienne"| "Eugénie"| "Eurén"| "Éva"| "Éve"| "Évora"| "Fabián"| "Fábio"| "Fabíola"| "Fagúndez"| "Fältskog"| "Fariña"| "Felício"| "Félix"| "Ferencová"| "Fernández"| "Flávia"| "Flesjå"| "Flóra"| "Florenç"| "Flügel"| "Flüggen"| "Foldházi"| "François"| "Françoise"| "Frédéric"| "Frédérick"| "Frisé"| "Fürste"| "Gábor"| "Gádorfalvi"| "Gagné"| "Gáliková"| "Gándara"| "Garbiñe"| "García"| "Garrigós"| "Gascón"| "Gáspár"| "Gastón"| "Gaudí"| "Gélineau"| "Geneviève"| "Gérard"| "Germán"| "Gerónimo"| "Géroudet"| "Gévrise"| "Giménez"| "Ginóbili"| "Gnassingbé"| "Gomà"| "Gómez"| "Gonçalves"| "Göncz"| "González"| "Göran"| "Grátz"| "Grégory"| "Grévy"| "Grimké"| "Grimsbö"| "Grímsson"| "Grönberg"| "Grövdal"| "Guillén"| "Güldeniz"| "Gülec"| "Gulldén"| "Gümbel"| "Gündegmaa"| "Günes"| "Günther"| "Gutiérrez"| "Güvenc"| "Guzmán"| "György"| "Gyurcsány"| "Häfner"| "Háido"| "Håkan"| "Hambüchen"| "Hamchétou"| "Hárai"| "Härstedt"| "Håvard"| "Havlát"| "Héléna"| "Hélene"| "Hendrychová"| "Hernán"| "Hernández"| "Hernangómez"| "Hervé"| "Hidvégi"| "Higuaín"| "Hinriksdóttir"| "Hjálmsdóttir"| "Holingerová"| "Holló"| "Horváth"| "Hosnyánszky"| "Hosszú"| "Hrasnová"| "Hristóforos"| "Hrivnák"| "Hufnágel"| "Hultén"| "Hüseyin"| "Hypólito"| "Hyryläinen"| "Ibañez"| "Ibargüen"| "Idéhn"| "Ié"| "Illés"| "Inácio"| "Iñárritu"| "Inés"| "István"| "Iván"| "Jackée"| "Jágr"| "Jakubský"| "Jámison"| "Jämsä"| "Janatková"| "János"| "Járóka"| "Jaurès"| "Jeremiáš"| "Jérémie"| "Jérémy"| "Jérent"| "Jérome"| "Jéssica"| "Jesús"| "Jhené"| "Jiménez"| "Jiří"| "João"| "Joaquín"| "Joëlle"| "Jóhannsson"| 
"Jonatán"| "JonBenét"| "Jördis"| "Jorén"| "Josée"| "Josué"| "Jóźwiak"| "Juhász"| "Júlio"| "Júnior"| "Juppé"| "Jürgen"| "Jurinová"| "Kaboré"| "Kafétien"| "Kaká"| "Kalovský"| "Kapás"| "Karlström"| "Karolína"| "Kasó"| "Katarína"| "Kätlin"| "Kévin"| "Kemrová"| "Késely"| "Kévin"| "Khloé"| "Khüderbulga"| "Kléber"| "Kléberson"| "Klobucník"| "Klocová"| "Klöden"| "Kněžínková"| "Köbrich"| "Köhler"| "Kohlová"| "Koňařík"| "Kořán"| "Kovács"| "Kovágó"| "Kozák"| "Krejčí"| "Kristián"| "Krisztián"| "Krizsán"| "Krüger"| "Kühn"| "Kühne"| "Kylliäinen"| "Laanmäe"| "Labbé"| "Laferrière"| "Laprovíttola"| "Larrañaga"| "László"| "Lázaro"| "Léa"| "Léandre"| "Lefèvre"| "Leitón"| "Lemprière"| "León"| "Lepistö"| "Lerú"| "Lidström"| "Lillána"| "Listopadová"| "Liván"| "Lívia"| "Lloréns"| "Lluís"| "Löke"| "Longová"| "López"| "Lotiès"| "Lövnes"| "Lü"| "Lübeck"| "Lucía"| "Lückenkemper"| "Luís"| "Lukás"| "Lukáš"| "Lúthersdóttir"| "Madaí"| "Madarász"| "Mäe"| "Magallán"| "Mägi"| "Mahé"| "Maíla"| "Majdán"| "Mäkelä"| "Mandátová"| "Mané"| "Mangué"| "Marc-André's"| "Márcio"| "Maréchal"| "Marí"| "Mária"| "María"| "Marílson"| "Marín"| "Mariño"| "Mário"| "Márk"| "Marozsán"| "Márquez"| "Martí"| "Martín"| "Martínek"| "Martínez"| "Márton"| "Massó"| "Mätas"| "Máté"| "Matías"| "Matús"| "Maurício"| "Máximo"| "Meité"| "Mélanie"| "Mélina"| "Méline"| "Méndez"| "Meroúsis"| "Micheál"| "Michèle"| "Mihaíl"| "Mijaín"| "Miklós"| "Millán"| "Miltiádis"| "Moisés"| "Mokosová"| "Molnár"| "Mónaco"| "Monáe"| "Mónica"| "Mónika"| "Montaño"| "Morén"| "Mörk"| "Mörö"| "Müller"| "Muñiz"| "Muñoz"| "Murúa"| "Nádia"| "Naïm"| "Natália"| "Negrón"| "Németh"| "Néstor"| "Niccolò"| "Nicolás"| "Niinistö"| "Nóbrega"| "Noélie"| "Noémie"| "Nordén"| "Núbia"| "Nuñez"| "Ódorová"| "Öhrström"| "Ólafur"| "Óleo"| "Opatrný"| "Orbán"| "Ordóñez"| "Oréane"| "Ortíz"| "Óscar"| "Ozlü"| "Ozyüksel"| "Pääbo"| "Pabón"| "Padacké"| "Pádraig"| "Páez"| "Pajón"| "Pál"| "Palát"| "Panayióta"| "Pär"| "Paré"| "Pärt"| "Patiño"| "Patrícia"| "Patrocínio"| "Pattantyús"| "Pavón"| 
"Péché"| "Péchoux"| "Pelikán"| "Peña"| "Peñate"| "Pénélope"| "Péni"| "Pépin"| "Pérez"| "Perón"| "Pétain"| "Petchamé"| "Péter"| "Pétervári"| "Philémon"| "Phúc"| "Piétrus"| "Pinzón"| "Pité"| "Pitkämäki"| "Poésy"| "Pokorný"| "Polívka"| "Póta"| "Préval"| "Prokopová"| "Puigcercós"| "Pürevjargalyn"| "Putálová"| "Quiñones"| "Quiñonez"| "Quintillà"| "Quvenzhané"| "Rácz"| "Ramírez"| "Raúl"| "Řebíček"| "Récsei"| "Rédli"| "Réka"| "Rémi"| "Renáta"| "Rendón"| "René"| "Renée"| "Rénelle"| "Rentería"| "Repcík"| "Reséndiz"| "Rézola"| "Ribéry"| "Richárd"| "Ríga"| "Robenílson"| "Róbert"| "Róchez"| "Rocío"| "Rodríguez"| "Rogério"| "Rolfö"| "Román"| "Romová"| "Rónald"| "Rosário"| "Rubén"| "Rühr"| "Ruíz"| "Sá"| "Saborío"| "Sagardía"| "Sallói"| "Salomé"| "Salvadó"| "Samassékou"| "Sánchez"| "Sandé"| "Sándor"| "Sardá"| "Sárosi"| "Sátila"| "Saúl"| "Saunière"| "Savón"| "Scalamandré"| "Schächter"| "Schäfer"| "Schäuble"| "Schlögl"| "Schmiedlová"| "Schön"| "Schröder"| "Schüpbac"| "Schüssel"| "Schütze"| "Séamus"| "Seán"| "Sebastián"| "Sébastien"| "Sebestyén"| "Sélom"| "Sène"| "Senyürek"| "Seppälä"| "Sepúlveda"| "Sérgio"| "Shkëlzen"| "Sicília"| "Silfvén"| "Siljamäki"| "Sinéad"| "Sjåstad"| "Sjöberg"| "Sjödin"| "Sjöström"| "Skantár"| "Söderberg"| "Söderling"| "Sofía"| "Solé"| "Solís"| "Söllner"| "Somorácz"| "Sörenstam"| "Ståhl"| "Ståle"| "Stefanídi"| "Stéphane"| "Stéphanie"| "Strålman"| "Strömberg"| "Stübe"| "Studničková"| "Suárez"| "Šuláková"| "Süle"| "Süleyman"| "Švácha"| "Svärd"| "Svennerstål"| "Szabián"| "Szabó"| "Szász"| "Szilágyi"| "Szomolányi"| "Szücs"| "Szwarnóg"| "Taaramäe"| "Tabaré"| "Tainá"| "Takács"| "Támara"| "Tamás"| "Tarragó"| "Tazegül"| "Tcheuméo"| "Tchórz"| "Téa"| "Tentóglou"| "Teófilo"| "Teré"| "Tévez"| "Thaísa"| "Théo"| "Théophile"| "Thérèse"| "Théry"| "Thiéry"| "Tímea"| "Tió"| "Todenhöfer"| "Tomáš"| "Tomorkhüleg"| "Tõnu"| "Topolánek"| "Tormé"| "Tornéus"| "Törnroos"| "Török"| "Tórrez"| "Tórtola"| "Tóth"| "Touadéra"| "Tramèr"| "Traoré"| "Träsch"| "Trévor"| "Tsinopoúlou"| "Túñez"| 
"Türk"| "Tüvshinbat"| "Tüvshinbayar"| "Üitümen"| "Ünal"| "Ungvári"| "Urán"| "Úrsula"| "Üstündag"| "Václav"| "Valdés"| "Valentín"| "Valérian"| "Valériane"| "Välimäki"| "Vallée"| "Vámos"| "Vásquez"| "Vázquez"| "Velázquez"| "Veldáková"| "Venyercsán"| "Veréb"| "Verón"| "Verrasztó"| "Víctor"| "Victória"| "Viktória"| "Vilató"| "Villaécija"| "Villafría"| "Vinícius"| "Viñolas"| "Vitória"| "Vladimír"| "Wallén"| "Wálter"| "Wanyá"| "Wé"| "Wéverton"| "Wikström"| "Xénia"| "Yáñez"| "Younés"| "Zagré"| "Zalánki"| "Zelená"| "Zélia"| "Zoltán"| "Zságer"| "Zsófia");

skvadrik commented 5 years ago

Thanks! I added a test (it returns 0 for all the names on the list): https://github.com/skvadrik/re2c/blob/a00dc4871106ea39ef84f47bb840a018b17cea25/test/encodings/utf8_names.i8--input-encoding(utf8).re

There is an error in the name Ibargüen, it has a strange C2 byte right before C3 BC representing ü. It doesn't look like valid UTF-8 to me. After deleting C2 from both places everything works fine.
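
The stray byte can be reproduced with a small Python check. The byte string below is reconstructed from the description above (a C2 immediately before C3 BC) and is an assumption about the attachment, not a copy of it.

```python
def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "Ibargüen" with an extra C2 byte before the ü (C3 BC), as described.
bad = b"Ibarg\xc2\xc3\xbcen"

# Dropping the stray C2 leaves the ordinary two-byte encoding of ü.
good = bad.replace(b"\xc2\xc3\xbc", b"\xc3\xbc")

print(is_valid_utf8(bad), is_valid_utf8(good))
print(good.decode("utf-8"))
```

C2 is a lead byte that must be followed by a continuation byte in the range 80 to BF, and C3 is not one, which is why the original sequence is rejected.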

terpstra commented 5 years ago

This is great! When can we expect the next re2c release? I can't wait to re2c:include a Unicode character classes library and define character classes with literal UTF8 strings in them!

skvadrik commented 5 years ago

Soon, soon, really soon! I know I said this a couple of times before, such a shame... /o\ Realistically, not earlier than in 2 weeks, not later than the end of July. Thanks for asking, it gives me the inspiration to start writing the changelog. :)