perl11 / cperl

A perl5 with classes, types, compilable, company friendly, security
http://perl11.org/

Unicode mixed script confusables #229

Closed rurban closed 7 years ago

rurban commented 7 years ago

To avoid TR39 confusable security attacks, we add the following Unicode rules for identifiers and literals:

  1. The first Unicode script in an identifier that is neither Latin nor Common is the only one allowed. Any other script leads to a parser error.
  2. Additional Unicode scripts can and should be declared via `use utf8 'Greek', 'script-name2', ...` to prevent mixed script errors. This allows more scripts than rule 1. The declaration can be scoped to blocks.
  3. The 'Common' and 'Latin' scripts are always enabled and don't need to be declared.

See http://www.unicode.org/reports/tr39/#Mixed_Script_Detection

This holds for all identifiers (all names: package, gv, sub, variables) and literal numbers. The script name is returned by `Unicode::UCD::charscript($codepoint_as_uv)`.
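
The three rules above can be sketched in plain Perl on top of `Unicode::UCD::charscript`. This is only an illustration, not cperl's parser code: the helper name `mixed_script_error` and its calling convention are hypothetical, and the issue does not say how the `Inherited` script is treated, so the sketch does not special-case it.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper (not part of cperl): returns undef when the
# identifier passes the rules, otherwise an error message.
# @declared stands in for the scripts named in `use utf8 '...'`.
sub mixed_script_error {
    my ($ident, @declared) = @_;

    # Rule 3: Common and Latin are always enabled.
    # Rule 2: declared scripts are enabled as well.
    my %allowed = map { $_ => 1 } 'Common', 'Latin', @declared;

    my $first;    # Rule 1: the first undeclared script seen
    for my $ch (split //, $ident) {
        my $script = charscript(ord $ch) // 'Unknown';
        next if $allowed{$script};
        if (!defined $first) {
            $first = $script;    # the only extra script allowed from now on
            next;
        }
        next if $script eq $first;
        return "mixed script: $script after $first";
    }
    return;    # no violation found
}

# Greek alpha followed by Cyrillic a: two undeclared scripts.
my $err = mixed_script_error("\x{3b1}\x{430}");
print defined $err ? "error: $err\n" : "ok\n";

# Same identifier, with Greek declared: Cyrillic is now the first
# undeclared script, so it is accepted.
$err = mixed_script_error("\x{3b1}\x{430}", 'Greek');
print defined $err ? "error: $err\n" : "ok\n";
```

A real implementation would run inside the tokenizer and track the declared scripts per lexical scope; the sketch passes them in as a list to keep it self-contained.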

Currently there are 131 scripts: `perl -alne'/; (\w+) #/ && print $1' lib/unicore/Scripts.txt | sort -u > scripts.lst`

Ahom
Anatolian_Hieroglyphs
Arabic
Armenian
Avestan
Balinese
Bamum
Bassa_Vah
Batak
Bengali
Bopomofo
Brahmi
Braille
Buginese
Buhid
Canadian_Aboriginal
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Common
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Duployan
Egyptian_Hieroglyphs
Elbasan
Ethiopic
Georgian
Glagolitic
Gothic
Grantha
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hatran
Hebrew
Hiragana
Imperial_Aramaic
Inherited
Inscriptional_Pahlavi
Inscriptional_Parthian
Javanese
Kaithi
Kannada
Katakana
Kayah_Li
Kharoshthi
Khmer
Khojki
Khudawadi
Lao
Latin
Lepcha
Limbu
Linear_A
Linear_B
Lisu
Lycian
Lydian
Mahajani
Malayalam
Mandaic
Manichaean
Meetei_Mayek
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Modi
Mongolian
Mro
Multani
Myanmar
Nabataean
New_Tai_Lue
Nko
Ogham
Ol_Chiki
Old_Hungarian
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_South_Arabian
Old_Turkic
Oriya
Osmanya
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phoenician
Psalter_Pahlavi
Rejang
Runic
Samaritan
Saurashtra
Sharada
Shavian
Siddham
SignWriting
Sinhala
Sora_Sompeng
Sundanese
Syloti_Nagri
Syriac
Tagalog
Tagbanwa
Tai_Le
Tai_Tham
Tai_Viet
Takri
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Ugaritic
Vai
Warang_Citi
Yi
rurban commented 7 years ago

The remaining question is whether certain languages need aliases for sets of scripts, because they use multiple scripts by default: for example Japanese for Hiragana and Katakana (what about Kanji, i.e. Han?), and Korean for Hangul and Han (Chinese).
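
One possible shape for such aliases, purely as a sketch: a table mapping a language name to a list of script names, expanded before the scripts are enabled. The table name `%SCRIPT_ALIAS`, the helper `expand_scripts`, and the alias memberships (in particular including Han under Japanese) are assumptions for illustration, not decisions made in this issue.

```perl
use strict;
use warnings;

# Hypothetical alias table; names and memberships are assumptions.
my %SCRIPT_ALIAS = (
    Japanese => [qw(Hiragana Katakana Han)],
    Korean   => [qw(Hangul Han)],
);

# Expand aliases into plain script names, keeping first-seen order
# and dropping duplicates. Names without an alias pass through as-is.
sub expand_scripts {
    my %seen;
    return grep { !$seen{$_}++ }
           map  { @{ $SCRIPT_ALIAS{$_} // [$_] } } @_;
}

print join(', ', expand_scripts('Japanese', 'Greek')), "\n";
```

With such a table, a declaration naming a language would enable every script in its set, while plain script names would keep working unchanged.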