unicode-rs / unicode-security

Detect possible security problems with Unicode usage according to Unicode Technical Standard #39 rules.

Support obtaining Script_Extensions of a character #4

Closed Manishearth closed 4 years ago

Manishearth commented 4 years ago

This is needed for mixed script detection.

The easy way to do this is to store a slice of Script_Extensions values for each code point or range, but script extensions actually combine in only a limited number of ways (taken from here):

``` Adlam (Adlam), Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac (Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac), Ahom (Ahom), Anatolian_Hieroglyphs (Anatolian_Hieroglyphs), Arabic (Arabic), Arabic,Coptic (Arabic,Coptic), Arabic,Hanifi_Rohingya (Arabic,Hanifi_Rohingya), Arabic,Hanifi_Rohingya,Syriac,Thaana (Arabic,Hanifi_Rohingya,Syriac,Thaana), Arabic,Syriac (Arabic,Syriac), Arabic,Syriac,Thaana (Arabic,Syriac,Thaana), Arabic,Thaana (Arabic,Thaana), Armenian (Armenian), Armenian,Georgian (Armenian,Georgian), Avestan (Avestan), Balinese (Balinese), Bamum (Bamum), Bassa_Vah (Bassa_Vah), Batak (Batak), Bengali (Bengali), Bengali,Chakma,Syloti_Nagri (Bengali,Chakma,Syloti_Nagri), Bengali,Devanagari (Bengali,Devanagari), Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta), Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta), Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta), Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta), Bengali,Devanagari,Grantha,Kannada 
(Bengali,Devanagari,Grantha,Kannada), Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta), Bhaiksuki (Bhaiksuki), Bopomofo (Bopomofo), Bopomofo,Han (Bopomofo,Han), Bopomofo,Han,Hangul,Hiragana,Katakana (Bopomofo,Han,Hangul,Hiragana,Katakana), Bopomofo,Han,Hangul,Hiragana,Katakana,Yi (Bopomofo,Han,Hangul,Hiragana,Katakana,Yi), Brahmi (Brahmi), Braille (Braille), Buginese (Buginese), Buginese,Javanese (Buginese,Javanese), Buhid (Buhid), Buhid,Hanunoo,Tagalog,Tagbanwa (Buhid,Hanunoo,Tagalog,Tagbanwa), Canadian_Aboriginal (Canadian_Aboriginal), Carian (Carian), Caucasian_Albanian (Caucasian_Albanian), Chakma (Chakma), Chakma,Myanmar,Tai_Le (Chakma,Myanmar,Tai_Le), Cham (Cham), Cherokee (Cherokee), Common (Common), Coptic (Coptic), Cuneiform (Cuneiform), Cypriot (Cypriot), Cypriot,Linear_A,Linear_B (Cypriot,Linear_A,Linear_B), Cypriot,Linear_B (Cypriot,Linear_B), Cyrillic (Cyrillic), Cyrillic,Glagolitic (Cyrillic,Glagolitic), Cyrillic,Latin (Cyrillic,Latin), Cyrillic,Old_Permic (Cyrillic,Old_Permic), Deseret (Deseret), Devanagari (Devanagari), Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta), Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta), Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta), Devanagari,Dogra,Kaithi,Mahajani (Devanagari,Dogra,Kaithi,Mahajani), Devanagari,Grantha (Devanagari,Grantha), Devanagari,Grantha,Kannada (Devanagari,Grantha,Kannada), Devanagari,Grantha,Latin (Devanagari,Grantha,Latin), 
Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu (Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu), Devanagari,Nandinagari (Devanagari,Nandinagari), Devanagari,Sharada (Devanagari,Sharada), Devanagari,Tamil (Devanagari,Tamil), Dogra (Dogra), Duployan (Duployan), Egyptian_Hieroglyphs (Egyptian_Hieroglyphs), Elbasan (Elbasan), Elymaic (Elymaic), Ethiopic (Ethiopic), Georgian (Georgian), Georgian,Latin (Georgian,Latin), Glagolitic (Glagolitic), Gothic (Gothic), Grantha (Grantha), Grantha,Tamil (Grantha,Tamil), Greek (Greek), Gujarati (Gujarati), Gujarati,Khojki (Gujarati,Khojki), Gunjala_Gondi (Gunjala_Gondi), Gurmukhi (Gurmukhi), Gurmukhi,Multani (Gurmukhi,Multani), Han (Han), Han,Hiragana,Katakana (Han,Hiragana,Katakana), Hangul (Hangul), Hanifi_Rohingya (Hanifi_Rohingya), Hanunoo (Hanunoo), Hatran (Hatran), Hebrew (Hebrew), Hiragana (Hiragana), Hiragana,Katakana (Hiragana,Katakana), Imperial_Aramaic (Imperial_Aramaic), Inherited (Inherited), Inscriptional_Pahlavi (Inscriptional_Pahlavi), Inscriptional_Parthian (Inscriptional_Parthian), Javanese (Javanese), Kaithi (Kaithi), Kannada (Kannada), Kannada,Nandinagari (Kannada,Nandinagari), Katakana (Katakana), Kayah_Li (Kayah_Li), Kayah_Li,Latin,Myanmar (Kayah_Li,Latin,Myanmar), Kharoshthi (Kharoshthi), Khmer (Khmer), Khojki (Khojki), Khudawadi (Khudawadi), Lao (Lao), Latin (Latin), Latin,Mongolian (Latin,Mongolian), Lepcha (Lepcha), Limbu (Limbu), Linear_A (Linear_A), Linear_B (Linear_B), Lisu (Lisu), Lycian (Lycian), Lydian (Lydian), Mahajani (Mahajani), Makasar (Makasar), Malayalam (Malayalam), Mandaic (Mandaic), Manichaean (Manichaean), Marchen (Marchen), Masaram_Gondi (Masaram_Gondi), Medefaidrin (Medefaidrin), Meetei_Mayek (Meetei_Mayek), Mende_Kikakui (Mende_Kikakui), Meroitic_Cursive (Meroitic_Cursive), Meroitic_Hieroglyphs (Meroitic_Hieroglyphs), Miao (Miao), Modi (Modi), Mongolian (Mongolian), Mongolian,Phags_Pa (Mongolian,Phags_Pa), Mro (Mro), Multani (Multani), Myanmar (Myanmar), Nabataean (Nabataean), 
Nandinagari (Nandinagari), New_Tai_Lue (New_Tai_Lue), Newa (Newa), Nko (Nko), Nushu (Nushu), Nyiakeng_Puachue_Hmong (Nyiakeng_Puachue_Hmong), Ogham (Ogham), Ol_Chiki (Ol_Chiki), Old_Hungarian (Old_Hungarian), Old_Italic (Old_Italic), Old_North_Arabian (Old_North_Arabian), Old_Permic (Old_Permic), Old_Persian (Old_Persian), Old_Sogdian (Old_Sogdian), Old_South_Arabian (Old_South_Arabian), Old_Turkic (Old_Turkic), Oriya (Oriya), Osage (Osage), Osmanya (Osmanya), Pahawh_Hmong (Pahawh_Hmong), Palmyrene (Palmyrene), Pau_Cin_Hau (Pau_Cin_Hau), Phags_Pa (Phags_Pa), Phoenician (Phoenician), Psalter_Pahlavi (Psalter_Pahlavi), Rejang (Rejang), Runic (Runic), Samaritan (Samaritan), Saurashtra (Saurashtra), Sharada (Sharada), Shavian (Shavian), Siddham (Siddham), Sign_Writing (Sign_Writing), Sinhala (Sinhala), Sogdian (Sogdian), Sora_Sompeng (Sora_Sompeng), Soyombo (Soyombo), Sundanese (Sundanese), Syloti_Nagri (Syloti_Nagri), Syriac (Syriac), Tagalog (Tagalog), Tagbanwa (Tagbanwa), Tai_Le (Tai_Le), Tai_Tham (Tai_Tham), Tai_Viet (Tai_Viet), Takri (Takri), Tamil (Tamil), Tangut (Tangut), Telugu (Telugu), Thaana (Thaana), Thai (Thai), Tibetan (Tibetan), Tifinagh (Tifinagh), Tirhuta (Tirhuta), Ugaritic (Ugaritic), Unknown (Unknown), Vai (Vai), Wancho (Wancho), Warang_Citi (Warang_Citi), Yi (Yi), Zanabazar_Square (Zanabazar_Square) ```

We can very easily make a single enum variant for each combination, and programmatically generate an intersect() function that computes the intersection of two variants. This would be faster than storing and comparing slices.
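
A rough sketch of what that could look like, using a tiny hand-written subset of the combinations above. Everything here is an assumption for illustration: the `ScriptExt` name, the variant names, the extra `Empty` variant for "no script in common", and the choice to treat `Common` as compatible with everything. A real table would be generated from the Unicode data files and cover the full list.

```rust
/// One variant per distinct Script_Extensions combination.
/// Hypothetical subset; the real enum would be generated.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ScriptExt {
    /// Treated here as compatible with every script (an assumption).
    Common,
    Arabic,
    Coptic,
    Syriac,
    ArabicCoptic,
    ArabicSyriac,
    /// Illustrative extra variant: the two sets share no script.
    Empty,
}

impl ScriptExt {
    /// Intersection of two combinations, written as the kind of
    /// match table a generator could emit.
    pub fn intersect(self, other: Self) -> Self {
        use ScriptExt::*;
        match (self, other) {
            // Nothing intersects with the empty set.
            (Empty, _) | (_, Empty) => Empty,
            // Common does not narrow the other side.
            (Common, x) | (x, Common) => x,
            // Identical combinations are unchanged.
            (a, b) if a == b => a,
            // Pairs that overlap in exactly one script.
            (ArabicCoptic, ArabicSyriac) | (ArabicSyriac, ArabicCoptic) => Arabic,
            (Arabic, ArabicCoptic) | (ArabicCoptic, Arabic) => Arabic,
            (Arabic, ArabicSyriac) | (ArabicSyriac, Arabic) => Arabic,
            (Coptic, ArabicCoptic) | (ArabicCoptic, Coptic) => Coptic,
            (Syriac, ArabicSyriac) | (ArabicSyriac, Syriac) => Syriac,
            // Every remaining pair in this subset is disjoint.
            _ => Empty,
        }
    }
}
```

Folding `intersect` over the characters of an identifier and checking for `Empty` at the end would then give the mixed-script test, with no slice comparisons or allocation per character.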

(For performance, it would probably also be worth running these checks only on non-ASCII identifiers.)
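
The ASCII shortcut could be sketched as below. The function names and the `full_check` callback are hypothetical stand-ins for whatever the real Script_Extensions based check ends up being; only `str::is_ascii` is a standard library call. The shortcut is sound because every ASCII character is either Latin or Common, so an ASCII-only identifier cannot mix scripts.

```rust
/// Returns true when an identifier could possibly mix scripts and so
/// needs the full check. ASCII-only identifiers are skipped.
pub fn needs_mixed_script_check(ident: &str) -> bool {
    !ident.is_ascii()
}

/// Runs the expensive check only when the cheap ASCII test fails.
/// `full_check` is a placeholder for the real mixed-script detection.
pub fn is_suspicious(ident: &str, full_check: impl Fn(&str) -> bool) -> bool {
    needs_mixed_script_check(ident) && full_check(ident)
}
```

Because `&&` short-circuits, `full_check` is never called for ASCII-only identifiers, which are likely the common case in most code.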

Manishearth commented 4 years ago

I might create a separate unicode-script crate for the guts of this.

Manishearth commented 4 years ago

https://github.com/unicode-rs/unicode-script

Manishearth commented 4 years ago

https://github.com/unicode-rs/unicode-security/pull/6