unicode-rs / unicode-security

Detect possible security problems with Unicode usage according to Unicode Technical Standard #39 rules.

Add `is_potential_mixed_script_confusable_char` function #13

Closed crlf0710 closed 4 years ago

crlf0710 commented 4 years ago

This is a prototype of the data required by rustc's mixed_script_confusable lint. I'm not sure whether we need to give rustc more detailed data for diagnostics (e.g. which code point or which script a given code point is potentially confusable with).

Putting that aside, maybe we can have an early review first.

Implements the `is_potential_mixed_script_confusable_char` function.

A few other issues came up while adding the actual lint: #15, #16.

crlf0710 commented 4 years ago

There's a debug boolean in the Python script (search for "debug = False", https://github.com/unicode-rs/unicode-security/pull/13/files#diff-c87c196441d88317a5ea3bf97e9fde0aR536). If anyone's curious why these code points are confusable, you can toggle it and regenerate the table file, which will then include comments explaining why each code point is considered mixed-script confusable.
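For readers unfamiliar with this kind of generator toggle, here is a minimal sketch of how a debug flag could switch explanatory comments on in a generated table. It is not the code from the PR: `render_table`, the `CONFUSABLES` constant name, the table layout, and the reason strings are all hypothetical and only illustrate the idea.

```python
def render_table(entries, debug=False):
    """Render (code_point, reason) pairs as a Rust-style array literal.

    Hypothetical helper: the real script's output format may differ.
    """
    lines = ["const CONFUSABLES: &[char] = &["]
    for code_point, reason in entries:
        line = f"    '\\u{{{code_point:04X}}}',"
        if debug:
            # Only in debug mode: append the reason as a trailing comment.
            line += f" // {reason}"
        lines.append(line)
    lines.append("];")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [(0x0441, "confusable with U+0063 from another script")]
    print(render_table(sample, debug=True))
```

With `debug=False` the same call produces the table without the trailing comments, so the generated file stays compact for normal builds.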

crlf0710 commented 4 years ago

The Python calculation is a little complex, so I'll leave a brief description. The main idea is to build equivalence classes from confusables.txt, with the prototype as each class's representative element, and then compare every pair of elements within each class. If the two elements are from different scripts, both are marked as potentially mixed-script confusable. There is also some special handling for the case where the prototype consists of multiple code points.
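The description above can be sketched roughly as follows. This is a simplified illustration, not the script from the PR: `TOY_CONFUSABLES` and `TOY_SCRIPTS` are small hand-written stand-ins for confusables.txt and the script data, each code point is assumed to have exactly one script, and prototypes with multiple code points are simply skipped rather than given the special handling mentioned above.

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-in for confusables.txt: source character -> prototype string.
TOY_CONFUSABLES = {
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03f2": "c",  # GREEK LUNATE SIGMA SYMBOL
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
}

# Toy stand-in for script data (one script per character, an assumption).
TOY_SCRIPTS = {
    "c": "Latin", "\u0441": "Cyrillic", "\u03f2": "Greek",
    "a": "Latin", "\u0430": "Cyrillic",
    "g": "Latin", "\u0261": "Latin",
}


def potential_mixed_script_confusables(confusables, scripts):
    """Return the characters that share a class with another-script character."""
    # Step 1: group characters into equivalence classes keyed by prototype.
    classes = defaultdict(set)
    for source, prototype in confusables.items():
        if len(prototype) != 1:
            continue  # multi-code-point prototypes are not modelled here
        classes[prototype].update((prototype, source))

    # Step 2: compare every pair within a class; flag both on a script mismatch.
    flagged = set()
    for members in classes.values():
        for first, second in combinations(sorted(members), 2):
            if scripts[first] != scripts[second]:
                flagged.update((first, second))
    return flagged


if __name__ == "__main__":
    result = potential_mixed_script_confusables(TOY_CONFUSABLES, TOY_SCRIPTS)
    print(sorted(f"U+{ord(ch):04X}" for ch in result))
```

In this toy data the "g" class contains only Latin characters, so neither member is flagged, while every member of the "c" and "a" classes is.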

crlf0710 commented 4 years ago

@Manishearth Will add some comments, thanks!

crlf0710 commented 4 years ago

Added some comments. Also cc @pyfisch here.

crlf0710 commented 4 years ago

Adjusted the APIs and added a very simple test.

crlf0710 commented 4 years ago

Added more comments to document the details.