Closed crlf0710 closed 4 years ago
There's a debug flag in the Python script (search for "debug = False", https://github.com/unicode-rs/unicode-security/pull/13/files#diff-c87c196441d88317a5ea3bf97e9fde0aR536). If anyone's curious why particular code points are flagged, you can toggle it and regenerate the table file; the output will then include comments explaining why each code point is considered mixed-script confusable.
The Python calculation is a little complex, so I'll leave a brief description. The main idea is to build equivalence classes from confusables.txt, with the prototype as each class's representative element, and then compare every pair of elements within each class. If the two elements come from different scripts, both are marked as potentially mixed-script confusable. There is also some special handling for the case where the prototype has multiple code points.
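For readers who want the idea in code, here is a minimal sketch of the approach described above. It is not the script from this PR: the table is a hand-picked toy subset rather than parsed confusables.txt data, the script names are hard-coded, the names TOY_CONFUSABLES, TOY_SCRIPTS, build_classes and potential_mixed_script_confusables are made up for illustration, and it ignores multi-code-point prototypes and any special treatment of scripts such as Common or Inherited.

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-in for confusables.txt: source code point -> prototype.
# Hand-picked entries for illustration only; single-code-point prototypes.
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0251": "a",  # LATIN SMALL LETTER ALPHA
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
}

# Hard-coded script assignments (an assumption of this sketch; a real
# generator would read them from the Unicode data files).
TOY_SCRIPTS = {
    "a": "Latin", "o": "Latin", "g": "Latin",
    "\u0430": "Cyrillic", "\u0251": "Latin",
    "\u043e": "Cyrillic", "\u03bf": "Greek",
    "\u0261": "Latin",
}


def build_classes(confusables):
    """Group code points into equivalence classes keyed by prototype."""
    classes = defaultdict(set)
    for source, prototype in confusables.items():
        classes[prototype].add(prototype)  # the representative element
        classes[prototype].add(source)
    return classes


def potential_mixed_script_confusables(confusables, scripts):
    """Mark both members of every cross-script pair within a class."""
    marked = set()
    for members in build_classes(confusables).values():
        for left, right in combinations(sorted(members), 2):
            if scripts[left] != scripts[right]:
                marked.update((left, right))
    return marked


result = potential_mixed_script_confusables(TOY_CONFUSABLES, TOY_SCRIPTS)
for ch in sorted(result):
    print(f"U+{ord(ch):04X} ({TOY_SCRIPTS[ch]})")
```

In this toy data the "g" class contains only Latin members, so nothing in it is marked, while every member of the "a" and "o" classes is marked because each class spans more than one script.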
@Manishearth Will add some comments, thanks!
Added some comments. Also cc @pyfisch here.
Adjusted the APIs and added a very simple test.
Added more comments to document the details.
This is a prototype of the data required by rustc's mixed_script_confusable lint. I'm not sure whether we need to give rustc more detailed data for diagnostics (e.g. which code point or which script a given code point is potentially confusable with). Putting that aside, maybe we can have an early review first.
Implements the is_potential_mixed_script_confusable_char function.
A few other issues came up while adding the actual lint: #15, #16