hops / pack2

MIT License

Properly handle UTF-8 characters #2

Open hops opened 4 years ago

hops commented 4 years ago

Currently we treat any byte outside of 0x20 - 0x7e as the mask character ?b. This is not ideal, as we already know we don't have to check ?a (which is, of course, also part of ?b). (Still more accurate than PACK, which uses ?s.)

Rust has native support for UTF-8 strings, but it's too slow for us. The current idea is to check whether at least one byte is outside of the ?a range and handle those inputs in a slow path. Once we have a validated UTF-8 character, we map it to its Unicode block.

Mapping a Unicode block to a mask is possible using custom charsets in combination with the --hex-charset flag.

Example input: Röschti

ö is part of the Latin-1 Supplement block (U+0080 to U+00FF). In UTF-8 encoding this block covers [c2,c3] [80-bf], therefore our custom charsets would be ?1 c2c3 and ?2 808182...bf

Full mask:

c2c3,808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf,?u?1?2?l?l?l?l?l
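To make the idea concrete, here is a minimal sketch of how the slow path could produce the mask line above. It is not pack2's actual code: the function names `needs_slow_path`, `hex` and `mask_line` are hypothetical, only the Latin-1 Supplement block is handled, and `std::str::from_utf8` is used for validation even though it may turn out to be too slow in practice.

```rust
/// Hypothetical fast-path check: true if at least one byte lies outside
/// the printable ASCII range 0x20 - 0x7e that ?a covers.
fn needs_slow_path(input: &[u8]) -> bool {
    !input.iter().all(|b| (0x20..=0x7e).contains(b))
}

/// Render bytes as one hex string, the form --hex-charset expects.
fn hex(bytes: impl Iterator<Item = u8>) -> String {
    bytes.map(|b| format!("{:02x}", b)).collect()
}

/// Hypothetical slow path: validate the input as UTF-8 and build a mask line.
/// Returns None for invalid UTF-8 or for characters this sketch does not cover.
fn mask_line(input: &[u8]) -> Option<String> {
    let text = std::str::from_utf8(input).ok()?;
    let mut mask = String::new();
    let mut uses_latin1 = false;
    for ch in text.chars() {
        match ch {
            'a'..='z' => mask.push_str("?l"),
            'A'..='Z' => mask.push_str("?u"),
            '0'..='9' => mask.push_str("?d"),
            // Remaining printable ASCII (space and punctuation).
            ' '..='~' => mask.push_str("?s"),
            // Latin-1 Supplement: always two bytes in UTF-8,
            // lead byte from ?1 and continuation byte from ?2.
            '\u{80}'..='\u{ff}' => {
                uses_latin1 = true;
                mask.push_str("?1?2");
            }
            // Other Unicode blocks are out of scope for this sketch.
            _ => return None,
        }
    }
    if uses_latin1 {
        // Custom charsets first, then the mask, separated by commas.
        Some(format!(
            "{},{},{}",
            hex(0xc2u8..=0xc3),
            hex(0x80u8..=0xbf),
            mask
        ))
    } else {
        Some(mask)
    }
}
```

Calling `mask_line` on the UTF-8 bytes of Röschti yields the two hex charsets followed by ?u?1?2?l?l?l?l?l, while pure ASCII input never needs the custom charsets and could stay on the existing fast path.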

We could even go further and detect that it is in the "letters" sub-block and only use that in our mask. This is a very basic example of how I think about this problem. I'm totally aware there will be cases which aren't this simple. This whole idea isn't set in stone and I'm open to any ideas and suggestions.
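The sub-block idea could be sketched roughly like this. The helper `two_byte_charsets` is hypothetical, and the sketch assumes the "letters" sub-block of Latin-1 Supplement is U+00C0 to U+00FF (which also contains the two operators × and ÷). It derives the lead-byte and continuation-byte charsets for any code point range that encodes to two bytes in UTF-8.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Hypothetical helper: collect the distinct UTF-8 lead and continuation
/// bytes for a range of code points, as hex strings for --hex-charset.
/// Returns None if any code point is invalid or not a two-byte sequence.
fn two_byte_charsets(range: RangeInclusive<u32>) -> Option<(String, String)> {
    let mut lead = BTreeSet::new();
    let mut cont = BTreeSet::new();
    for cp in range {
        let ch = char::from_u32(cp)?;
        let mut buf = [0u8; 4];
        let bytes = ch.encode_utf8(&mut buf).as_bytes();
        if bytes.len() != 2 {
            return None;
        }
        lead.insert(bytes[0]);
        cont.insert(bytes[1]);
    }
    // BTreeSet iterates in ascending order, so the hex output is sorted.
    let hex = |set: &BTreeSet<u8>| -> String {
        set.iter().map(|b| format!("{:02x}", b)).collect()
    };
    Some((hex(&lead), hex(&cont)))
}
```

For the whole block (0x80..=0xff) this gives c2c3 as the lead charset, while for the assumed letters range (0xc0..=0xff) the lead charset shrinks to c3 alone, halving the candidates for that position. Note that combining the two charsets independently can over-approximate a range that is not aligned to a lead byte, which is one of the cases that isn't this simple.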