Regex: unicode - Githubissues

kalessil / phpinspectionsea

A Static Code Analyzer for PHP (a PhpStorm/Idea Plugin)

https://plugins.jetbrains.com/plugin/7622?pr=phpStorm

Regex: unicode #1299

Open voku opened 5 years ago

voku commented 5 years ago

Description:

It would be nice if the plugin could warn us when a Unicode regex should be used instead of an ASCII one.

Example

ascii: https://regex101.com/r/lM5Zy8/1

(?<before>[^\w]|^)(?:onstop)(?<after>\s|[^\w]|$)

unicode: https://regex101.com/r/SJ4oDG/1

(?<before>[^\p{L}]|^)(?:onstop)(?<after>\s|[^\p{L}]|$)

Unicode Regex via PHP: https://youtu.be/VRiF9xd0YQc?t=3264

kalessil commented 5 years ago

Not really following the idea... The first one will work once the u-modifier is added (-> I can report \W or \w used without the u-modifier). The second one ignores e.g. umlauts and co.

What was the root cause?
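
To make the point about the first pattern concrete, here is a minimal PHP sketch. It assumes the subject string is UTF-8 and that the default "C" character tables are in effect; the variable names and the sample subject "äonstop" are illustrative and not taken from the issue.

```php
<?php
// Pattern from the issue; only the trailing modifier differs.
$ascii   = '/(?<before>[^\w]|^)(?:onstop)(?<after>\s|[^\w]|$)/';
$unicode = '/(?<before>[^\w]|^)(?:onstop)(?<after>\s|[^\w]|$)/u';

// "äonstop" encoded as UTF-8 (illustrative subject, not from the issue).
$subject = "\u{00E4}onstop";

// Without u the subject is scanned byte by byte. The second byte of "ä"
// is not a word character, so [^\w] accepts it and the pattern matches.
$matchesWithoutU = preg_match($ascii, $subject);

// With u the subject is decoded as UTF-8 and \w covers the umlaut,
// so "onstop" is no longer seen as a standalone word.
$matchesWithU = preg_match($unicode, $subject);

var_dump($matchesWithoutU, $matchesWithU);
```

The only difference between the two calls is the trailing u, which is what an inspection for \w or \W without the u-modifier would point at.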

voku commented 5 years ago

My problem seems to be the missing u-modifier. 😊 So maybe a hint for this would be a good idea?

Question: in the Yii string class (https://github.com/yiisoft/string/blob/master/src/StringHelper.php) I saw that they also use the u-modifier in cases where only "\s" is used. Do we really need to add the u-modifier in these cases too?

kalessil commented 5 years ago

Ideally, any usage of \s, \S, \w, \W, \d, \D needs to be backed with the u-modifier. Yes, if the app should work properly with Unicode (because of e.g. Arabic-Indic digits and the additional space characters in Unicode).
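
A small sketch of what this means in practice, assuming UTF-8 input and a PHP build where the u-modifier also enables Unicode properties for \d and \s (the behaviour PHP has had for a long time). The sample characters and variable names are illustrative.

```php
<?php
$digits = "\u{0663}\u{0664}"; // Arabic-Indic digits three and four
$space  = "\u{2003}";         // EM SPACE, a Unicode space separator

// Without u: \d and \s only know ASCII digits and ASCII whitespace.
$digitsWithoutU = preg_match('/^\d+$/', $digits);
$spaceWithoutU  = preg_match('/^\s$/', $space);

// With u: the same classes also cover Unicode digits and spaces.
$digitsWithU = preg_match('/^\d+$/u', $digits);
$spaceWithU  = preg_match('/^\s$/u', $space);

var_dump($digitsWithoutU, $digitsWithU, $spaceWithoutU, $spaceWithU);
```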

voku commented 5 years ago

Today I replaced some "[A-Z]" regexes with "\p{Lu}" (https://github.com/voku/portable-utf8/commit/98cca6387503f9c8b3bb54ed97350e9fac140941) so that I can process Unicode characters. Maybe hints like that would also be helpful?
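
For illustration, a minimal sketch of the difference between the two forms, assuming UTF-8 input. The sample strings are made up and the linked commit is not reproduced here.

```php
<?php
$umlauts = "\u{00C4}\u{00D6}"; // "ÄÖ", uppercase letters outside ASCII

// [A-Z] only covers the 26 ASCII uppercase letters.
$asciiRange = preg_match('/^[A-Z]+$/', $umlauts);

// \p{Lu} (Unicode "uppercase letter") together with u covers them too.
$unicodeProperty = preg_match('/^\p{Lu}+$/u', $umlauts);

// Plain ASCII input still matches the Unicode-aware form.
$plainAscii = preg_match('/^\p{Lu}+$/u', 'ABC');

var_dump($asciiRange, $unicodeProperty, $plainAscii);
```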

kalessil commented 5 years ago

Gladly. Which hint in which cases (we have multiple options now)? =)

samdark commented 5 years ago

So the checks are:

1. If there's a u modifier and \w is used, suggest using \pL because the former matches ASCII only.
2. If there's a u modifier and \d is used, suggest using \pN because the former matches ASCII only.
3. If there's a u modifier and \s is used, suggest using \pZ because the former matches ASCII only.
4. If there are Unicode characters, suggest adding u.
5. If there's no u modifier and there's a HEX value greater than \x{FF}, suggest adding u. It would error without it.
6. If there's no HEX value greater than \x{FF}, no \p* and no Unicode characters, suggest removing u. This one is harmless, so if it's not implemented, there's no big loss.
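
The two "suggest adding u" checks (non-ASCII characters in the pattern, or a \x{...} value above \x{FF}) could be sketched roughly as below. This is only an illustration in plain PHP: the function name needsUnicodeModifier is hypothetical, it is not the plugin's implementation, and it expects the pattern body without delimiters and modifiers.

```php
<?php
// Hypothetical helper, not part of the plugin: returns true when a pattern
// body that is used WITHOUT the u modifier looks like it needs one.
function needsUnicodeModifier(string $patternBody): bool
{
    // Any byte outside the ASCII range means the pattern itself
    // contains non-ASCII (e.g. UTF-8 encoded) characters.
    if (preg_match('/[^\x00-\x7F]/', $patternBody) === 1) {
        return true;
    }

    // Collect \x{H...} escapes and compare each value against 0xFF.
    // Simplification: an escaped backslash before "x{" is not handled.
    if (preg_match_all('/\\\\x\{([0-9A-Fa-f]+)\}/', $patternBody, $matches) > 0) {
        foreach ($matches[1] as $hex) {
            if (hexdec($hex) > 0xFF) {
                return true;
            }
        }
    }

    return false;
}

var_dump(needsUnicodeModifier('\x{100}')); // escape above \x{FF}
var_dump(needsUnicodeModifier('\w+'));     // plain ASCII pattern
```

A real inspection would also need to parse delimiters and existing modifiers, which this sketch leaves out.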

voku commented 5 years ago

@samdark \s also contains e.g. \t, but \p{Z} does not. https://regex101.com/r/fPsz0y/2
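
The same observation as a short PHP sketch (variable names are illustrative). A tab belongs to the Unicode category Cc (control), not to any of the Z (separator) categories, so \p{Z} is not a drop-in replacement for \s.

```php
<?php
// A tab is whitespace for \s ...
$tabMatchesS = preg_match('/\s/u', "\t");

// ... but it is a control character, not a separator, so \p{Z} skips it.
$tabMatchesZ = preg_match('/\p{Z}/u', "\t");

// An ordinary space is a separator (category Zs) and matches both.
$spaceMatchesS = preg_match('/\s/u', ' ');
$spaceMatchesZ = preg_match('/\p{Z}/u', ' ');

var_dump($tabMatchesS, $tabMatchesZ, $spaceMatchesS, $spaceMatchesZ);
```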

samdark commented 5 years ago

Actually, 6. isn't needed because u can simply mean the input is expected to be Unicode. Thus the check would be annoying.