gskinner / regexr

RegExr is an HTML/JS-based tool for creating, testing, and learning about Regular Expressions.
http://regexr.com/
GNU General Public License v3.0
9.89k stars, 972 forks

Details tab seems to be mis-highlighting groups #269

Open claudiobrandt opened 6 years ago

claudiobrandt commented 6 years ago

Hi, thanks for this tool! regexr.com/3rg2m While the Replace tab shows that the regular expression works fine, the Details tab highlights the groups so that one character ("-"), which should belong to the last group, appears as part of the to-be-replaced text ("×-" instead of only "×").

gskinner commented 6 years ago

This seems to be related to the high-ASCII × character you're using (code=158). Replacing it with z in both the Expression and the Text fixes the issue. A quick test shows that it also happens with other high-ASCII characters like ÿ, ç, and ©.

We'll have to do some testing and see whether it's something we're doing or something inherent in PCRE.

gskinner commented 6 years ago

Here's a more concise reproduction of the issue: https://regexr.com/3rggr

wdamien commented 6 years ago

This comes back to how PHP's PCRE engine is implemented. When returning the position of a match, it uses the byte offset rather than the character index. That means that on UTF-8 encoded text the index can be off by one or more, depending on the multibyte characters involved.
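
To illustrate the mismatch described above, here is a minimal JavaScript sketch. The helper name `utf8ByteOffset` is hypothetical and not part of RegExr; it assumes a runtime with a global `TextEncoder` (modern browsers, recent Node.js) and an index that does not split a surrogate pair.

```javascript
// Hypothetical helper (not RegExr code): returns the UTF-8 byte offset
// of the character that sits at a given JavaScript string index.
function utf8ByteOffset(text, charIndex) {
  // Encode everything before the index and count the resulting bytes.
  return new TextEncoder().encode(text.slice(0, charIndex)).length;
}

const ascii = "az-b";
const highAscii = "a×-b";

// "-" sits at string index 2 in both strings...
console.log(utf8ByteOffset(ascii, ascii.indexOf("-")));         // 2
// ...but "×" takes two bytes in UTF-8, so the byte offset is one higher.
console.log(utf8ByteOffset(highAscii, highAscii.indexOf("-"))); // 3
```

A byte offset of 3 used directly as a string index would point one character too far, which matches the shifted highlighting reported in this issue.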

Example using a multibyte Chinese character: https://regexr.com/3rlqh

For reference: https://bugs.php.net/bug.php?id=37391

So we'll have to look into manually setting the match offset.
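
One possible way to set the match offset manually, sketched in JavaScript. This is an assumption about how a fix could look, not the approach RegExr actually took; `byteOffsetToStringIndex` is a hypothetical name, and the sketch assumes the text is well-formed and that the server reports UTF-8 byte offsets.

```javascript
// Hypothetical helper (not RegExr code): maps a UTF-8 byte offset, as
// reported by a byte-oriented engine, to an index into a JavaScript string.
function byteOffsetToStringIndex(text, byteOffset) {
  let bytes = 0; // UTF-8 bytes consumed so far
  let index = 0; // UTF-16 code units consumed so far
  for (const ch of text) { // for...of iterates by code point
    if (bytes >= byteOffset) break;
    const cp = ch.codePointAt(0);
    // Number of bytes this code point occupies in UTF-8.
    bytes += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
    index += ch.length; // 1 code unit, or 2 for characters outside the BMP
  }
  return index;
}

console.log(byteOffsetToStringIndex("a×-b", 3)); // 2
console.log(byteOffsetToStringIndex("中x", 3));  // 1
```

If the byte offset falls in the middle of a multibyte character, this sketch returns the index just after that character. Running the conversion once per match is linear in the text length, so a real implementation would probably want to convert all offsets in a single pass.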

gskinner commented 6 years ago

This also affects highlighting: https://regexr.com/40u3b

LJISAMoige commented 5 years ago

https://regexr.com/48d2t Parts of the text are duplicated in the Details tab, which seems to be due to the nesting of non-capturing groups in PCRE mode. (Edit: Or just nesting. See https://regexr.com/48d9u, where (a|(c)) triplicates the c of 'ca' but not 'ac'.)