Open teoric opened 9 years ago
I can reproduce this with ag master (0.31.0) on Ubuntu 14.04, with en_US.utf8 locale.
I think the underlying problem is that ag's regex matcher treats all files as byte streams rather than Unicode codepoint streams.
For example, on my UTF-8 system, the "ü" would be represented as two bytes ('\xc3\xbc'). With grep, a "." will match the entire character, but with ag, a "." will match one of the two bytes, splitting the character:
$ echo darüber > file; ag --only-matching .ber file; grep --only-matching .ber file
�ber
über
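To make the byte-versus-codepoint difference concrete, here is a minimal sketch using Python's re module as a stand-in. It is not ag's or grep's regex engine; it only illustrates what happens when "." consumes one byte instead of one character. The variable names are made up for the illustration.

```python
import re

text = "darüber"
data = text.encode("utf-8")  # b'dar\xc3\xbcber': "ü" becomes two bytes

# Byte-oriented matching: "." consumes exactly one byte, so the match
# starts at the second byte of "ü" (0xBC) and splits the character.
byte_match = re.search(rb".ber", data).group(0)

# Codepoint-oriented matching: "." consumes the whole character "ü".
char_match = re.search(r".ber", text).group(0)

# A lone 0xBC byte is not valid UTF-8, so a terminal (or a lenient
# decoder) shows a replacement character in its place.
shown = byte_match.decode("utf-8", errors="replace")

print(byte_match)  # b'\xbcber'
print(shown)
print(char_match)
```

The byte-oriented result corresponds to the first output line above, the codepoint-oriented one to the second.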
+1
I am trying to look for the presence of a word containing non-ASCII characters, and this is not possible:
The first should find me all lines containing a word starting with über, the second should find lines where über is a single word, shouldn't it? Texts are UTF-8, and dropping the boundaries gives thousands of results, as does searching with pcregrep.
The first line returns only lines containing words that contain über (such as darüberhinaus) and the second one those containing words ending in über (such as darüber), which seems to suggest that the boundary matches before ü, i.e. ü is not counted as a word character (but should be).
Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.
(ag 0.19.2 with Ubuntu 14.04, and 0.30.0 on Mac OS X 10.11)
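The word-boundary behaviour described above can be reproduced with a small sketch in Python's re module, assuming the cause is ASCII-only word classification. This is an illustration of the mechanism, not ag's actual code path: the patterns \büber and \büber\b are assumed from the description, and the sample words and the matching() helper are hypothetical.

```python
import re

lines = ["darüberhinaus", "darüber", "über", "überall"]

def matching(pattern, flags=0):
    """Return the sample lines in which `pattern` is found."""
    return [s for s in lines if re.search(pattern, s, flags)]

# ASCII-only word characters: "ü" is treated as a non-word character, so
# \b can match between an ASCII letter and "ü", but never before an "ü"
# that starts a word.
ascii_prefix = matching(r"\büber", re.ASCII)    # darüberhinaus, darüber
ascii_whole = matching(r"\büber\b", re.ASCII)   # darüber

# Unicode-aware word characters (Python's default for str patterns):
# "ü" is a word character, so \b behaves as expected.
unicode_prefix = matching(r"\büber")            # über, überall
unicode_whole = matching(r"\büber\b")           # über

for name, result in [("ascii_prefix", ascii_prefix),
                     ("ascii_whole", ascii_whole),
                     ("unicode_prefix", unicode_prefix),
                     ("unicode_whole", unicode_whole)]:
    print(name, result)
```

With ASCII-only classification the results line up with the report: the first pattern finds only words that contain über after an ASCII letter, and the second finds only words ending in über.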