Word boundaries with non-ASCII words

teoric commented 9 years ago

I am trying to search for a word containing non-ASCII characters, but this does not work:

ag '\büber'
ag -w über

The first should find all lines containing a word starting with über, and the second should find lines where über is a whole word, shouldn't they? The texts are UTF-8. Dropping the boundaries gives thousands of results, as does searching with pcregrep.

The first command returns only lines where über appears inside a longer word (such as darüberhinaus), and the second only lines with words ending in über (such as darüber). This suggests that the boundary matches between a preceding letter and ü, i.e. ü is not counted as a word character (but should be).

Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.

(ag 0.19.2 on Ubuntu 14.04, and 0.30.0 on Mac OS X 10.11)
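
The symptoms above are what one would expect if the pattern were matched against raw UTF-8 bytes instead of decoded characters. Here is a minimal sketch of that idea using Python's `re` module, which is not the PCRE engine ag uses, so it only illustrates the suspected mechanism and does not reproduce ag itself. The sample string and variable names are made up for the illustration.

```python
import re

# Hypothetical sample text; the words are taken from the report above.
text = "über darüber darüberhinaus überall"
data = text.encode("utf-8")          # ü becomes the two bytes \xc3\xbc
needle = "über".encode("utf-8")

# Bytes patterns treat only ASCII letters, digits and _ as word characters,
# so \xc3 counts as a non-word byte and \b only fires after an ASCII letter.
byte_prefix_hits = [m.start() for m in re.finditer(rb"\b" + needle, data)]
byte_word_hits = [m.start() for m in re.finditer(rb"\b" + needle + rb"\b", data)]

# Str patterns are Unicode-aware: ü is a word character.
text_prefix_hits = [m.start() for m in re.finditer(r"\büber", text)]
text_word_hits = [m.start() for m in re.finditer(r"\büber\b", text)]

print(byte_prefix_hits)  # [9, 18]  -> inside "darüber" and "darüberhinaus"
print(byte_word_hits)    # [9]      -> only "darüber" (a word ending in über)
print(text_prefix_hits)  # [0, 27]  -> "über" and "überall"
print(text_word_hits)    # [0]      -> only the standalone "über"
```

The byte-mode offsets are byte positions and the text-mode offsets are character positions. In byte mode the matches land exactly on the unwanted cases (darüber, darüberhinaus), while the Unicode-aware patterns give the results I expected from ag.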

joliss commented 9 years ago

I can reproduce this with ag master (0.31.0) on Ubuntu 14.04, with en_US.utf8 locale.

I think the underlying problem is that ag's regex matcher treats all files as byte streams rather than Unicode codepoint streams.

For example, on my UTF-8 system, the "ü" would be represented as two bytes ('\xc3\xbc'). With grep, a . will match the entire character, but with ag, a . will match one of the two bytes, splitting the character:

$ echo darüber > file; ag --only-matching .ber file; grep --only-matching .ber file
�ber
über
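
The same byte-versus-character difference can be sketched outside of ag. The following Python snippet is only an analogy (Python's `re`, not ag's matcher): a bytes pattern lets `.` consume a single byte, while a str pattern lets it consume a whole code point.

```python
import re

word = "darüber"
data = word.encode("utf-8")   # b'dar\xc3\xbcber'

# Bytes pattern: "." matches exactly one byte, here the second byte of ü.
byte_match = re.search(rb".ber", data).group()

# Str pattern: "." matches one full code point.
text_match = re.search(r".ber", word).group()

# Decoding the lone continuation byte yields the replacement character,
# which is what a terminal shows for the ag output above.
shown = byte_match.decode("utf-8", errors="replace")

print(byte_match)  # b'\xbcber'
print(shown)       # �ber
print(text_match)  # über
```

The stray `\xbc` is a UTF-8 continuation byte without its lead byte, which is why it renders as � rather than ü.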

mensfeld commented 7 years ago

+1

ggreer / the_silver_searcher

Word boundaries with non-ASCII words #721