ggreer / the_silver_searcher

A code-searching tool similar to ack, but faster.
http://geoff.greer.fm/ag/
Apache License 2.0

Searching UTF-8 on Mac is always case sensitive #192

Open igoralekseev opened 11 years ago

igoralekseev commented 11 years ago

Installed ag with brew on Mac OS 10.8.3 (the examples use Russian).

ag -i ваш doesn't find 'Ваш', but ag -i Ваш finds 'Ваш'.

afredd commented 11 years ago

I suspect this comes down to the code using tolower() in boyer_moore_strncasestr().

The fix may be as simple as putting this early on in main()

setlocale(LC_CTYPE, ""); /* Use the current locale's tolower(). */

Whether this helps will depend on whether you're using an 8-bit character set or full UTF-8 (i.e. whether the text you entered above is seen by ag as multiple bytes per character).

Question: does -i work if the literal is turned into a regex by appending a '.'? (This would test the PCRE case-insensitive code.)

einars commented 11 years ago

(archlinux, same issue)

No, setlocale(LC_CTYPE, "") doesn't seem to improve anything. Yes, -i does work when the text is turned into a regex by appending a dot.

igoralekseev commented 11 years ago

Didn't try setlocale -- not my field of competence.

-i with a dot appended doesn't work either.

novalis commented 11 years ago

There's not going to be a simple fix for this. ag's plain text search routines assume that the data is ASCII. This will be hard to change, because the routines are based on Boyer-Moore, which searches (sort of) backwards, while C's multibyte character functions assume that text is being processed forwards (it may be that mbstate_t does not matter for UTF-8, but I just don't know). Further, even if someone did write a simple UTF-8 matcher, it wouldn't handle the gigantic nightmare of combining characters etc.

It would be nice to just switch to libicu (which handles all of the necessary canonicalization), but this requires converting the needle and haystack to UTF-16 (the worst of all possible encodings). That's not impossible, but it would potentially use a lot of memory given that ag operates on mmapped files of arbitrary (well, up to 2GB) size. So, rewriting it to work on chunks of files would probably be the right thing to do.

It would be possible to simply switch to a regex search, but only if (a) PCRE_UTF8 were included in pcre_opts and (b) your version of pcre were compiled with UTF-8 support. Mine (Ubuntu 12.04) is not. This would, of course, also solve the problem of UTF-8 regexes, which none of the other solutions would do. I don't know how pcre handles canonicalization.

skrattaren commented 10 years ago

archlinux, same issue

Worth renaming the issue, then.

I was surprised to find that plain GNU grep matches Cyrillic case-insensitively without any trouble.

plumdog commented 9 years ago

I found what I think is a closely related issue on Ubuntu 14.04, and thought I'd report it here.

My file is f.txt and looks like

ö
o

I then ran the following:

❯ ag o f.txt
2:o
❯ ag ö f.txt
1:ö
❯ ag -i o f.txt
2:o
❯ ag -i ö f.txt
# no results!

So I'm getting fewer results by using case insensitive search. Running the same with grep worked fine.

anlutro commented 9 years ago

I'm on Debian and having similar issues. I have to search case-sensitively if the string contains non-ASCII characters.

$ echo "Datatilbehør" > test.txt 
$ ag -i 'datatilbehør'
$ ag -i 'Datatilbehør'
$ ag 'datatilbehør'
$ ag 'Datatilbehør'
test.txt
1:Datatilbehør

Related issue: https://github.com/ggreer/the_silver_searcher/issues/553