monochromegane / the_platinum_searcher

A code search tool similar to ack and the_silver_searcher(ag). It supports multi platforms and multi encodings.
MIT License

failure to match (incorrect encoding issues). #174

Open mijoharas opened 7 years ago

mijoharas commented 7 years ago

I have a repo here with a minimal example that exhibits the problem.

The file is UTF-8 and contains some emoji. Searching for foobar with: pt foobar example.txt does not show a match.

Detected points[utf8/eucjp/shiftjis] is 1/0/2.

This is a minimal example that shows the problem; other files also seem to have the wrong encoding detected.

The bytes for the lines are interpreted in UTF-8 as:

scanner.Bytes() [240 159 146 184]
scanner.Bytes() [226 152 149]
scanner.Bytes() [240 159 145 139]
scanner.Bytes() [102 111 111 98 97 114]

and in Shift-JIS as:

scanner.Bytes() [239 191 189 233 160 130]
scanner.Bytes() [231 172 152]

I've got two suggestions on how to solve this, though I don't know too much about encoding schemes.

The first suggestion is simply to add override options that let us specify the encoding: --utf8 and --shiftJIS would do what you would expect.
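
To make the override idea concrete, here is a minimal sketch using Go's standard flag package. It is an assumption for illustration only: the helper parseEncodingOverride and the encoding names are hypothetical, --utf8 and --shiftJIS are the options proposed above rather than existing pt options, and pt's actual option parsing may look quite different.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// Encoding names used only in this sketch (hypothetical, not pt's own constants).
const (
	encAuto     = "auto"
	encUTF8     = "utf8"
	encShiftJIS = "shiftjis"
)

// parseEncodingOverride is a hypothetical helper. It reads the two proposed
// flags and returns the forced encoding, or encAuto when neither flag is given.
func parseEncodingOverride(args []string) (string, error) {
	fs := flag.NewFlagSet("pt-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	forceUTF8 := fs.Bool("utf8", false, "treat every file as UTF-8")
	forceSJIS := fs.Bool("shiftJIS", false, "treat every file as Shift-JIS")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch {
	case *forceUTF8 && *forceSJIS:
		return "", errors.New("--utf8 and --shiftJIS are mutually exclusive")
	case *forceUTF8:
		return encUTF8, nil
	case *forceSJIS:
		return encShiftJIS, nil
	}
	return encAuto, nil // no override: fall back to automatic detection
}

func main() {
	enc, err := parseEncodingOverride(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Println("encoding:", enc)
}
```

With an override in place, the detection step would only run when the result is "auto", so a file like example.txt could be forced to UTF-8 regardless of its detection score.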

The second suggestion is to try decoding a portion of the file as UTF-8 or Shift-JIS (e.g. with https://godoc.org/golang.org/x/text/encoding ) and see whether that produces an error. I don't know much about Shift-JIS, so I'm not sure how well this would work for it.
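
A rough sketch of the UTF-8 half of that check follows. It swaps in the standard library's unicode/utf8 package instead of golang.org/x/text/encoding, so it does not attempt Shift-JIS decoding at all. The helper names looksLikeUTF8 and trimPartialRune and the 512-byte sample size are assumptions made for illustration; the sample bytes are the ones quoted above, joined with newlines.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// trimPartialRune drops an incomplete multi-byte sequence at the end of b,
// so that cutting a file mid-character is not mistaken for invalid UTF-8.
func trimPartialRune(b []byte) []byte {
	for i := len(b) - 1; i >= 0 && i >= len(b)-utf8.UTFMax; i-- {
		if utf8.RuneStart(b[i]) {
			if !utf8.FullRune(b[i:]) {
				return b[:i]
			}
			return b
		}
	}
	return b
}

// looksLikeUTF8 is a hypothetical helper: it reports whether the first n
// bytes of data form valid UTF-8.
func looksLikeUTF8(data []byte, n int) bool {
	if len(data) > n {
		data = trimPartialRune(data[:n])
	}
	return utf8.Valid(data)
}

func main() {
	// Bytes quoted in the issue: three emoji lines, then "foobar".
	sample := []byte{
		240, 159, 146, 184, '\n',
		226, 152, 149, '\n',
		240, 159, 145, 139, '\n',
		102, 111, 111, 98, 97, 114, '\n',
	}
	fmt.Println("valid UTF-8:", looksLikeUTF8(sample, 512))
	fmt.Println("contains foobar:", bytes.Contains(sample, []byte("foobar")))
}
```

UTF-8 validity is a fairly strict test, so trying it before any score-based detection seems reasonable. Whether a failed Shift-JIS decode is equally informative is a separate question that this sketch does not address.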

Have you got any thoughts?