Closed by staktrace 6 years ago
Oh fun. I swear I had code at some point to drop invalid UTF-8, but it looks like I only filter it from the search line result, not the context.
I'm tempted to entirely drop (with a warning) any source files that aren't valid UTF-8. Would that be workable for you? You'd need to do something like iconv -f iso-8859-15 -t utf-8 to transcode that dictionary file yourself. Otherwise the second-easiest thing would be to replace any ill-formed lines with a placeholder or similar. I'd rather not add support for transcoding to livegrep itself, if I can avoid it.
I'd prefer to not have to transcode the file. Replacing ill-formed lines/context with placeholders would be fine for me.
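For illustration, here is a minimal sketch of the placeholder idea in Python (not livegrep's actual C++ code). The names `PLACEHOLDER`, `sanitize_line`, and `sanitize_context` and the sample bytes are hypothetical, chosen only to show the behavior: each match or context line is decoded strictly as UTF-8, and any line that fails to decode is swapped for a fixed marker instead of being passed through.

```python
# Hypothetical placeholder text; the real fix may use something different.
PLACEHOLDER = "<invalid utf-8>"


def sanitize_line(raw: bytes, placeholder: str = PLACEHOLDER) -> str:
    """Return the decoded line, or the placeholder if raw is not valid UTF-8."""
    try:
        # Strict decoding raises on any ill-formed byte sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return placeholder


def sanitize_context(lines):
    """Apply sanitize_line to every before/after context line."""
    return [sanitize_line(line) for line in lines]


# Invented sample: the middle line holds a Latin-1 accented o (byte 0xF3),
# which is not a valid UTF-8 sequence on its own.
context = [b"first", b"sec\xf3nd", b"third"]
print(sanitize_context(context))
```

The important part is that the same check runs on context lines as well as on the matched line, since the context was the part left unfiltered here.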
Also, for future reference, the specific character causing the problem is the accented o at https://searchfox.org/mozilla-central/rev/c3fef66a5b211ea8038c1c132706d02db408093a/extensions/spellcheck/locales/en-US/hunspell/en-US.dic#2772 which forms part of the before-context for one of the matches a few lines down.
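In case it helps anyone hunting for similar bytes in other files, a rough Python sketch for locating them follows. The function name `find_invalid_utf8` is made up for this example, and it assumes the file is newline-separated and small enough to read into memory.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, offset_in_line, offending_byte) for each line
    that is not valid UTF-8. Line numbers are 1-based."""
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the index of the first byte that failed to decode.
            yield lineno, err.start, line[err.start:err.start + 1]


# Usage sketch (the path is whatever file you want to check):
# with open("en-US.dic", "rb") as f:
#     for lineno, offset, byte in find_invalid_utf8(f.read()):
#         print(lineno, offset, byte)
```

Only the first bad byte on each line is reported, which is usually enough to identify the file's real encoding.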
Verified the fix works, thanks!
I was able to narrow down the problem from #182 a little bit. I'm starting the codesearch tool like so:
and sending it a query which produces this output:
If I comment out the line at https://github.com/livegrep/livegrep/blob/5aacba4bce494ad7ca5b07bb25e4a140c1731f87/src/tools/grpc_server.cc#L188 then it works fine. The query is for the string onchit in the hunspell English dictionary file; you can see the raw file at https://hg.mozilla.org/mozilla-central/raw-file/tip/extensions/spellcheck/locales/en-US/hunspell/en-US.dic