livegrep attempts to put invalid utf-8 data into a protocol buffer

staktrace commented 6 years ago

I was able to narrow down the problem from #182 a little bit. I'm starting the codesearch tool like so:

codesearch -grpc localhost:8082 -load_index /home/ubuntu/index/mozilla-central/livegrep.idx -max_matches 1000 -timeout 10000

and sending it a query which produces this output:

processing query line='onchit' file='hunspell' tree='mozilla-central|mozilla-central-__GENERATED__' tags='' not_file='' not_tree='' not_tags='' max_matches='1000'
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:626] String field 'SearchResult.context_before' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.

If I comment out the line at https://github.com/livegrep/livegrep/blob/5aacba4bce494ad7ca5b07bb25e4a140c1731f87/src/tools/grpc_server.cc#L188 then it works fine. The query is for the string onchit in the hunspell english dictionary file, you can see the raw file at https://hg.mozilla.org/mozilla-central/raw-file/tip/extensions/spellcheck/locales/en-US/hunspell/en-US.dic

nelhage commented 6 years ago

Oh fun. I swear I had code at some point to drop invalid UTF-8, but it looks like I only filter it from the search line result, not the context.

I'm tempted to entirely drop (with a warning) any source files that aren't valid utf-8. Would that be workable for you? You'd need to do something like iconv -f iso-8859-15 -t utf-8 to transcode that dictionary file, yourself. Otherwise the second-easiest thing would be to replace any ill-formed lines with a placeholder or similar. I'd rather not support for transcoding into livegrep itself, if I can avoid it.

staktrace commented 6 years ago

I'd prefer to not have to transcode the file. Replacing ill-formed lines/context with placeholders would be fine for me.

staktrace commented 6 years ago

Also for future reference the specific character causing the problem is the accented o at https://searchfox.org/mozilla-central/rev/c3fef66a5b211ea8038c1c132706d02db408093a/extensions/spellcheck/locales/en-US/hunspell/en-US.dic#2772 which forms part of the before-context for one of the matches a few lines down.

staktrace commented 6 years ago

Verified the fix works, thanks!

livegrep / livegrep

livegrep attempts to put invalid utf-8 data into a protocol buffer #185