google / zoekt

Fast trigram based code search

[gitindex] Indexing repositories with malformed documents / missing blobs #73

Open r10r opened 5 years ago

r10r commented 5 years ago

When running zoekt-git-index on all of our Git repositories, I noticed that a few repositories were missing from the search. After digging into it, I discovered that the indexer aborts at the first indexing error. Since a repository may occasionally contain a malformed document (e.g. with invalid UTF-8 sequences), the indexer should be able to ignore these errors. I've added a flag, ContinueOnError, to allow indexing of repositories with missing blobs and malformed documents. These repositories should be fixed anyway, but in the meantime only the broken files are left out of the index rather than the whole repository.
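The behaviour described above can be sketched roughly as follows. This is only an illustration of the continue-on-error idea, not the actual patch: `doc`, `addFunc`, `indexAll`, and `errSimulated` are hypothetical names, and the real indexer's types and calls will differ.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// doc is a hypothetical stand-in for one file to be indexed.
type doc struct {
	Name    string
	Content []byte
}

// addFunc is a hypothetical stand-in for the call that adds a document to the index.
type addFunc func(d doc) error

// errSimulated is used only to fake an indexing failure in this sketch.
var errSimulated = errors.New("simulated indexing failure")

// indexAll adds every document in order. With continueOnError set, a failing
// document is logged and skipped; otherwise the first error aborts the run.
// It returns the names of the skipped documents.
func indexAll(docs []doc, add addFunc, continueOnError bool) ([]string, error) {
	var skipped []string
	for _, d := range docs {
		if err := add(d); err != nil {
			if !continueOnError {
				return skipped, fmt.Errorf("failed to add document %s: %w", d.Name, err)
			}
			log.Printf("skipping %s: %v", d.Name, err)
			skipped = append(skipped, d.Name)
		}
	}
	return skipped, nil
}

func main() {
	docs := []doc{{Name: "a.txt"}, {Name: "broken.bin"}, {Name: "b.txt"}}
	add := func(d doc) error {
		if d.Name == "broken.bin" {
			return errSimulated
		}
		return nil
	}
	skipped, err := indexAll(docs, add, true)
	fmt.Println(skipped, err)
}
```

Returning the list of skipped names keeps the failures visible, so the broken files can still be tracked down and fixed later.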

hanwen commented 5 years ago

For invalid UTF-8 sequences, we should just insert a placeholder and continue. I quickly looked at the code, and I think it's already doing that. Can you verify whether it really aborted because of invalid UTF-8?
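For reference, the placeholder idea can be sketched with the Go standard library. This is not a claim about how zoekt handles it internally; `sanitize` is a hypothetical helper, and using U+FFFD as the placeholder is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize returns content with each run of invalid UTF-8 bytes replaced by
// the Unicode replacement character U+FFFD. The second result reports
// whether anything had to be replaced.
func sanitize(content string) (string, bool) {
	if utf8.ValidString(content) {
		return content, false
	}
	return strings.ToValidUTF8(content, "\uFFFD"), true
}

func main() {
	// "\xe9" is a lone Latin-1 byte, which is not valid UTF-8 on its own.
	out, changed := sanitize("caf\xe9 au lait")
	fmt.Printf("%q changed=%v\n", out, changed)
}
```

Note that strings.ToValidUTF8 collapses a run of consecutive invalid bytes into a single placeholder, so byte and rune offsets in the sanitized text differ from those in the original file.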

r10r commented 5 years ago

This is the error I get. I only suspect it to be an encoding / UTF-8 error; honestly, I did not search for the root cause.

2019/01/08 11:37:48 failed to add document rootfs/usr/share/aptitude/README.cs : no rune for section boundary at byte 514

README.cs.zip

hanwen commented 5 years ago

aha. Could you share the file with me? Or maybe make a smaller reproducer? You probably need to cut off runes from the start in multiples of 100.
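A small helper along these lines could be used to shrink the file. It is only a sketch: `dropRuneBlocks` and `blockSize` are hypothetical names, and the block size of 100 is taken from the comment above rather than from the indexer source.

```go
package main

import (
	"fmt"
	"strings"
)

// blockSize is the number of runes per block; 100 is an assumption based on
// the suggestion above.
const blockSize = 100

// dropRuneBlocks removes the first blocks*blockSize runes from content and
// returns the rest. Each invalid byte counts as one rune, as it does when
// ranging over a Go string. If content is too short, the result is empty.
func dropRuneBlocks(content string, blocks int) string {
	toDrop := blocks * blockSize
	if toDrop <= 0 {
		return content
	}
	count := 0
	for i := range content {
		if count == toDrop {
			return content[i:]
		}
		count++
	}
	return ""
}

func main() {
	// 250 two-byte runes; dropping 2 blocks leaves 50 runes (100 bytes).
	content := strings.Repeat("é", 250)
	rest := dropRuneBlocks(content, 2)
	fmt.Println(len(content), len(rest))
}
```

Trimming one block at a time and re-running the indexer after each cut shows how small the input can get while still triggering the error.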

r10r commented 5 years ago

The file is already attached to the previous comment. I zipped it because GitHub would not let me upload it otherwise.

r10r commented 5 years ago

I've refactored the patch and uploaded it to Gerrit (a new commit with a new changeset ID). Honestly, I'm a little confused by the Gerrit workflow, as I have never worked with it before. Please tell me if I have to change something. Thanks.

hanwen commented 5 years ago

I can't reproduce this. Which version were you using?

hanwen@han-wen:~/go/src/github.com/google/zoekt$ git log HEAD |head -1
commit 43635377d1e262e9a40da6d865ba8f8d2157b88f
hanwen@han-wen:~/go/src/github.com/google/zoekt$ go install github.com/google/zoekt/cmd/zoekt-index && zoekt-index --file_limit 1000000 t/
2019/01/08 20:41:06 finished /usr/local/google/home/hanwen/.zoekt/t_v15.00000.zoekt: 1076357 index bytes (overhead 3.1)
hanwen@han-wen:~/go/src/github.com/google/zoekt$ ls -l t/
total 340
-rw-r--r-- 1 hanwen primarygroup 347653 Jan  8 14:20 README.cs