google / zoekt

Fast trigram based code search
1.69k stars 113 forks source link

Panic when indexing document with symbols that cross rune boundaries #49

Closed mmm444 closed 6 years ago

mmm444 commented 6 years ago

Sometimes ctags (both exuberant and universal) return symbol boundaries that cross rune boundaries. This leads to a panic during indexing.

I have reduced the issue that happened during indexing my codebase to this test:

func TestBadRunesFromCtags(t *testing.T) {
    content := []byte("Hlášení: {0}")
    b := testIndexBuilder(t, nil,
        Document{
            Name:    "bad_rune.cs",
            Content: content,
            Symbols: []DocumentSection{{6, 9}}, // splits í rune (ctags does this)
        },
    )
    searchForTest(t, b, &query.Substring{Pattern: "Hlášení"})
}

which panics:

--- FAIL: TestBadRunesFromCtags (0.00s)
panic: runtime error: index out of range [recovered]
        panic: runtime error: index out of range

goroutine 340 [running]:
testing.tRunner.func1(0xc42012a2d0)
        /usr/local/go/src/testing/testing.go:711 +0x2d2
panic(0x799260, 0xa05be0)
        /usr/local/go/src/runtime/panic.go:491 +0x283
github.com/google/zoekt.(*postingsBuilder).newSearchableString(0xc42007c500, 0xc42012c34f, 0xf, 0x1, 0xc42012c350, 0x1, 0x1, 0x2, 0xc42023fb30, 0x0, ...)
        /home/mmm/go/src/github.com/google/zoekt/indexbuilder.go:138 +0xc07
github.com/google/zoekt.(*IndexBuilder).Add(0xc4200b8420, 0x803c18, 0x7, 0xc42012c340, 0xf, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /home/mmm/go/src/github.com/google/zoekt/indexbuilder.go:376 +0x451
github.com/google/zoekt.testIndexBuilder(0xc42012a2d0, 0x0, 0xc420051f20, 0x1, 0x1, 0x10)
        /home/mmm/go/src/github.com/google/zoekt/index_test.go:48 +0x1b4
github.com/google/zoekt.TestBadRunesFromCtags(0xc42012a2d0)
        /home/mmm/go/src/github.com/google/zoekt/index_test.go:1743 +0x18b
testing.tRunner(0xc42012a2d0, 0x81e5c0)
        /usr/local/go/src/testing/testing.go:746 +0xd0
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:789 +0x2de
exit status 2
FAIL    github.com/google/zoekt 0.042s

This happened on latest HEAD which was at 3f9ebd3f2087991625ed8666bd805df6787a341b at time of writing this issue report.

mmm444 commented 6 years ago

I think the issue can be solved by changing == to <= on lines https://github.com/google/zoekt/blob/master/indexbuilder.go#L94 and https://github.com/google/zoekt/blob/master/indexbuilder.go#L118 . But I am not sure if this is the right way and place to deal with it.

hanwen commented 6 years ago

The document sections are generated by Zoekt too, as CTags doesn't actually output byte ranges, IIRC. Can you check how this invalid input actually got generated?

I'm a bit torn whether this should generate an error rather than silently swallowing/modifying the given section .

mmm444 commented 6 years ago

Uh oh, sorry about the confusion. I was expecting something that is not true.

The way to trick ctags into spliting the rune is simple:

# cat test.cs
aá:{0}
# hexdump -C test.cs 
00000000  61 c3 a1 3a 7b 30 7d 0a                           |a..:{0}.|
00000008
# ctags --version
Universal Ctags 0.0.0(7918d19f), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Feb 12 2018, 11:44:40
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +multibyte, +option-directory, +xpath
# ctags -f- test.cs
aà  test.cs /^aá:{0}$/;"    p
# ctags -f- test.cs | hexdump -C
00000000  61 c3 09 74 65 73 74 2e  63 73 09 2f 5e 61 c3 a1  |a..test.cs./^a..|
00000010  3a 7b 30 7d 24 2f 3b 22  09 70 0a                 |:{0}$/;".p.|
0000001b

In the hexdump you can see that there are only 2 bytes before the tab 0x09. So the reasons why this happens are basically a confusing file extension and some non-ASCII UTF-8 inside that resembles a programming language.

The more I think about it I am also not sure whether this should be fixed on the Zoekt side. But where? Is this a bug of ctags encoding handling then?

hanwen commented 6 years ago

Which version of ctags do you use? I think universal-ctag --json should handle this correctly, no?

hanwen commented 6 years ago

oh sorry, Ctags 0.0.0(7918d19f) - I missed the blurb.

I'll have a look.

mmm444 commented 6 years ago

I was using HEAD version universal-ctags but I didn't rename it to universal-ctags. I instead changed the name in the zoekt source code. And that turned out to be a big mistake.

When I run the universal-ctags with JSON output it fails in a strange way:

# universal-ctags --output-format=json -f- test.cs 
(null)

But this output is actually handled in zoekt. So in the end everything works.

Do you think it is worth reporting the (null) output issue to universal-ctags?

Feel free to close this issue. Thank you for you time!

hanwen commented 6 years ago

yes, please do.

There is still something to do here, though, but I'll take care of it.

mmm444 commented 6 years ago

https://github.com/universal-ctags/ctags/issues/1686

hanwen commented 6 years ago

https://gerrit-review.googlesource.com/c/zoekt/+/159230