google / zoekt

Fast trigram based code search
1.69k stars 113 forks source link

parallel loading of shards deadlocks webserver #23

Closed hanwen closed 6 years ago

hanwen commented 7 years ago

this happens on cs.bazel.build, most likely introduced by e9c19a019ce90ce9ce7021c81c8d0bb10aedeaed

commit e9c19a019ce90ce9ce7021c81c8d0bb10aedeaed (origin/master) Author: Han-Wen Nienhuys hanwen@google.com Date: Tue Jun 13 21:14:17 2017 +0200

Load shards as fast as we can

This speeds up restart after crash/update.
hanwen commented 7 years ago

2017/06/13 20:35:31 got query "r:ctags getpid " 2017/06/13 20:53:11 reloading: _zoekt_repos_android.googlesource.com_platform_art.git_v13.00000.zoekt, err 2017/06/13 20:53:24 reloading: _zoekt_repos_android.googlesource.com_platform_development.git_v13.00000.zoekt, err 2017/06/13 20:53:34 reloading: _zoekt_repos_android.googlesource.com_platform_external_r8.git_v13.00000.zoekt, err ... 2017/06/13 21:12:26 reloading: _zoekt_repos_chromium.googlesource.com_chromium_src_third_party.git_v13.00001.zoekt, err 2017/06/13 21:12:27 reloading: _zoekt_repos_chromium.googlesource.com_chromium_src_third_party.git_v13.00000.zoekt, err 2017/06/13 21:26:30 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 5ms 2017/06/13 21:26:30 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 10ms

hanwen commented 7 years ago

alternative explanation is the apt-get update that triggered some other breakage.

alternative way to fix this: have systemd restart the server if 80 doesn't come up in 60s.

hanwen commented 7 years ago

systemd doesn't support this, seemingly; suggestions to use 'monit' on SO.

maybe add a watchdog inside the webserver itself?

hanwen commented 7 years ago

the server deadlocks itself during update.

Each health request from GCP is a incoming connection which is blocked, leading to fd exhaustion.

rolling back to last known good, 2017-06-13T1917-6cc1674

Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: goroutine 12009958 [chan send, 6 minutes]: Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/shards.(shardWatcher).rlock(0xc42016e510) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/shards/watcher.go:156 +0x45 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/shards.(shardedSearcher).List(0xc73e95ad70, 0x7fe2da0e5000, 0xc420014490, 0x9ca260, 0xc609a960da, Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/shards/shards.go:159 +0x6f Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/web.(Server).fetchStats(0xc73e9ea240, 0x5940f208, 0xc4b3356bc0, 0x402d39) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/web/server.go:277 +0xc1 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/web.(Server).serveSearchBoxErr(0xc73e9ea240, 0x9cf260, 0xc62c7d42a0, 0xc645ded200, 0xc420022c50, 0 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/web/server.go:294 +0x40 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/web.(Server).serveSearchBox(0xc73e9ea240, 0x9cf260, 0xc62c7d42a0, 0xc645ded200) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/web/server.go:330 +0x4d Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: github.com/google/zoekt/web.(Server).(github.com/google/zoekt/web.serveSearchBox)-fm(0x9cf260, 0xc62c7d42a0, 0xc645ded200) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/web/server.go:148 +0x48 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: net/http.HandlerFunc.ServeHTTP(0xc73e95ada0, 0x9cf260, 0xc62c7d42a0, 0xc645ded200) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/lib/google-golang/src/net/http/server.go:1942 +0x44 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: net/http.(ServeMux).ServeHTTP(0xc73e9ec180, 0x9cf260, 0xc62c7d42a0, 0xc645ded200) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/lib/google-golang/src/net/http/server.go:2238 +0x130 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: net/http.serverHandler.ServeHTTP(0xc73e7b73f0, 0x9cf260, 0xc62c7d42a0, 0xc645ded200) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/lib/google-golang/src/net/http/server.go:2568 +0x92 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: net/http.(conn).serve(0xc623389720, 0x9cfa60, 0xc56a5e7980) Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/lib/google-golang/src/net/http/server.go:1825 +0x612 Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: created by net/http.(*Server).Serve Jun 14 08:27:27 zoekt-60gb-16cpu-localssd zoekt-webserver[26593]: /usr/lib/google-golang/src/net/http/server.go:2668 +0x2ce

hanwen commented 7 years ago

fixed by reverting in https://github.com/google/zoekt/commit/e20e7193156aebaff2c28c0598bd47f2066eba14

hanwen commented 6 years ago

looks like this was reintroduced with https://gerrit-review.googlesource.com/c/zoekt/+/152374 shards: load up to NumCPU() shards in parallel

log.txt

hanwen commented 6 years ago

$ python p.py writing logs to /zoekt/logs/zoekt-webserver.2018-01-15T15-05-32.910730325Z.16979.log SIGQUIT: quit PC=0x5747b1 m=0 sigcode=0 0 [idle]: runtime.futex(0xbbbd90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffc7f7f4648, 0x5268db, ...) /usr/lib/google-golang/src/runtime/sys_linux_amd64.s:438 +0x21 runtime.futexsleep(0xbbbd90, 0x7ffc00000000, 0xffffffffffffffff) /usr/lib/google-golang/src/runtime/os_linux.go:45 +0x62 runtime.notesleep(0xbbbd90) /usr/lib/google-golang/src/runtime/lock_futex.go:151 +0x9b runtime.stopm() /usr/lib/google-golang/src/runtime/proc.go:1689 +0xe5 runtime.gcstopm() /usr/lib/google-golang/src/runtime/proc.go:1893 +0xb7 runtime.schedule() /usr/lib/google-golang/src/runtime/proc.go:2228 +0x2c0 runtime.park_m(0xc420146900) /usr/lib/google-golang/src/runtime/proc.go:2327 +0xb6 runtime.mcall(0x7ffc7f7f4770) /usr/lib/google-golang/src/runtime/asm_amd64.s:286 +0x5b 1 [semacquire, 21 minutes]: sync.runtime_Semacquire(0xc42019400c) /usr/lib/google-golang/src/runtime/sema.go:56 +0x39 sync.(WaitGroup).Wait(0xc420194000) /usr/lib/google-golang/src/sync/waitgroup.go:131 +0x72 github.com/google/zoekt/shards.(shardWatcher).scan(0xc4201928a0, 0x0, 0xc4201928d0) /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/shards/watcher.go:123 +0x970 github.com/google/zoekt/shards.NewDirectoryWatcher(0x7ffc7f7f5f3d, 0xc, 0xb71560, 0xc42004af60, 0xc420192870, 0x54bc3b, 0xc42016bd58, 0x853c43) /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/shards/watcher.go:58 +0x104 github.com/google/zoekt/shards.NewDirectorySearcher(0x7ffc7f7f5f3d, 0xc, 0x7ffc7f7f5f53, 0xb, 0x4e94914f0000, 0x0) /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/shards/shards.go:65 +0x168 main.main() /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/cmd/zoekt-webserver/main.go:140 +0x427 8 [chan receive, 21 minutes]: main.divertLogs(0x7ffc7f7f5f53, 0xb, 0x4e94914f0000) /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/cmd/zoekt-webserver/main.go:61 +0x93 created by main.main /usr/local/google/home/hanwen/go/src/github.com/google/zoekt/cmd/zoekt-webserver/main.go:137 +0xf13 hanwen@han-wen:~/Downloads$ cat p.py

for f in open('log.txt').read().split('\n goroutine'): if 'throttledLoader' not in f: print f