Closed: aisbaa closed this issue 4 years ago
I believe @keegancsmith has more context on this.
Just to get a little bit more context: if you go to /site-admin/repositories and scroll to the bottom, is the number in "X repositories total (showing first 20)" what you expect?
I can find the repositories with the repo: keyword, and when I click on a search result sourcegraph starts cloning. Logs:
2020/02/14 08:30:36 error indexing gitolite/namespace/repo@hash: Get http://sourcegraph-frontend-internal/.internal/search/configuration: dial tcp REDACTED_IP:80: connect: cannot assign requested address
I've tried issuing a request from the indexer pod:
/home/sourcegraph # wget http://sourcegraph-frontend-internal/
Connecting to sourcegraph-frontend-internal (REDACTED_IP:80)
wget: server returned error: HTTP/1.1 404 Not Found
/home/sourcegraph # wget http://sourcegraph-frontend-internal/.internal/
Connecting to sourcegraph-frontend-internal (REDACTED_IP:80)
wget: can't connect to remote host (REDACTED_IP): Address not available
Deleted and recreated the sourcegraph-frontend-internal service, will see if that solves the indexer issue.
Recreating the sourcegraph-frontend-internal service did not help.
I've discovered one more issue which might be related - the indexer fails to reach the sourcegraph-frontend-internal svc.
This is likely related, and signals the problem may be with your sourcegraph-frontend deployment. Can you check the logs of repo-updater to see if it also has similar connectivity issues with the frontend?
Also check the frontend logs to see if there is any indication there as to why it might not be responding (you can also just shoot us those in an email to support@sourcegraph.com if you don't want to dig through them yourself).
For background, indexing is an out-of-band process so it cannot impede cloning. repo-updater is what determines the schedule of repository updates and effectively just asks gitserver "try to git pull this repo now". If repo-updater cannot talk to the frontend-internal service, it may be blocked from enqueuing repo updates.
Also, if you have the builtin Grafana deployment active, you may also check https://sourcegraph.example.com/-/debug/grafana to see if any alerts have been firing there which would tell us if there may be a different issue here.
Repo updater works as usual, no connection errors logged. Frontend logs on the standby cluster contain only refresh requests coming from gitolite. The repo-updater logs do, however, contain quite a few warnings:
t=2020-02-14T14:37:11+0000 lvl=warn msg="unable to construct clone URL for repo" name=REDACTED phabricator_id=PHID-REPO-REDACTED
Grafana does not work for some reason either. When I test the connection from the UI, it fails with a 404, but from within the container everything works:
/usr/share/grafana # wget http://prometheus:30090/metrics -O /dev/stdout | head
Connecting to prometheus:30090 (10.3.0.219:30090)
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.6535e-05
go_gc_duration_seconds{quantile="0.25"} 2.6565e-05
go_gc_duration_seconds{quantile="0.5"} 2.9371e-05
We discovered something interesting: it seems the indexer is stuck waiting for the connection pool to free up. With the ephemeral port range below (32768-60999, roughly 28k ports) and ~22k sockets sitting in TIME_WAIT, that would also explain the "cannot assign requested address" errors:
indexer # netstat | grep TIME_WAIT | wc -l
22018
indexer # sysctl -a | grep net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
I'm not sure if lsof gives info on TIME_WAIT, but it might be useful to see a snippet of the output of
apk add --no-cache lsof
lsof -iTCP -P
Additionally you can pass the -debug flag to zoekt-sourcegraph-indexserver to turn on more verbose debug logs. This might help?
Some context: zoekt-sourcegraph-indexserver does the following HTTP requests:
Then, to index each repo, we spawn a subprocess, zoekt-archive-index. If it decides to index, it will fetch the working copy via an HTTP request. However, we only ever have one of these processes running at a time, and I assume that if it dies the socket-related resources are cleaned up.
This doesn't sound like anything that should leave TCP connections lying around (except maybe zoekt-archive-index?). Would be interested to see debug logs/lsof output.
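To illustrate the sort of bug that would explain it (just a Go sketch, not zoekt's actual code): an HTTP client that builds a fresh Transport per request, or closes response bodies without draining them, defeats keep-alive, so every call opens and then actively closes a TCP connection, and the client-side port sits in TIME_WAIT for about a minute afterwards.
package main

import (
	"fmt"
	"io"
	"net/http"
)

// leaky does each request on a brand-new Transport, so no connection is ever
// reused: every call opens a TCP connection and actively closes it, leaving
// the local ephemeral port in TIME_WAIT for ~60s.
func leaky(url string) error {
	c := &http.Client{Transport: &http.Transport{}}
	resp, err := c.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close() // body not drained either, which also prevents reuse
}

// reusing shares one client and fully drains each body, so keep-alive recycles
// a small pool of connections instead of churning through ephemeral ports.
var shared = &http.Client{}

func reusing(url string) error {
	resp, err := shared.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain so the connection returns to the pool
	return err
}

func main() {
	url := "http://sourcegraph-frontend-internal/" // placeholder target
	fmt.Println(leaky(url), reusing(url))
}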
@aisbaa I just emailed you to set up time for a call. My initial thoughts on this: the indexer service has been removed as of 3.11, I believe.
Oh sorry, my comment above was referring to zoekt. I forgot about the indexer service. It should be deleted! It was last present in 2.13 (from 2018)!!
deploy-sourcegraph on master
❯ git log -n1 -G indexer
commit c01dfc624b8e8072c89e6c9ae4c5404d3368eb64
Author: Chris Wendt <chrismwendt@gmail.com>
Date: Fri Dec 14 12:16:12 2018 -0800
Remove lsp-proxy, indexer, xlang-* (#152)
deploy-sourcegraph on master
❯ git describe --tags c01dfc624b8e8072c89e6c9ae4c5404d3368eb64
v2.13.0-14-gc01dfc62
Maybe the fix here is to just remove the indexer service?
Upgraded to 3.11, the issue still persists. Will upgrade to 3.12.
/home/sourcegraph # lsof -iTCP -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
zoekt-sou 8 root 5u IPv6 30943662 0t0 TCP *:6072 (LISTEN)
zoekt-sou 8 root 24u IPv4 40190741 0t0 TCP indexed-search-0.indexed-search.default.svc.cluster.local:35228->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)
zoekt-sou 8 root 31u IPv4 40194832 0t0 TCP indexed-search-0.indexed-search.default.svc.cluster.local:35260->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)
/home/sourcegraph # netstat | awk '{ print $6 }' | sort | uniq -c
1
3 ESTABLISHED
1 Foreign
1 I-Node
23052 TIME_WAIT
1 servers)
/home/sourcegraph # pstree
-+= 00001 root /sbin/tini -- /bin/sh -c zoekt-sourcegraph-indexserver -index $DATA_DIR -sourcegraph_url $SRC_FRONTEND_INTERNAL -listen :6072 -interval 1m -cpu_fraction 1.0
\-+= 00008 root zoekt-sourcegraph-indexserver -index /data/index -sourcegraph_url http://sourcegraph-frontend-internal -listen :6072 -interval 1m -cpu_fraction 1.0
\-+- 55798 root zoekt-archive-index -name REDACTED -commit 92b8ad405c11ee3149f45e180351b62f172f428a -incremental -index /data/index -parallelism 8 -file_limit 1048576 -branch HEAD -download-limit-mbps 1000 -require_ctags http://sourcegraph-frontend-internal/.internal/git/REDACTED/tar/REDACTED
\--- 55806 root /usr/local/bin/universal-ctags --_interactive=default --fields=* --languages=Basic,C,C#,C++,Clojure,Cobol,CSS,CUDA,D,Elixir,elm,Erlang,Go,haskell,Java,JavaScript,kotlin,Lisp,Lua,MatLab,ObjectiveC,OCaml,Perl,Perl6,PHP,Protobuf,Python,R,Ruby,Rust,scala,Scheme,Sh,swift,Tcl,typescript,tsx,Verilog,Vim --map-CSS=+.scss --map-CSS=+.less --map-CSS=+.sass
If I run pstree a couple of times, I see that zoekt-archive-index indexes a different repo each time.
How do you pass -debug? Adding args: to the indexed-search manifest did not work.
diff --git a/base/indexed-search/indexed-search.StatefulSet.yaml b/base/indexed-search/indexed-search.StatefulSet.yaml
index 2af21467..0d97201f 100644
--- a/base/indexed-search/indexed-search.StatefulSet.yaml
+++ b/base/indexed-search/indexed-search.StatefulSet.yaml
@@ -44,6 +44,8 @@ spec:
- mountPath: /data
name: data
- env:
+ args:
+ - -debug
image: index.docker.io/sourcegraph/zoekt-indexserver:0.0.20191204145522-b470e5f@sha256:84e9de8be269277d6e4711a61d0c9675a44d409a4bf7f7dd1b90a22175095fa7
terminationMessagePolicy: FallbackToLogsOnError
name: zoekt-indexserver
Indexed search fails to start with [FATAL tini (8)] exec -debug failed: No such file or directory
Sorry, this isn't as easy to set since we have a bunch of args we pass in. Maybe try this:
args:
- /bin/sh
- "-c"
- "zoekt-sourcegraph-indexserver -index $DATA_DIR -sourcegraph_url $SRC_FRONTEND_INTERNAL -listen :6072 -interval 1m -cpu_fraction 1.0 -debug"
I think I am seeing the same problem on sourcegraph.com, just not as bad. But a bunch of ports in TIME_WAIT with no associated PID. Best guess is zoekt-archive-index is not closing up sockets properly. I will investigate further. Apologies for this.
Edit: as a workaround you can try setting the -interval flag (currently 1m) to a higher value. This will slow down how often we poll for indexing, but will result in fewer subprocesses taking up sockets in TIME_WAIT. We should have a fix available soon, so this would just be a temporary measure.
Upgraded to 3.12.4. Within minutes after restarting with the -debug flag, TIME_WAIT grew to 20k. lsof -iTCP -P looks fine, 2-3 connections at a time.
The important question is - can this issue impact git cloning?
I've been pinging the API for a few hours to check how many repos are left to clone. The weird thing is that the number of repos to clone grew from 7317 to 7323. If I remember correctly, on Friday around 6k repos were queued.
I think I may have solved this issue a few days ago without realising it: https://github.com/sourcegraph/zoekt/commit/f438357f9b306c048899e3bc7e55da1454ff3b9c
I'm busy building a new version of the zoekt images; then I'd suggest trying it out on your 3.12.4 cluster.
The important question is - can this issue impact git cloning?
I've been pinging the API for a few hours to check how many repos are left to clone. The weird thing is that the number of repos to clone grew from 7317 to 7323. If I remember correctly, on Friday around 6k repos were queued.
hah, good question. I would suspect it shouldn't unless somehow the kernel is not isolating networking correctly. Just a thought, have you done anything recently like change the number of gitservers?
I only upgraded to 3.9.4 a week ago, then on Wednesday noticed that repository cloning got stuck. Can't recall any other changes, but will check the git log. Deploying 3.12.7.
Images are:
sourcegraph/zoekt-webserver:0.0.20200217151107-f2e5060
sourcegraph/zoekt-indexserver:0.0.20200217151405-f2e5060
However, whether this actually affects git is a good question. I would be surprised. So just to confirm: you upgraded to 3.9.4 and then noticed it got stuck a little later? What version were you on before that? I'll look over the git log for anything suspicious.
So just to confirm: you upgraded to 3.9.4 and then noticed it got stuck a little later? What version were you on before that? I'll look over the git log for anything suspicious.
We were on v3.8.1.
It seems you've fixed the socket leak; it stays at around 60 waiting connections:
/home/sourcegraph # netstat | awk '{ print $6 }' | sort | uniq -c
1
2 ESTABLISHED
1 Foreign
1 I-Node
58 TIME_WAIT
1 servers)
/home/sourcegraph # lsof -iTCP -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
zoekt-sou 9 root 3u IPv4 52552745 0t0 TCP indexed-search-0.indexed-search.default.svc.cluster.local:60566->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)
zoekt-sou 9 root 5u IPv6 52552739 0t0 TCP *:6072 (LISTEN)
The number of pending repositories jumped a little again, now it's 7329.
We have disableAutoGitUpdates set to true. Could that be partly related? It was set ages ago.
We have disableAutoGitUpdates set to true. Could that be partly related? It was set ages ago.
Yes, this is highly related. If this is set to true then repo-updater won't forward new repositories to be cloned to gitserver.
Sorry, I might be wrong. Looking at the code, it's possible that disableAutoGitUpdates doesn't affect the initial cloning, but only the updating of repos.
Just for the record: this was fixed by changing the disableAutoGitUpdates value. See https://github.com/sourcegraph/sourcegraph/issues/8458
disableAutoGitUpdates was the culprit; after setting it to false, sourcegraph started cloning repositories.
@aisbaa would you like to turn disableAutoGitUpdates back on? Trying to understand the prioritization of supporting/fixing cloning when disableAutoGitUpdates is set to true.
I've turned it off for now because the request count from sourcegraph to gitolite skyrockets when it's turned on. If it is possible to configure AutoGitUpdates to not issue so many requests, then I'm happy to use it.
You can see the jump in requests right after we set disableAutoGitUpdates: false; gitMaxConcurrentClones was set to 5:
It went from below 400 requests per minute to 1.7k-1.9k per minute. We did try setting gitMaxConcurrentClones to 8 and later we set it to 4. With 4 max concurrent clones it seems to have stabilised at 600-800 req/min, but I'm not certain whether that was because US developers started their workday or because of the changes in sourcegraph. After a few hours I set disableAutoGitUpdates: true, because I did not want to get paged in the middle of the night. Luckily all new repos were queued and in the morning everything was cloned.
Argument for keeping auto git updates disabled (disableAutoGitUpdates: true): to my understanding it will try to update repositories that have not been updated for a while, i.e. repos no one has pushed to in a while and/or no one has opened via the UI in a while. In both cases the code is likely in maintenance mode or deprecated.
Argument for keeping auto git updates enabled (disableAutoGitUpdates: false): if a webhook from gitolite fails, git auto update has a chance to catch it within a few days. Basically, it's less likely that sourcegraph has outdated code.
I don't have a definite answer as to whether I'm comfortable leaving auto updates on. I would love to have auto git updates enabled (disableAutoGitUpdates: false) if it is possible to keep the request count within a 30%-50% increase compared to having the feature disabled. Otherwise it would be nice to have newly discovered repos cloned even with disableAutoGitUpdates: true.
I've turned it off for now because the request count from sourcegraph to gitolite skyrockets when it's turned on. If it is possible to configure AutoGitUpdates to not issue so many requests, then I'm happy to use it.
On top of the semaphore, we could add in a rate limiter? Then you could comfortably set an upper bound on the number of requests per minute.
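Something along these lines (just a Go sketch to illustrate the idea, not the actual repo-updater scheduler; fetchRepo, maxConcurrent and the numbers are made up): a counting semaphore bounds how many fetches run at once, while a token-bucket limiter bounds how many may start per minute.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

const maxConcurrent = 5 // cf. gitMaxConcurrentClones

// Token bucket: at most 120 fetches per minute, with a small burst.
var limiter = rate.NewLimiter(rate.Limit(120.0/60.0), 10)

// Counting semaphore: at most maxConcurrent fetches in flight at once.
var sem = make(chan struct{}, maxConcurrent)

func fetchRepo(ctx context.Context, name string) error {
	sem <- struct{}{}        // the semaphore caps concurrency...
	defer func() { <-sem }() // ...released when the fetch finishes

	// ...while the limiter caps how often a fetch may start at all,
	// putting an upper bound on requests per minute to the code host.
	if err := limiter.Wait(ctx); err != nil {
		return err
	}

	fmt.Println("git fetch", name) // stand-in for asking gitserver to fetch
	time.Sleep(100 * time.Millisecond)
	return nil
}

func main() {
	ctx := context.Background()
	for _, r := range []string{"repo-a", "repo-b", "repo-c"} {
		go func(n string) { _ = fetchRepo(ctx, n) }(r)
	}
	time.Sleep(2 * time.Second) // crude wait so the goroutines can run
}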
You can see the jump in requests right after we set disableAutoGitUpdates: false; gitMaxConcurrentClones was set to 5:
Thank you so much for the graph, very useful context! A potential explanation occurred to me: the scheduler's state is all kept in memory. So on changing the config we will check every repo, and then slowly over time the less frequently updated repos will get updated far less. But that backoff state is stored in memory, so it is reset whenever the config changes or the process restarts. We can look into persisting that and we may get better behaviour.
It went from below 400 requests per minute to 1.7k-1.9k per minute. We did try setting gitMaxConcurrentClones to 8 and later we set it to 4. With 4 max concurrent clones it seems to have stabilised at 600-800 req/min, but I'm not certain whether that was because US developers started their workday or because of the changes in sourcegraph. After a few hours I set disableAutoGitUpdates: true, because I did not want to get paged in the middle of the night. Luckily all new repos were queued and in the morning everything was cloned.
What do you think about keeping the queue, but having a lower concurrency limit, i.e. 2 instead of 5? That may be good enough for now?
Argument for keeping auto git updates disabled (disableAutoGitUpdates: true): to my understanding it will try to update repositories that have not been updated for a while, i.e. repos no one has pushed to in a while and/or no one has opened via the UI in a while. In both cases the code is likely in maintenance mode or deprecated.
If your webhooks are reliable, this makes sense and it is ideal to keep it off. In that case, implementing cloning even when disableAutoGitUpdates is on makes sense.
Argument for keeping auto git updates enabled (disableAutoGitUpdates: false): if a webhook from gitolite fails, git auto update has a chance to catch it within a few days. Basically, it's less likely that sourcegraph has outdated code.
Yes. This is important for global search, not for people visiting the repo, since in that case it will be enqueued for an update. I suppose this downside isn't that bad, since repos that are not being updated are less important. WDYT?
I don't have a definite answer as to whether I'm comfortable leaving auto updates on. I would love to have auto git updates enabled (disableAutoGitUpdates: false) if it is possible to keep the request count within a 30%-50% increase compared to having the feature disabled. Otherwise it would be nice to have newly discovered repos cloned even with disableAutoGitUpdates: true.
I see your point of view where disableAutoGitUpdates: true actually works really well for your use case, except for the cloning aspect of course.
A few thoughts before answering the questions. I think that newly discovered repos should be cloned and indexed with high priority, because they are quite likely new and exciting projects.
The situation where thousands of new repos are added to sourcegraph at once is less likely; it can happen, and in those cases it is okay to clone slowly.
On top of the semaphore, we could add in a rate limiter? Then you could comfortably set an upper bound on the number of requests per minute.
Sounds good.
But that backoff state is stored in memory. We can look into persisting that and we may get better behaviour.
Thank you for the explanation. Have you thought of initialising the backoff state using the last commit date? The older the last commit date, the less likely the repo is to get a new commit (this is just my opinion).
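Something like this is what I have in mind (purely an illustrative Go sketch; initialInterval and the constants are invented, this is not how repo-updater actually schedules): seed each repo's first fetch interval from the age of its last commit, clamped to a sane range.
package main

import (
	"fmt"
	"time"
)

const (
	minInterval = 1 * time.Minute
	maxInterval = 24 * time.Hour
)

// initialInterval seeds a repo's first fetch interval from the age of its
// last commit: recently active repos get polled often, repos untouched for
// months start near the maximum interval.
func initialInterval(lastCommit, now time.Time) time.Duration {
	age := now.Sub(lastCommit)
	if age < 0 {
		age = 0
	}
	interval := age / 100 // poll at roughly 1/100th of the commit age
	if interval < minInterval {
		interval = minInterval
	}
	if interval > maxInterval {
		interval = maxInterval
	}
	return interval
}

func main() {
	now := time.Now()
	for _, age := range []time.Duration{2 * time.Hour, 30 * 24 * time.Hour, 365 * 24 * time.Hour} {
		fmt.Printf("last commit %v ago -> initial interval %v\n", age, initialInterval(now.Add(-age), now))
	}
}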
If your webhooks are reliable, this makes sense and it is ideal to keep it off. In that case, implementing cloning even when disableAutoGitUpdates is on makes sense.
I think queueing newly discovered repos for cloning with fairly high priority would be the most desirable. Though I guess that only happens when disableAutoGitUpdates is set to false.
... This is important for global search, not for people visiting the repo since in that case it will be enqueued for update. I suppose this downside isn't that bad, since repos that are not being updated are less important. WDYT?
I think keeping the index in sync with the code base is a really great feature. I would love to use it if I can tune it. I'll try to collect some numbers, starting with:
Let me know if there are other metrics that could be useful.
@aisbaa So we found a regression in our git fetch scheduler where roughly every 10 minutes we re-enqueue a git fetch for every repo! This regression has been around since v3.3.0. The issue is #8501. I believe once that is solved you should experiment with disableAutoGitUpdates: false again. If you are still seeing too much load we can revisit this topic and implement something we discussed above.
@aisbaa our fix is in master and will be in our next point release (3.13.2). We can also backport this fix. I am guessing you are happy to experiment with this when you upgrade to 3.13 and are not in a rush to switch it back on. If you want to set disableAutoGitUpdates: false soon, then I can cut a release.
Thank you, no need to backport, I'll upgrade to 3.13.2 shortly.
Question description
We had 15-16k repos and added an additional 24k small repos to sourcegraph. We have two clusters, and it seems that both stopped cloning repos. Only a few repositories have been cloned in the past 24 hours.
I'm able to clone manually from the gitserver pod. Could the cloning process be impeded by something else? Maybe indexing is not catching up? Something else?
Additional context
We're running v3.9.4-1.