sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Sourcegraph stopped cloning and indexing repos #8400

Closed: aisbaa closed this issue 4 years ago

aisbaa commented 4 years ago

Question description

We had 15-16k repos and added an additional 24k small repos to Sourcegraph. We have two clusters, and it seems that both have stopped cloning repos. Only a few repositories have been cloned in the past 24 hours.

> (sg-aws-us-east-1) > kubectl logs --since 24h -l app=gitserver -c gitserver | grep 'cloning repo' | wc -l
      30
> (sg-aws-us-east-1) > kubectl logs --since 24h -l app=gitserver -c gitserver | grep 'cloning repo' | wc -l
      47

I'm able to clone manually from the gitserver pod. Could the cloning process be impeded by something else? Maybe indexing is not catching up? Something else?

Additional context

We're running v3.9.4-1.

unknwon commented 4 years ago

I believe @keegancsmith has more context on this.

mrnugget commented 4 years ago

Just to get a little bit more context:

  1. Are all 24k additional repos synced to the Sourcegraph instance? As in: if you go to /site-admin/repositories and scroll to the bottom, is the number in X repositories total (showing first 20) what you expect?
  2. I remember from previous conversations that gitolite was really slow when listing repositories — could the cause be that it's still listing repositories and hasn't added all of them yet?
  3. Are there any other seemingly relevant log lines in the log output?
aisbaa commented 4 years ago
  1. 40688 repositories total (showing first 20)
  2. I think repo listing completed successfully, since I can search for these new repositories using the repo: keyword. When I click on a search result, Sourcegraph starts cloning.
  3. I've discovered one more issue which might be related: the indexer fails to reach the sourcegraph-frontend-internal svc.

Logs:

2020/02/14 08:30:36 error indexing gitolite/namespace/repo@hash: Get http://sourcegraph-frontend-internal/.internal/search/configuration: dial tcp REDACTED_IP:80: connect: cannot assign requested address

I've tried issuing request from indexer pod:

/home/sourcegraph # wget http://sourcegraph-frontend-internal/
Connecting to sourcegraph-frontend-internal (REDACTED_IP:80)
wget: server returned error: HTTP/1.1 404 Not Found
/home/sourcegraph # wget http://sourcegraph-frontend-internal/.internal/
Connecting to sourcegraph-frontend-internal (REDACTED_IP:80)
wget: can't connect to remote host (REDACTED_IP): Address not available
aisbaa commented 4 years ago

Deleted and re-created the sourcegraph-frontend-internal service; will see if that solves the indexer issue.

aisbaa commented 4 years ago

Recreating the sourcegraph-frontend-internal service did not help.

slimsag commented 4 years ago

I've discovered one more issue which might be related: the indexer fails to reach the sourcegraph-frontend-internal svc.

This is likely related, and signals the problem may be with your sourcegraph-frontend deployment. Can you check the logs of repo-updater to see if it also has similar connectivity issues with the frontend?

Also check the frontend logs to see if there is any indication there as to why it might not be responding (you can also just shoot us those in an email to support@sourcegraph.com if you don't want to dig through them yourself).

For background, indexing is an out-of-band process so it cannot impede cloning. repo-updater is what determines the schedule of repository updates and effectively just asks gitserver "try to git pull this repo now". If repo-updater cannot talk to the frontend-internal service, it may be blocked from enqueuing repo updates.
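
For example, something along these lines should surface connectivity errors (a sketch, assuming the label selectors and container names from the standard deploy-sourcegraph manifests):

# repo-updater: look for errors talking to the internal frontend API
kubectl logs --since 24h -l app=repo-updater -c repo-updater | grep -i 'frontend-internal'
# frontend: look for anything suspicious in the last day
kubectl logs --since 24h -l app=sourcegraph-frontend -c frontend | grep -i error | tail -n 50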

Also, if you have the builtin Grafana deployment active, you may also check https://sourcegraph.example.com/-/debug/grafana to see if any alerts have been firing there which would tell us if there may be a different issue here.

aisbaa commented 4 years ago

Repo-updater works as usual; its logs contain no connection errors. Frontend logs on the standby cluster contain only refresh requests coming from gitolite. Repo-updater does log quite a few warnings, though:

t=2020-02-14T14:37:11+0000 lvl=warn msg="unable to construct clone URL for repo" name=REDACTED phabricator_id=PHID-REPO-REDACTED

Grafana does not work either, for some reason. When I test the connection from the UI, it fails with a 404. But from within the container everything works:

/usr/share/grafana # wget http://prometheus:30090/metrics -O /dev/stdout | head
Connecting to prometheus:30090 (10.3.0.219:30090)
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.6535e-05
go_gc_duration_seconds{quantile="0.25"} 2.6565e-05
go_gc_duration_seconds{quantile="0.5"} 2.9371e-05
aisbaa commented 4 years ago

We discovered something interesting: it seems that the indexer is stuck waiting for the connection pool to free up:

indexer # netstat | grep TIME_WAIT | wc -l
22018
indexer # sysctl -a | grep net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999
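
To see which peer those TIME_WAIT sockets point at, something like this works with the same busybox netstat (a sketch; column 5 is the foreign address here, and column numbers can differ between netstat builds):

indexer # netstat -n | awk '/TIME_WAIT/ { print $5 }' | sort | uniq -c | sort -rn | head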
keegancsmith commented 4 years ago

I'm not sure if lsof gives info on TIME_WAIT, but it might be useful to see a snippet of the output of

apk add --no-cache lsof
lsof -iTCP -P
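
Note that sockets in TIME_WAIT usually have no owning process, so lsof may not list them. If you can install iproute2 in the container, ss can count them directly (a rough sketch; subtract one for the header line):

apk add --no-cache iproute2
ss -tan state time-wait | wc -l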

Additionally you can pass the -debug flag to zoekt-sourcegraph-indexserver to turn on more verbose debug logs. This might help?

Some context:

zoekt-sourcegraph-indexserver makes the following HTTP requests:

Then, to index each repo, we spawn a subprocess, zoekt-archive-index. If it decides to index, it will fetch the working copy via an HTTP request. However, we only ever have one of these processes running at a time, and I assume that if it dies the socket-related resources are cleaned up.

This doesn't sound like anything that should leave TCP connections lying around (except maybe zoekt-archive-index?). Would be interested to see debug logs/lsof output.

beyang commented 4 years ago

@aisbaa I just emailed you to set up time for a call. My initial thoughts on this:

keegancsmith commented 4 years ago

The indexer service has been removed as of 3.11 I believe.

Oh sorry, my comment above was referring to zoekt. I forgot about the indexer service. It should be deleted! It was last present in 2.13 (from 2018)!!

deploy-sourcegraph on master 
❯ git log -n1 -G indexer                                       
commit c01dfc624b8e8072c89e6c9ae4c5404d3368eb64
Author: Chris Wendt <chrismwendt@gmail.com>
Date:   Fri Dec 14 12:16:12 2018 -0800

    Remove lsp-proxy, indexer, xlang-* (#152)

deploy-sourcegraph on master 
❯ git describe --tags c01dfc624b8e8072c89e6c9ae4c5404d3368eb64 
v2.13.0-14-gc01dfc62

Maybe the fix here is to just remove the indexer service?

aisbaa commented 4 years ago

Upgraded to 3.11; the issue still persists. Will upgrade to 3.12.

/home/sourcegraph # lsof -iTCP -P
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
zoekt-sou   8 root    5u  IPv6 30943662      0t0  TCP *:6072 (LISTEN)
zoekt-sou   8 root   24u  IPv4 40190741      0t0  TCP indexed-search-0.indexed-search.default.svc.cluster.local:35228->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)
zoekt-sou   8 root   31u  IPv4 40194832      0t0  TCP indexed-search-0.indexed-search.default.svc.cluster.local:35260->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)

/home/sourcegraph # netstat | awk '{ print $6 }' | sort | uniq -c
      1
      3 ESTABLISHED
      1 Foreign
      1 I-Node
  23052 TIME_WAIT
      1 servers)

/home/sourcegraph # pstree
-+= 00001 root /sbin/tini -- /bin/sh -c zoekt-sourcegraph-indexserver -index $DATA_DIR -sourcegraph_url $SRC_FRONTEND_INTERNAL -listen :6072 -interval 1m -cpu_fraction 1.0
 \-+= 00008 root zoekt-sourcegraph-indexserver -index /data/index -sourcegraph_url http://sourcegraph-frontend-internal -listen :6072 -interval 1m -cpu_fraction 1.0
   \-+- 55798 root zoekt-archive-index -name REDACTED -commit 92b8ad405c11ee3149f45e180351b62f172f428a -incremental -index /data/index -parallelism 8 -file_limit 1048576 -branch HEAD -download-limit-mbps 1000 -require_ctags http://sourcegraph-frontend-internal/.internal/git/REDACTED/tar/REDACTED
     \--- 55806 root /usr/local/bin/universal-ctags --_interactive=default --fields=* --languages=Basic,C,C#,C++,Clojure,Cobol,CSS,CUDA,D,Elixir,elm,Erlang,Go,haskell,Java,JavaScript,kotlin,Lisp,Lua,MatLab,ObjectiveC,OCaml,Perl,Perl6,PHP,Protobuf,Python,R,Ruby,Rust,scala,Scheme,Sh,swift,Tcl,typescript,tsx,Verilog,Vim --map-CSS=+.scss --map-CSS=+.less --map-CSS=+.sass

If I run pstree a couple of times, I see that zoekt-archive-index is indexing a different repo each time.

aisbaa commented 4 years ago

How do you pass -debug? Adding args: to the indexed-search manifest did not work.

diff --git a/base/indexed-search/indexed-search.StatefulSet.yaml b/base/indexed-search/indexed-search.StatefulSet.yaml
index 2af21467..0d97201f 100644
--- a/base/indexed-search/indexed-search.StatefulSet.yaml
+++ b/base/indexed-search/indexed-search.StatefulSet.yaml
@@ -44,6 +44,8 @@ spec:
         - mountPath: /data
           name: data
       - env:
+        args:
+          - -debug
         image: index.docker.io/sourcegraph/zoekt-indexserver:0.0.20191204145522-b470e5f@sha256:84e9de8be269277d6e4711a61d0c9675a44d409a4bf7f7dd1b90a22175095fa7
         terminationMessagePolicy: FallbackToLogsOnError
         name: zoekt-indexserver

Indexed search fails to start with [FATAL tini (8)] exec -debug failed: No such file or directory

keegancsmith commented 4 years ago

Sorry, this isn't as easy to set since we have a bunch of args we pass in. Maybe try this:

args:
  - /bin/sh
  - "-c"
  - "zoekt-sourcegraph-indexserver -index $DATA_DIR -sourcegraph_url $SRC_FRONTEND_INTERNAL -listen :6072 -interval 1m -cpu_fraction 1.0 -debug"

I think I am seeing the same problem on sourcegraph.com, just not as bad: a bunch of ports in TIME_WAIT with no associated PID. My best guess is that zoekt-archive-index is not closing sockets properly. I will investigate further. Apologies for this.

Edit: as a workaround you can try changing -interval 1m to a higher value. This will slow down how often we poll for indexing, but will result in fewer subprocesses taking up sockets in TIME_WAIT. We should have a fix available soon, so this would just be a temporary measure.
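
For example, reusing the shell wrapper above with only the interval changed (10m here is just an illustrative value):

args:
  - /bin/sh
  - "-c"
  - "zoekt-sourcegraph-indexserver -index $DATA_DIR -sourcegraph_url $SRC_FRONTEND_INTERNAL -listen :6072 -interval 10m -cpu_fraction 1.0 -debug"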

aisbaa commented 4 years ago

Upgraded to 3.12.4. Within minutes of restarting with the -debug flag, TIME_WAIT grew to 20k. lsof -iTCP -P looks fine, 2-3 connections at a time.

aisbaa commented 4 years ago

The important question is: can this issue impact git cloning?

I've been pinging the API for the past few hours to check how many repos are left to clone. The weird thing is that the number of repos to clone grew from 7317 to 7323. If I remember correctly, on Friday around 6k repos were queued.

keegancsmith commented 4 years ago

I think I may have solved this issue a few days ago without realising it: https://github.com/sourcegraph/zoekt/commit/f438357f9b306c048899e3bc7e55da1454ff3b9c

I'm busy building a new version of the zoekt images; then I'd suggest trying them out on your 3.12.4 cluster.

keegancsmith commented 4 years ago

The important question is: can this issue impact git cloning?

I've been pinging the API for the past few hours to check how many repos are left to clone. The weird thing is that the number of repos to clone grew from 7317 to 7323. If I remember correctly, on Friday around 6k repos were queued.

Hah, good question. I would suspect it shouldn't, unless somehow the kernel is not isolating networking correctly. Just a thought: have you done anything recently like changing the number of gitservers?
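
For reference, something like this shows the current replica count (assuming gitserver is deployed as a StatefulSet named gitserver, as in the standard manifests):

kubectl get statefulset gitserver -o jsonpath='{.spec.replicas}'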

aisbaa commented 4 years ago

I only upgraded to 3.9.4 a week ago, then on Wednesday noticed that repository cloning had gotten stuck. Can't recall any other changes, but I will check the git log. Deploying 3.12.7.

keegancsmith commented 4 years ago

Images are:

sourcegraph/zoekt-webserver:0.0.20200217151107-f2e5060
sourcegraph/zoekt-indexserver:0.0.20200217151405-f2e5060

However, the point about this actually affecting git is a good one. I would be surprised. So just to confirm: you upgraded to 3.9.4 and then noticed it got stuck a little later? What version were you on before that? I'll take a look over the git log for anything suspicious.

aisbaa commented 4 years ago

So just to confirm: you upgraded to 3.9.4 and then noticed it got stuck a little later? What version were you on before that? I'll take a look over the git log for anything suspicious.

We were on v3.8.1.

It seems you've fixed the socket leak; it stays at around 60 connections in TIME_WAIT:

/home/sourcegraph # netstat | awk '{ print $6 }' | sort | uniq -c
      1
      2 ESTABLISHED
      1 Foreign
      1 I-Node
     58 TIME_WAIT
      1 servers)
/home/sourcegraph # lsof -iTCP -P
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
zoekt-sou   9 root    3u  IPv4 52552745      0t0  TCP indexed-search-0.indexed-search.default.svc.cluster.local:60566->sourcegraph-frontend-internal.default.svc.cluster.local:80 (ESTABLISHED)
zoekt-sou   9 root    5u  IPv6 52552739      0t0  TCP *:6072 (LISTEN)
aisbaa commented 4 years ago

The number of pending repositories jumped a little again; now it's 7329.

aisbaa commented 4 years ago

We have disableAutoGitUpdates set to true. Could that be partly related? It was set ages ago.

mrnugget commented 4 years ago

We have disableAutoGitUpdates set to true. Could that be partly related? It was set ages ago.

Yes, this is highly related. If this is set to true, then repo-updater won't forward new repositories to gitserver for cloning.

mrnugget commented 4 years ago

Sorry, I might be wrong. Looking at the code, it's possible that disableAutoGitUpdates doesn't affect the initial cloning, but only the updating of repos.

mrnugget commented 4 years ago

Just for the record: this was fixed by changing the disableAutoGitUpdates value. See https://github.com/sourcegraph/sourcegraph/issues/8458

aisbaa commented 4 years ago

disableAutoGitUpdates was the culprit; after setting it to false, Sourcegraph started cloning repositories.

keegancsmith commented 4 years ago

@aisbaa would you like to turn disableAutoGitUpdates back on? I'm trying to understand the prioritization of supporting/fixing cloning when disableAutoGitUpdates is set to true.

aisbaa commented 4 years ago

I've turned it off for now because the request count from Sourcegraph to gitolite skyrockets when it's turned on. If it is possible to configure auto git updates to not issue so many requests, then I'm happy to use it.

You can see the jump in requests right after we set disableAutoGitUpdates: false; gitMaxConcurrentClones was set to 5:

[Screenshot: gitolite request-rate graph, 2020-02-18 21:56]

It went from below 400 requests per minute to 1.7k-1.9k per minute. We tried setting gitMaxConcurrentClones to 8 and later set it to 4. With 4 max concurrent clones it seems to have stabilised at 600-800 req/min, but I'm not certain whether that was because US developers started their workday or due to the change in Sourcegraph. After a few hours I set disableAutoGitUpdates: true, because I did not want to get paged in the middle of the night. Luckily all new repos were already queued, and by morning everything was cloned.
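
For reference, the relevant part of our site configuration during the experiment looked roughly like this (only the two settings discussed here; everything else omitted):

{
  "disableAutoGitUpdates": false,
  "gitMaxConcurrentClones": 4
}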

Argument for keeping auto git updates disabled (disableAutoGitUpdates: true): to my understanding, it will try to update repositories that have not been updated for a while, i.e. repos that no one has pushed to and/or opened via the UI for a while. In both cases it's likely that the code is in maintenance mode or deprecated.

Argument for keeping auto git updates enabled (disableAutoGitUpdates: false): in case a webhook from gitolite fails, git auto update has a chance to catch it within a few days. Basically, we're less likely to have outdated code in Sourcegraph.


I don't have a definite answer as to whether I'm comfortable leaving auto updates on. I would love to have auto git updates enabled (disableAutoGitUpdates: false) if it is possible to keep the request count within a 30%-50% increase compared to having the feature disabled. Otherwise it would be nice to have newly discovered repos cloned even with disableAutoGitUpdates: true.

keegancsmith commented 4 years ago

I've turned it off for now because the request count from Sourcegraph to gitolite skyrockets when it's turned on. If it is possible to configure auto git updates to not issue so many requests, then I'm happy to use it.

On top of the semaphore, we could add in a rate limiter? Then you could comfortably set an upper bound on the number of requests per minute.

You can see the jump in requests right after we set disableAutoGitUpdates: false; gitMaxConcurrentClones was set to 5:

Thank you so much for the graph, very useful context! A potential explanation occurred to me: the scheduler's state is all kept in memory. So on changing the config we will check every repo, and then slowly over time the less frequently updated repos will get updated far less often. But that backoff state is only kept in memory. We can look into persisting it, and we may get better behaviour.

It went from below 400 requests per minute to 1.7k-1.9k per minute. We tried setting gitMaxConcurrentClones to 8 and later set it to 4. With 4 max concurrent clones it seems to have stabilised at 600-800 req/min, but I'm not certain whether that was because US developers started their workday or due to the change in Sourcegraph. After a few hours I set disableAutoGitUpdates: true, because I did not want to get paged in the middle of the night. Luckily all new repos were already queued, and by morning everything was cloned.

What do you think about keeping the queue but using a lower concurrency limit, i.e. 2 instead of 5? That may be good enough for now?

Argument for keeping auto git updates disabled (disableAutoGitUpdates: true): to my understanding, it will try to update repositories that have not been updated for a while, i.e. repos that no one has pushed to and/or opened via the UI for a while. In both cases it's likely that the code is in maintenance mode or deprecated.

If your webhooks are reliable, this makes sense and it's ideal to keep it off. In that case, implementing cloning even when disableAutoGitUpdates is on makes sense.

Argument for keeping auto git updates enabled (disableAutoGitUpdates: false): in case a webhook from gitolite fails, git auto update has a chance to catch it within a few days. Basically, we're less likely to have outdated code in Sourcegraph.

Yes. This is important for global search, not for people visiting the repo, since in that case it will be enqueued for an update anyway. I suppose this downside isn't that bad, since repos that are not being updated are less important. WDYT?

I don't have a definite answer as to whether I'm comfortable leaving auto updates on. I would love to have auto git updates enabled (disableAutoGitUpdates: false) if it is possible to keep the request count within a 30%-50% increase compared to having the feature disabled. Otherwise it would be nice to have newly discovered repos cloned even with disableAutoGitUpdates: true.

I see your point of view: disableAutoGitUpdates: true actually works really well for your use case, except for the cloning aspect, of course.

aisbaa commented 4 years ago

A few thoughts before answering the questions: I think newly discovered repos should be cloned and indexed with high priority, because they're quite likely to be new and exciting projects.

The situation where thousands of new repos are added to Sourcegraph is less likely. Yet it can happen, and in those cases it is okay to clone slowly.


On top of the semaphore, we could add in a rate limiter? Then you could comfortably set an upper bound on the number of requests per minute.

Sounds good.

But that backoff state is only kept in memory. We can look into persisting it, and we may get better behaviour.

Thank you for the explanation. Have you thought of initialising the backoff state using the last commit date? The older the last commit date, the less likely the repo is to get a new commit. (This is just my intuition.)

If your webhooks are reliable, this makes sense and it's ideal to keep it off. In that case, implementing cloning even when disableAutoGitUpdates is on makes sense.

I think queueing newly discovered repos for cloning with fairly high priority would be the most desirable behaviour, though I guess that only happens when disableAutoGitUpdates is set to false.

... This is important for global search, not for people visiting the repo, since in that case it will be enqueued for an update anyway. I suppose this downside isn't that bad, since repos that are not being updated are less important. WDYT?

I think keeping the index in sync with the code base is a really great feature. I would love to use it if I can tune it. I'll try to collect some numbers, starting with:

Let me know if there are other metrics that could be useful.

keegancsmith commented 4 years ago

@aisbaa So we found a regression in our git fetch scheduler where roughly every 10 minutes we re-enqueue a git fetch for every repo! This regression has been around since v3.3.0. The issue is #8501. I believe once that is solved you should experiment with disableAutoGitUpdates: false again. If you are still seeing too much load, we can revisit this topic and implement something we discussed above.

keegancsmith commented 4 years ago

@aisbaa our fix is in master and will be in our next point release (3.13.2). We can also backport it. I am guessing you are happy to experiment with this when you upgrade to 3.13 and are not in a rush to switch it back on. If you want to set disableAutoGitUpdates: false soon, then I can cut a release.

aisbaa commented 4 years ago

Thank you, no need to backport; I'll upgrade to 3.13.2 shortly.