gitserver is unable to recover from failed clone/fetch operation

daxmc99 commented 3 years ago

tl;dr

A git operation that makes network requests to the code host is terminated by the code host.

Steps to reproduce:

We get an error from here: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/gitserver/server/server.go?L1952&utm_product_name=GoLand&utm_product_version=GoLand#L1951:22 The logs typically show: failed to ensure HEAD exists: failed to fetch remote info: exit status 128

The underlying git command run here is git remote show $REPO, which performs a web request to get data on the repo.

Large numbers of repos are reported as being unable to sync. However, after manually triggering a sync (by visiting the repo or via repo settings) the repo syncs correctly.

~This error is very hard to reproduce but it has a large negative effect in that it causes a group of repos to appear as not being able to sync which typically results in support needing to be engaged.~ The error is hard to reproduce consistently, but it occurs quite often on managed instances where Cloud NAT is used (see comments below).

During a few debug sessions, it was observed that there is a high amount of RX packet loss on these instances.

Hallmark of this issue: High packet loss on the default gateway

sudo netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
br-ded03  1500  6161021      0      0 0       9637446      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
ens4      1460 99260061      0   2904 0       6273142      0      0      0 BMRU
lo       65536    46596      0      0 0         46596      0      0      0 LRU
veth022f  1500      172      0      0 0          4498      0      0      0 BMRU
...

The specific notification: "Some repositories could not be synced"

Expected behavior:

git remote show does not error and the notification "Some repositories could not be synced" is not thrown.

Actual behavior:

These calls fail sporadically, and the affected repos are listed as being unable to sync.

See linked issues for more in-depth debugging

daxmc99 commented 3 years ago

Workaround: Open the failing repositories via the repository status page. Manually visiting the repository in Sourcegraph will enqueue a manual sync, which succeeds since these repositories aren't actually unreachable.

michaellzc commented 2 years ago

danieldides commented 2 years ago

Looks like this one is creeping up in some customer-managed deployments. We're seeing a few customer issues where repos are not updated until manually prompted. The Cloud NAT Gateway metrics show an abrupt drop in connections and a spike in sent packets being rejected by GitHub. Similarly, if we inspect the logs during that time we can see connection requests being DROPPED by the upstream server (GitHub).

Definitely something to keep an eye on. If this one continues to crop up we may need to start taking action to more aggressively retry operations on failure and add in some specific monitoring on Cloud NAT metrics to alert when we're being dropped.

Screen Shot 2022-04-20 at 16 16 22

sourcegraph-bot-2 commented 2 years ago

Heads up @jplahn - the "team/repo-management" label was applied to this issue.

DaedalusG commented 2 years ago

Hey @sourcegraph/repo-management, just wanting to flag that on-prem users are still running into this fairly regularly in v3.42.x =<. Currently we're advising removal of the repos and pointing admins to the following doc: https://docs.sourcegraph.com/admin/how-to/remove-repo#remove-corrupted-repository-data-from-sourcegraph

jplahn commented 2 years ago

@ryphil for prioritization

DaedalusG commented 2 years ago

For Sourcegraph admins running into this issue, we advise setting the following env vars in gitserver until our team can come to a more permanent solution:

    - 'SRC_ENABLE_GC_AUTO=true'
    - 'SRC_ENABLE_SG_MAINTENANCE=false'

This effectively reverts sg maintenance.

mollylogue commented 2 years ago

@DaedalusG I'm looking into this and gaining context currently. I'm a bit confused as to the connection between sg maintenance/repo corruptions and the failure to fetch issue. Those seem distinct and unrelated to me, but maybe I'm missing something? Are we just seeing that running git remote show is returning errors when the repo is corrupted?

DaedalusG commented 2 years ago

@mollylogue my apologies, that message is indeed unrelated to this exact stream. I got my wires crossed since both issues are common on monorepos, and I was running into both issues with a customer at the time.

Actually though, for some historical context on this, check out the following issue: https://github.com/sourcegraph/customer/issues/586 This was from around the time we started developing sg maintenance and has to do with timeouts during fetch requests between zoekt and gitserver on a monorepo. Apologies if it's a goose chase, but it might be worth a look.
