Closed: daxmc99 closed this issue 5 months ago.
Workaround: open the failing repositories via the repository status page. Manually visiting the repository in Sourcegraph enqueues a manual sync, which succeeds since these repositories aren't actually unreachable.
Looks like this one is cropping up in some customer-managed deployments. We're seeing a few customer issues where repos are not updated until manually prompted. The Cloud NAT gateway metrics show an abrupt drop in connections and a spike in sent packets being rejected by GitHub. Similarly, if we inspect the logs during that time, we can see connection requests being DROPPED by the upstream server (GitHub).
Definitely something to keep an eye on. If this one continues to crop up, we may need to start retrying operations on failure more aggressively and add specific monitoring on Cloud NAT metrics to alert when we're being dropped.
Heads up @jplahn - the "team/repo-management" label was applied to this issue.
Hey @sourcegraph/repo-management, just wanting to flag that on-prem users are still running into this fairly regularly in v3.42.x and earlier. Currently we're advising removal of the repos and pointing admins to the following doc: https://docs.sourcegraph.com/admin/how-to/remove-repo#remove-corrupted-repository-data-from-sourcegraph
@ryphil for prioritization
For Sourcegraph admins running into this issue, we advise setting the following env vars in gitserver until our team arrives at a more permanent solution:
- 'SRC_ENABLE_GC_AUTO=true'
- 'SRC_ENABLE_SG_MAINTENANCE=false'
This effectively reverts sg maintenance (git's built-in garbage collection runs instead); a sketch of where to set these variables follows below.
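For reference, here's a minimal sketch of applying the workaround in a Kubernetes deployment. It assumes the standard Sourcegraph layout where gitserver runs as a StatefulSet named gitserver in a sourcegraph namespace; adjust the resource and namespace names for your cluster.

```shell
# Set the workaround flags on the gitserver StatefulSet (names are assumptions).
# Note: changing env vars triggers a rolling restart of the gitserver pods.
kubectl -n sourcegraph set env statefulset/gitserver \
  SRC_ENABLE_GC_AUTO=true \
  SRC_ENABLE_SG_MAINTENANCE=false

# Docker Compose users would instead add the same two variables to the
# gitserver-0 service's "environment:" section and recreate the container.
```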
@DaedalusG I'm looking into this and gaining context currently. I'm a bit confused about the connection between sg maintenance/repo corruption and the failure-to-fetch issue. Those seem distinct and unrelated to me, but maybe I'm missing something? Are we just seeing that running `git remote show` returns errors when the repo is corrupted?
@mollylogue my apologies, that message is indeed unrelated to this exact thread. I got my wires crossed since both issues are common on monorepos, and I was running into both with a customer at the time.
Actually, for some historical context on this, check out the following issue: https://github.com/sourcegraph/customer/issues/586. This was from around the time we started developing sg maintenance and has to do with timeouts during fetch requests between zoekt and gitserver on a monorepo. Apologies if it's a goose chase, but it might be worth a look.
tl;dr
A git operation that makes network requests to the code host is terminated by the code host.
Steps to reproduce:
The sync fails with:
failed to ensure HEAD exists: failed to fetch remote info: exit status 128
The underlying git command run here is `git remote show $REPO`, which performs a web request to get data on the repo. ~~This error is very hard to reproduce, but it has a large negative effect in that it causes a group of repos to appear as unable to sync, which typically results in support needing to be engaged.~~ The error is hard to reproduce consistently, but it occurs quite often on managed instances where Cloud NAT is used (see comments below).
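To confirm whether the same network call succeeds from the affected host, something along these lines can be run from inside the gitserver container (the repo URL below is a placeholder, not a command gitserver itself exposes):

```shell
# Roughly the same request gitserver makes during sync (placeholder URL).
git remote show https://github.com/<org>/<repo>.git

# Repeat a lightweight equivalent a few times to surface intermittent drops.
for i in $(seq 1 5); do
  git ls-remote --exit-code https://github.com/<org>/<repo>.git HEAD \
    || echo "attempt $i failed"
  sleep 2
done
```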
During a few debug sessions, it was observed that there is a high amount of RX packet loss on these instances.
Hallmark of this issue: High packet loss on the default gateway
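A quick way to eyeball that packet loss on the VM or node; the interface name eth0 is an assumption, so substitute whatever the default route uses:

```shell
# Per-interface counters; check the RX "dropped" and "errors" fields.
ip -s link show eth0

# The same counters in raw form for all interfaces.
cat /proc/net/dev
```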
The specific notification: "Some repositories could not be synced"
Expected behavior:
`git remote show` does not error, and the notification "Some repositories could not be synced" is not shown.
Actual behavior:
These calls fail sporadically and list repos as unable to sync.
See linked issues for more in-depth debugging.