dadlerj opened this issue 4 years ago (status: Open)
This comment is highly relevant to someone picking this up: https://github.com/sourcegraph/sourcegraph/issues/5056#issuecomment-518551181
For more context, please see this thread in Slack: https://sourcegraph.slack.com/archives/CHPC7UX16/p1588198426276700?thread_ts=1588174007.259800&cid=CHPC7UX16
Right now, the code @keegancsmith had put in place for us makes this happen every 45 days ± an 11.5-day jitter, which we added after we realized that purging all repos at once was not a good idea. Code: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@2a83e183093e3f94b68add774536f7564287beeb/-/blob/cmd/gitserver/server/cleanup.go#L35-36
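For anyone picking this up, a minimal sketch of what that schedule amounts to; the constant and function names below are assumptions, and the real values live in the cleanup.go file linked above:

```go
package cleanup

import (
	"math/rand"
	"time"
)

// These values mirror the behaviour described above: re-clone roughly every
// 45 days, with an ~11.5-day jitter so that not every repo expires at once.
// Names and the exact computation are illustrative only.
const (
	repoTTL       = 45 * 24 * time.Hour
	repoTTLJitter = 11*24*time.Hour + 12*time.Hour // ~11.5 days
)

// shouldReclone reports whether a repo cloned at lastClone is due for a
// background re-clone, spreading expiries across the jitter window.
func shouldReclone(lastClone time.Time) bool {
	jitter := time.Duration(rand.Int63n(int64(repoTTLJitter)))
	return time.Since(lastClone) > repoTTL+jitter
}
```

The jitter matters because without it every repo cloned at the same time would expire at the same time.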
https://app.hubspot.com/contacts/2762526/company/554275594 raised a good point: there is no reason this should not be doable in the background. We could clone the repository, then do an atomic FS swap of the directories, so the recloning is never visible to users at all.
Update:
I have now informed both customers that this issue can be worked around by running the following in the relevant repository:
cd ~/sourcegraph/.data/repos/$REPOSITORY
git config sourcegraph.recloneTimestamp 2147483647
In the case of https://app.hubspot.com/contacts/2762526/company/554338610, it is most desirable to make these seamless background operations and to handle the case of sourcegraph.recloneTimestamp being exactly equal to 2147483647 on our end, so they do not need to rerun any further commands.
In the case of https://app.hubspot.com/contacts/2762526/company/554275594, their repository CANNOT be recloned, as they use custom git commands and copy their repository directly into Sourcegraph's containers, so a background disk copy (not clone) + git gc + atomic swap would be ideal here.
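For reference, honouring that sentinel on our end might look roughly like the following; the config key is the one from the workaround above, while the helper name and the exec-based lookup are assumptions:

```go
package gitserver

import (
	"bytes"
	"os/exec"
	"strconv"
)

// recloneDisabled reports whether a repo has opted out of periodic re-cloning
// by setting sourcegraph.recloneTimestamp to the max-int32 sentinel
// (2147483647), as in the workaround above. Helper name and placement are
// assumptions for illustration.
func recloneDisabled(repoDir string) bool {
	cmd := exec.Command("git", "config", "--get", "sourcegraph.recloneTimestamp")
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return false // key unset or unreadable: normal schedule applies
	}
	v, err := strconv.ParseInt(string(bytes.TrimSpace(out)), 10, 64)
	return err == nil && v == 2147483647
}
```

2147483647 is simply the largest 32-bit Unix timestamp, so it effectively reads as "never re-clone".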
@dadlerj Can we get the actual logs from gitserver from when this happened? See below: we shouldn't be deleting the repository, so I just want to make sure it isn't something else that is happening.
> Add ability to schedule janitor jobs for off-hours
This should only apply to some jobs; some need to run regularly to detect bad repos, etc. I agree that our maybeReclone could run on weekends only, and that is a relatively easy change to make.
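A sketch of such a gate, assuming a hypothetical isOffHours helper checked by the janitor loop before running maybeReclone (the window itself is just an example, not an agreed policy):

```go
package janitor

import "time"

// isOffHours reports whether now falls in a window where heavy janitor jobs
// such as maybeReclone could be allowed to run. The chosen window (weekends
// plus nights, in server-local time) is illustrative only.
func isOffHours(now time.Time) bool {
	switch now.Weekday() {
	case time.Saturday, time.Sunday:
		return true
	}
	h := now.Hour()
	return h < 6 || h >= 22
}
```

Cheaper jobs that detect bad repos would keep running on every janitor pass; only the expensive re-clone would consult this check.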
> I have now informed both customers that this issue can be worked around by running the following in the relevant repository
Thanks for this, that should work. However, we shouldn't be deleting the repository; see below. I agree, though, that in the case of custom setups this behaviour is undesirable. What do you propose we do instead? Just rely on git gc?
> Right now, the code @keegancsmith had put in place for us makes this happen every 45 days +- an 11.5 day jitter we added after we realized purging all repos at once was not a good idea.
FYI this was always what we did. There was a regression (which lasted for many months before we noticed) that ended up making the jitter ineffective. That has since been fixed.
> we could clone the repository, then do an atomic FS swap of the directories -- so the recloning is never visible to users at all
We do exactly that already. Follow the path down from where we call cloneRepo in the janitor job (note that at this point the repo still exists); we pass in Overwrite: true. Note we clone to a tmp location and then do the swap.
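Roughly, the swap is two steps (a sketch with made-up names; the real behaviour is what cloneRepo with Overwrite: true does):

```go
package gitserver

import "os"

// swapClone replaces the live repo directory with a freshly cloned tmp
// directory. Names are illustrative only.
func swapClone(tmpDir, repoDir string) error {
	// 1. Remove the existing repo directory.
	if err := os.RemoveAll(repoDir); err != nil {
		return err
	}
	// 2. Move the tmp clone into place. A failure between steps 1 and 2
	// leaves no repo on disk, which is the failure mode discussed next.
	return os.Rename(tmpDir, repoDir)
}
```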
I suppose it is possible there are bugs here? E.g. if we fail between steps 1 and 2 then we would just kill the repo. If it fails, there should be logs about the "maybe reclone" cleanup failing, which would confirm that.
There are two places we delete repos:
Either way there should be logs explaining it. From the previous discussions it seems we are fairly certain this is due to maybe reclone.
Ah, I'm not sure if that's definitively the source of the issue. I defer to @slimsag on this.
So the issue was that when we are recloning we grab a "clone lock" on the directory. When running a git command we would check the clone lock, and if someone held it we would return "clone in progress". However, the existing git clone was still there to be used. So while recloning, the repo would be unavailable as reported above, even though our reclone logic does the cloning in a tmp dir. I've sent out a PR to fix that! https://github.com/sourcegraph/sourcegraph/pull/10663
Reported and requested by https://app.hubspot.com/contacts/2762526/company/693777200
This seems important for any companies with large monorepos.
> Hi Team, Have a question about the gitserver -- This morning we saw the gitserver reclone an expired repo, is there a way to configure when or how often it does this cloning? Our mono repo is pretty big and while it's cloning, users are not able to search it. It took around 30-40 minutes to reclone the repo.
>
> t=2020-11-06T17:56:30+0000 lvl=info msg="recloning expired repo" repo=$REDACTED cloned=2020-09-11T16:05:59+0000 reason=old
This shouldn't happen when recloning. It was a bug we had that we fixed in #10663. So it has been fixed since v3.16.0. I assume the above company is on a newer version than that. @tsenart does @sourcegraph/cloud have any capacity to investigate recloning preventing searches?
@keegancsmith: We can make room, yes. Can you please summarise your latest understanding of the problem and where to begin? Any solutions in mind?
The first thing to do is try to recreate it. I would recommend updating the git clone code to add in an arbitrary time.Sleep, then getting the reclone job to run (you can adjust some timestamps).
If you look at the linked PR, the problem was that while recloning we held a "clone lock". This clone lock prevented normal commands from running, even though the clone was available. The PR ensured we only checked the clone lock if the repo was not cloned already.
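In other words, the ordering after the fix is roughly the following; the names are stand-ins, not the actual gitserver code:

```go
package gitserver

// shouldReportCloning decides whether a request gets the "clone in progress"
// response. Before the fix, holding the clone lock was enough to report
// cloning; after the fix, an existing on-disk clone is served even while a
// background re-clone holds the lock. Both parameters stand in for the real
// checks in the exec handler.
func shouldReportCloning(cloned, cloneLockHeld bool) bool {
	if cloned {
		return false // an existing clone is always usable
	}
	return cloneLockHeld // only a genuinely missing repo reports "cloning"
}
```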
A guess is some behaviour around EnsureRevision causing a fetch, which relies on getting a clone lock. However, I am unsure.
@keegancsmith @tsenart I'll investigate the recloning behavior today.
@keegancsmith @tsenart I inserted a time.Sleep(time.Hour) at cmd/gitserver/server/server.go#L821 and forced one of my local repos to be recloned.
Search worked as intended, but navigating into a file (by way of its top-level links, or clicking within results) took me to the interstitial page showing "Cloning in progress - starting clone".
Nice find! Educated guess: this may be related to the stuff @tsenart and I want to remove: "GitRepo". I would guess we validate that a repo exists, which means repo-updater asks gitserver for info => gitserver says it's cloning. I suspect consolidating the places we check clone progress, so that they do not report anything if the repo dir exists, would fix this without even knowing the precise cause of the above code path. That is how I fixed the issue previously in the exec handler, but we probably need to do the same sort of check in other handlers.
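A shared helper along these lines is the kind of check every such handler could consult; the name and the HEAD heuristic are assumptions:

```go
package gitserver

import (
	"os"
	"path/filepath"
)

// repoCloned reports whether a usable git clone already exists at dir, by
// checking for the HEAD file of a bare repository. The name and heuristic
// are illustrative; the point is that every handler consults the same
// on-disk check before deciding to return "clone in progress".
func repoCloned(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, "HEAD"))
	return err == nil
}
```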
ACK, will investigate and discuss with @tsenart. I also grabbed the git commands that were run for unsuccessful requests while the sleep-delayed clone was in progress, just to confirm the state of the repo on disk:
$ git rev-parse 22c6ec76d9c521247fddc61a09551eb2cc504a00^{tree}
2177089955da338ed39cc1efec92c5a917e96d89
$ git cat-file -t -- 2177089955da338ed39cc1efec92c5a917e96d89
tree
$ git rev-parse main^0
22c6ec76d9c521247fddc61a09551eb2cc504a00
$ git ls-tree --long --full-name -z 22c6ec76d9c521247fddc61a09551eb2cc504a00 -- TOML.sublime-syntax
100644 blob ec873612e7b286c9a6069340187ceb757c27f83c 6827 TOML.sublime-syntax
$ git rev-parse 22c6ec76d9c521247fddc61a09551eb2cc504a00^0
22c6ec76d9c521247fddc61a09551eb2cc504a00
@sourcegraph/core-application Wanted to alert you to this issue since it's on a board I'm not as familiar with, and @keegancsmith mentioned it would be your issue now in the conversation at https://sourcegraph.slack.com/archives/C0W2E592M/p1626104961005400.
Heads up @jplahn @dan-mckean - the "team/repo-management" label was applied to this issue.
Reported by
We run a repo janitor job that re-clones repos from scratch rather than running git gc, due to the former generally being faster.
The first customer above requested to be able to customize this, particularly to run it off-hours (since clones take a while to finish):