sourcegraph / sourcegraph-public-snapshot


Add ability to schedule janitor jobs for off-hours #10274

Open dadlerj opened 4 years ago

dadlerj commented 4 years ago

Reported by

We run a repo janitor job that re-clones repos from scratch rather than running git gc, due to the former generally being faster.

The first customer above requested to be able to customize this, particularly to run it off-hours (since clones take a while to finish):

  • Can we do that per repo?
  • Can we set the exact time of that? (Usually these kinds of operations should be targeted for off-hours.)

mrnugget commented 4 years ago

This comment is highly relevant to someone picking this up: https://github.com/sourcegraph/sourcegraph/issues/5056#issuecomment-518551181

slimsag commented 4 years ago

For more context, please see this thread in Slack: https://sourcegraph.slack.com/archives/CHPC7UX16/p1588198426276700?thread_ts=1588174007.259800&cid=CHPC7UX16

Right now, the code @keegancsmith had put in place makes this happen every 45 days ± an 11.5-day jitter we added after we realized purging all repos at once was not a good idea. Code: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@2a83e183093e3f94b68add774536f7564287beeb/-/blob/cmd/gitserver/server/cleanup.go#L35-36
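
For reference, a minimal sketch of how that interval-plus-jitter decision could be expressed in Go (the constant values and function names are illustrative; the real constants live in cleanup.go at the link above):

```go
package cleanup

import (
	"math/rand"
	"time"
)

// Illustrative values only: reclone roughly every 45 days, spread out by a
// per-repo jitter of up to ~11.5 days so that not all repos are purged at once.
const (
	repoTTL    = 45 * 24 * time.Hour
	repoJitter = 276 * time.Hour // roughly 11.5 days
)

// randomJitter picks a jitter in [0, repoJitter).
func randomJitter() time.Duration {
	return time.Duration(rand.Int63n(int64(repoJitter)))
}

// shouldReclone reports whether a repo cloned at clonedAt is due for a
// background reclone, given that repo's jitter.
func shouldReclone(clonedAt, now time.Time, jitter time.Duration) bool {
	return now.Sub(clonedAt) > repoTTL+jitter
}
```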

https://app.hubspot.com/contacts/2762526/company/554275594 raised a good point: there is no reason this should not be doable in the background. We could clone the repository, then do an atomic FS swap of the directories, so the recloning is never visible to users at all.

slimsag commented 4 years ago

Update:

I have now informed both customers this issue can be worked around by running the following in the relevant repository:

cd ~/sourcegraph/.data/repos/$REPOSITORY
git config sourcegraph.recloneTimestamp 2147483647

In the case of https://app.hubspot.com/contacts/2762526/company/554338610, it would be most desirable to make these seamless background operations and to handle the case of sourcegraph.recloneTimestamp being exactly 2147483647 (the maximum 32-bit Unix timestamp, i.e. "never reclone") on our end, so they do not need to rerun any further commands.

In the case of https://app.hubspot.com/contacts/2762526/company/554275594, their repository CANNOT be recloned because they use custom git commands and copy their repository directly into Sourcegraph's containers, so a background disk-copy (not a clone) + git gc + atomic swap seems most suitable here.
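
If someone picks this up, that disk-copy + git gc + swap flow could be sketched roughly as follows (the shell-out approach, function name, and directory arguments are all illustrative, not existing Sourcegraph code):

```go
package cleanup

import "os/exec"

// gcCopyAndSwap sketches the "background disk-copy + git gc + atomic swap"
// idea for repos that cannot be recloned. A real version needs careful error
// handling and cleanup of tmpDir/oldDir, and must preserve whatever was
// copied into the repo by hand.
func gcCopyAndSwap(repoDir, tmpDir, oldDir string) error {
	// 1. Copy the live clone aside; the original keeps serving traffic.
	if err := exec.Command("cp", "-a", repoDir, tmpDir).Run(); err != nil {
		return err
	}
	// 2. Compact the copy.
	if err := exec.Command("git", "-C", tmpDir, "gc").Run(); err != nil {
		return err
	}
	// 3. Swap the compacted copy into place (mv repo old; mv tmp repo).
	if err := exec.Command("mv", repoDir, oldDir).Run(); err != nil {
		return err
	}
	return exec.Command("mv", tmpDir, repoDir).Run()
}
```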

keegancsmith commented 4 years ago

@dadlerj Can we get the actual logs from gitserver from when this happened? See below; we shouldn't be deleting the repository, so I just want to make sure it isn't something else that is happening.

> Add ability to schedule janitor jobs for off-hours

This should only apply to some jobs: some jobs need to run regularly to detect bad repos, etc. I agree that our maybeReclone could run on weekends only. This is also a relatively easy change to make.
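
A minimal sketch of what gating only the heavy reclone work to off-hours might look like (the offHours rule and the runJanitor shape are hypothetical, not current Sourcegraph behaviour):

```go
package cleanup

import "time"

// offHours is a hypothetical predicate for "safe to run heavy janitor work":
// weekends plus a nightly window. A real implementation would make this
// configurable per deployment (and possibly per repo, as requested above).
func offHours(t time.Time) bool {
	switch t.Weekday() {
	case time.Saturday, time.Sunday:
		return true
	}
	return t.Hour() < 6 // e.g. 00:00-06:00 local time
}

// runJanitor keeps cheap checks (e.g. corruption detection) on every pass and
// defers only the expensive reclone work to off-hours.
func runJanitor(now time.Time, detectCorrupt, reclone func()) {
	detectCorrupt()
	if offHours(now) {
		reclone()
	}
}
```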

> I have now informed both customers this issue can be worked around by running the following in the relevant repository

Thanks for this, that should work. However, we shouldn't be deleting the repository; see below. I agree, though, that in the case of custom setups this behaviour is undesirable. What do you propose we do instead? Just rely on git gc?

> Right now, the code @keegancsmith had put in place makes this happen every 45 days ± an 11.5-day jitter we added after we realized purging all repos at once was not a good idea.

FYI this was always what we did. There was a regression (which lasted for many months before we noticed) that ended up making the jitter ineffective. That has since been fixed.

> we could clone the repository, then do an atomic FS swap of the directories, so the recloning is never visible to users at all

We do exactly that already. Follow the path down from where we call cloneRepo in the janitor job (note at this point the repo still exists). We pass in Overwrite: true. Note we clone to a tmp location then do the swap as follows:

  1. mv repo old
  2. mv tmp repo

I suppose it is possible there are bugs here? E.g. if we fail between steps 1 and 2, we would just kill the repo. If that happens, there should be logs about the "maybe reclone" cleanup failing that would confirm it.
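
For reference, a sketch of that two-rename swap and its failure window (function and directory names are illustrative, not the actual cleanup code):

```go
package cleanup

import (
	"fmt"
	"os"
)

// swapInNewClone mirrors the two-step swap described above: move the live
// clone aside, then move the freshly cloned tmp dir into place. If the second
// rename fails after the first succeeded, the repo is briefly missing, which
// is exactly the failure window mentioned above, so a real implementation
// should log loudly and attempt a rollback.
func swapInNewClone(repoDir, tmpDir, oldDir string) error {
	if err := os.Rename(repoDir, oldDir); err != nil { // 1. mv repo old
		return fmt.Errorf("move existing clone aside: %w", err)
	}
	if err := os.Rename(tmpDir, repoDir); err != nil { // 2. mv tmp repo
		_ = os.Rename(oldDir, repoDir) // best-effort rollback
		return fmt.Errorf("move new clone into place: %w", err)
	}
	return os.RemoveAll(oldDir) // discard the previous clone
}
```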

There are two places we delete repos:

  1. If we think a repo is corrupted, we just remove it (maybeRemoveCorrupt).
  2. If repo-updater thinks the repo doesn't exist, it tells gitserver to remove it. It should only do this on weekends.

Either way there should be logs explaining it. From the previous discussions it seems we are fairly certain this is due to maybe reclone.

dadlerj commented 4 years ago

Ah, I'm not sure if that's definitively the source of the issue. I defer to @slimsag on this.

keegancsmith commented 4 years ago

So the issue was that when we are recloning, we grab a "clone lock" on the directory. When running a git command we would check the clone lock, and if someone held it we would return "clone in progress", even though the existing clone was still there to be used. So while recloning, the repo would appear unavailable as reported above, even though our reclone logic does the cloning in a tmp dir. I've sent out a PR to fix that! https://github.com/sourcegraph/sourcegraph/pull/10663
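
In spirit, the fix stops treating "clone lock held" as "repo unavailable" and checks the disk instead; a rough sketch under that assumption (the helper names and the bare-repo HEAD check are mine, see the PR for the real change):

```go
package server

import (
	"os"
	"path/filepath"
)

// repoCloned reports whether a usable clone already exists on disk
// (assumes a bare-repo layout where HEAD sits at the top of the repo dir).
func repoCloned(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, "HEAD"))
	return err == nil
}

// shouldReportCloning: before the fix, holding the clone lock alone was enough
// to answer "clone in progress"; afterwards, git commands are served from the
// existing clone and "cloning" is only reported when no clone exists yet.
func shouldReportCloning(dir string, cloneLockHeld bool) bool {
	return cloneLockHeld && !repoCloned(dir)
}
```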

dadlerj commented 3 years ago

Reported and requested by https://app.hubspot.com/contacts/2762526/company/693777200

This seems important for any companies with large monorepos.

> Hi Team, we have a question about the gitserver: this morning we saw it reclone an expired repo. Is there a way to configure when or how often it does this cloning? Our monorepo is pretty big, and while it's cloning, users are not able to search it. It took around 30-40 minutes to reclone the repo.

t=2020-11-06T17:56:30+0000 lvl=info msg="recloning expired repo" repo=$REDACTED cloned=2020-09-11T16:05:59+0000 reason=old

keegancsmith commented 3 years ago

This shouldn't happen when recloning; it was a bug that we fixed in #10663, so it has been fixed since v3.16.0. I assume the above company is on a newer version than that. @tsenart, does @sourcegraph/cloud have any capacity to investigate recloning preventing searches?

tsenart commented 3 years ago

@keegancsmith: We can make room, yes. Can you please summarise your latest understanding of the problem and where to begin? Any solutions in mind?

keegancsmith commented 3 years ago

The first thing to do is to try to recreate it. I would recommend updating the git clone code to add an arbitrary time.Sleep, then getting the reclone job to run (you can adjust some timestamps to trigger it).

If you look at the linked PR, the problem was that while recloning we held a "clone lock". This clone lock prevented normal commands from running, even though the clone was available. The PR ensured we only checked the clone lock if the repo was not cloned already.

A guess is some behaviour around EnsureRevision causing a fetch, which relies on getting a clone lock. However, I am unsure.

flying-robot commented 3 years ago

@keegancsmith @tsenart I'll investigate the recloning behavior today.

flying-robot commented 3 years ago

@keegancsmith @tsenart I inserted a time.Sleep(time.Hour) on cmd/gitserver/server/server.go#L821 and forced one of my local repos to be recloned.

Search worked as intended, but navigating into a file (by way of its top-level links, or clicking within results) took me to the interstitial page showing "Cloning in progress - starting clone".

keegancsmith commented 3 years ago

> Search worked as intended, but navigating into a file (by way of its top-level links, or clicking within results) took me to the interstitial page showing "Cloning in progress - starting clone".

Nice find! Educated guess: this may be related to the stuff @tsenart and I want to remove: "GitRepo". I would guess we validate that a repo exists, which speaks to repo-updater asking for info => gitserver says it's cloning.

I suspect consolidating the places where we check clone progress so they don't return anything if the repo dir exists would fix this, even without knowing the proper cause of the code path above. That is how I fixed the issue previously in the exec handler, but we probably need the same sort of check in the other handlers.
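
One way to realize that consolidation would be a single wrapper that every handler goes through before returning a "cloning" response; a hedged sketch (the handler shape, query parameter, and status code are hypothetical, not gitserver's actual API):

```go
package server

import "net/http"

// guardCloning only lets the "cloning in progress" answer escape when no clone
// exists on disk; otherwise the request falls through to the normal handler,
// regardless of which locks the janitor currently holds.
func guardCloning(cloned func(repo string) bool, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		repo := r.URL.Query().Get("repo")
		if !cloned(repo) {
			http.Error(w, "cloning in progress", http.StatusAccepted)
			return
		}
		next(w, r)
	}
}
```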

flying-robot commented 3 years ago

> Nice find! Educated guess: this may be related to the stuff @tsenart and I want to remove: "GitRepo". I would guess we validate that a repo exists, which speaks to repo-updater asking for info => gitserver says it's cloning. I suspect consolidating the places where we check clone progress so they don't return anything if the repo dir exists would fix this, even without knowing the proper cause of the code path above. That is how I fixed the issue previously in the exec handler, but we probably need the same sort of check in the other handlers.

ACK, will investigate and discuss with @tsenart. I also grabbed the git commands that were run for unsuccessful requests while the sleep-delayed clone was in progress, just to confirm the state of the repo on disk:

$ git rev-parse 22c6ec76d9c521247fddc61a09551eb2cc504a00^{tree}
2177089955da338ed39cc1efec92c5a917e96d89

$ git cat-file -t -- 2177089955da338ed39cc1efec92c5a917e96d89
tree

$ git rev-parse main^0
22c6ec76d9c521247fddc61a09551eb2cc504a00

$ git ls-tree --long --full-name -z 22c6ec76d9c521247fddc61a09551eb2cc504a00 -- TOML.sublime-syntax
100644 blob ec873612e7b286c9a6069340187ceb757c27f83c    6827    TOML.sublime-syntax

$ git rev-parse 22c6ec76d9c521247fddc61a09551eb2cc504a00^0
22c6ec76d9c521247fddc61a09551eb2cc504a00

emchap commented 3 years ago

@sourcegraph/core-application Wanted to alert you to this issue since it's on a board I'm not as familiar with, and @keegancsmith mentioned it would be your issue now in the conversation at https://sourcegraph.slack.com/archives/C0W2E592M/p1626104961005400.

github-actions[bot] commented 2 years ago

Heads up @jplahn @dan-mckean - the "team/repo-management" label was applied to this issue.