hound-search / hound

Hound's git usage is bad in multiple ways. Solutions inside #249

Open avar opened 7 years ago

avar commented 7 years ago

There are multiple bugs in how hound uses Git, and it often does bad things for no reason. There are also multiple outstanding issues & PRs which try to build on this bad behavior rather than fixing the underlying issues.

Some of those issues/PRs are:

So, looking at what the git driver does:

The issues with this are that it shouldn't use --depth, as noted in my #207, and that it hardcodes the master branch. Furthermore, the --no-tags option, combined with how the clone is done, doesn't do what the author intended: we clone all the tags initially but then never update them, so e.g. with gc/repack we still have to maintain all those stale tags.

I've hacked houndd locally to use a git wrapper script which fixes up its bad git usage. The way this works is:

This way we clone whatever branch HEAD points to on the remote side, e.g. master, trunk or whatever. Right after the clone we delete all the tags; they won't be fetched again due to the --no-tags tagOpt.

There's no reason to supply any arguments to fetch, since the ref info set up by the clone takes care of all that. Nor, as noted in #207, should we use the inefficient --depth=1, and there's no reason for --no-tags since it's in our config at this point.

We then reset to @{u} (the upstream of the checked-out branch) rather than a hardcoded master, so this works whatever the HEAD branch is.
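The steps described above can be sketched as two shell functions. This is only an illustration of the sequence, not what hound's git.go does: the names `hound_clone` and `hound_update` are made up for the example, and the tag cleanup uses `git update-ref --stdin` rather than one `git tag -d` call per tag.

```shell
#!/bin/sh
# Sketch only: hound_clone and hound_update are hypothetical helper names.

# Clone whatever branch the remote HEAD points to, then drop all tags.
hound_clone() {
    url=$1
    dir=$2
    git clone --quiet --single-branch "$url" "$dir" || return 1
    # Stop future fetches from following tags.
    git -C "$dir" config remote.origin.tagOpt --no-tags || return 1
    # Delete the tags the initial clone brought in; a no-op if there are none.
    git -C "$dir" for-each-ref --format='delete %(refname)' refs/tags |
        git -C "$dir" update-ref --stdin
}

# Update an existing clone: argument-less fetch, then reset to the upstream.
hound_update() {
    dir=$1
    git -C "$dir" fetch --quiet || return 1
    git -C "$dir" reset --quiet --hard '@{u}'
}
```

Calling `hound_clone "$URL" "$DIR"` once and `hound_update "$DIR"` on every poll should leave only the single branch ref and no tags in the checkout, whatever the remote's default branch is called.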

The wrapper script I'm using is the following. It's slightly more complex because it works both before & after d99d1db. The insteadOf line is specific to my site; for reasons I won't go into, I'm munging the repo targets.

#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:all);

my $orig_args = "@ARGV";
my $args = $orig_args;

# Because of https://github.com/etsy/hound/issues/207, and
# --single-branch is what we actually want.
$args =~ s/clone.*?\K--depth 1/--single-branch/;
# We also need to handle pull, see
# https://github.com/etsy/hound/commit/d99d1db
$args =~ s[(?=^(?:clone|pull|fetch))][-c "url.http://git.example.com/git/.insteadOf=ssh://git.example.com/gitroot/" ];
$args =~ s/pull$/fetch/;
# ... and handle the new bad fetch & reset commands
$args =~ s/ fetch\K.*//; # No need to give fetch *any* args
$args =~ s/^reset\K.*/ --hard \@{u}/;

# sudo tail -f /var/log/messages | grep hound-gitwrapper
system(
    "logger",
    "-t",
    "hound-gitwrapper",
    "git with args <$args>" . ($args ne $orig_args ? " (munged from <$orig_args>)" : ""),
);

system "/usr/bin/git $args";

# NOTE: I am intentionally not using /usr/bin/git here, but git, so
# this gets fed into this same script again for syslogging!
if ($orig_args =~ /^clone /) {
    my ($repo_path) = $args =~ m[ (vcs-[0-9a-f]+)];

    system "git --git-dir=$repo_path/.git config remote.origin.tagOpt --no-tags";
    # Will succeed if there are no tags since -l will return an empty list
    system "git --git-dir=$repo_path/.git tag -l | xargs /usr/bin/git --git-dir=$repo_path/.git tag -d";
} elsif ($orig_args eq 'pull') {
    system "git reset --hard \@{u}";
}

I'm running houndd via supervisor with environment = PATH="/usr/lib/houndd/bin:/usr/bin", and this script dropped in as git in /usr/lib/houndd/bin. That works for me: it fixes the bug with not cloning repos whose main branch isn't master, and it reduces load on our git server since --depth=1 is no longer used. Running for-each-ref on all the repos in the data dir shows that only the main branch ref is being maintained:

$ find . -name '.git' -exec git --git-dir={} for-each-ref \;|grep -v remotes|awk '{print $3}'|sort|uniq -c|sort -nr
    254 refs/heads/master
      2 refs/heads/trunk
      1 refs/heads/frunk

I don't have the will or the Go skills to easily patch git.go, and I need to maintain this wrapper anyway because I'm doing some further magic (dispatching to LB'd git slaves) which won't ever get upstreamed. I still wanted to file this to show what the solution is to almost all the complaints people have with git & hound in the aforementioned issues.

avar commented 7 years ago

Future versions of git will have a git clone --no-tags --single-branch feature. See my patch on the git mailing list. We're in the middle of a release cycle, so I'm not sure when that'll be released.

In the meantime, following tags is not a big deal, so in lieu of the whole tagOpt / tag -l / tag -d dance (which works equivalently) you could just use --single-branch.
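With a git whose clone accepts --no-tags, the clone step collapses to one command. A minimal sketch, assuming such a git is installed; `clone_single_branch` is a hypothetical helper name and not anything hound provides:

```shell
#!/bin/sh
# Sketch only: clone_single_branch is a hypothetical helper name.
# Requires a git whose "clone" understands --no-tags.
clone_single_branch() {
    url=$1
    dir=$2
    # Fetch only the branch the remote HEAD points to, and no tags.
    # --no-tags on clone also records remote.origin.tagOpt=--no-tags,
    # so later fetches keep ignoring tags.
    git clone --quiet --no-tags --single-branch "$url" "$dir"
}
```

On a git without that option, dropping `--no-tags` from the command gives the plain `--single-branch` clone suggested above, at the cost of following tags.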

dmsimard commented 6 years ago

https://github.com/etsy/hound/pull/275 provides the ability to specify a git branch other than the hardcoded "master" and it works for us.

mikepurvis commented 6 years ago

We had a bunch of issues with repos where the remote URL or branch would change, or the branch would have been force-pushed to (I know, I know). I can't represent it as being optimal, but the following has worked without problem across several months of continuous use:

git remote set-url origin $URL
git fetch --prune --no-tags --depth 1 origin +$REF:remotes/origin/$REF
git reset --hard origin/$REF

The one caveat is that I believe this only works for tags and branches, not SHA strings. I haven't found a non-branching solution which would work for that as well.

Implementation:

https://github.com/mikepurvis/hound/blob/c21c82272279f72693588ee6dd70275b884f1ba0/vcs/git.go#L47-L73