Locally cached git repository

jacobat commented 12 years ago

To avoid downloading the entire repository over and over, it would be nice to be able to keep a local copy.

I've implemented an example of how this could work at jacobat/mina@cf176f74394ce287205a4d5d2b2077aac052aa6d

I'm not sure if there are any issues with the implementation, so for now it's only meant for inspiration.

What do you think?

bluestrike2 commented 12 years ago

I haven't really dug into this yet, but I've been thinking about it, and I'm not sure why we don't just take this idea to its furthest conclusion: something similar to what was described in this blog post a while back. I'll probably work something up for myself, so I figured I'd throw the idea out there.

jacobat commented 12 years ago

What you're suggesting is in essence to keep a git repository in the deployed working directory instead of copying into the working directory, is that correct?

IMHO that's an interesting option. Though I would prefer a copying strategy by default. I'm not really sure why, maybe it's just because that's what I've gotten used to with Capistrano. The git in working directory seems like a more advanced strategy with possibly more pitfalls.

A whole other issue is whether it is actually a good idea to require that git is installed on the server and has access to the master repository at all.

bluestrike2 commented 12 years ago

That's really the reason I wanted to throw the idea out there before I started on anything, and hopefully narrow it down a bit more, even if it winds up just being some forked tweaks for my own needs.

It means a huge change in the default behavior. Even though pretty much all the other tasks stay the same (symlinking across to a shared directory for logs/uploads/whatever, etc.) and your repo lives in a /current directory, there's still a pretty big conceptual difference involved.

During the deploy process, the opportunity for issues with in-flight requests ought to be no different than with any other deploy method. In the split second you're checking out the new files once they've already been fetched, your OS is still caching disk I/O, so I'd think the opportunity for a royal problem here approaches zero. The biggest delay between checkout and restart takes place during the symlink period, which is where Mina's approach really shines regardless. The caveat is that I may be missing something else here, which is the assumption I tend to default to (or try to, at least) until I'm proven wrong by circumstance.

I'd also probably stick some additional ignore patterns in $GIT_DIR/info/exclude to make sure there are no issues with the symlinked items in particular. That way there's no worrying about .gitignore differences between the server and, well, everybody else.
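A minimal sketch of what that could look like, assuming the symlinked paths are known up front. The helper name `add_server_excludes` and the example patterns are made up for illustration and are not part of Mina:

```shell
# Hypothetical helper, not part of Mina.
# add_server_excludes GIT_DIR PATTERN...
# Appends each PATTERN to GIT_DIR/info/exclude unless it is already listed.
add_server_excludes() {
  git_dir=$1
  shift
  mkdir -p "$git_dir/info"
  for pattern in "$@"; do
    # Skip patterns that are already present so repeated deploys don't pile up duplicates.
    if ! grep -qxF -e "$pattern" "$git_dir/info/exclude" 2>/dev/null; then
      printf '%s\n' "$pattern" >> "$git_dir/info/exclude"
    fi
  done
}
```

Something like `add_server_excludes /path/to/current/.git /log /public/uploads` would then keep those server-only paths out of `git status` without touching the tracked .gitignore.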

As for the server itself, on that front you definitely have a point. But I can't really think of an instance where I haven't had git on production servers in quite some time. If not for deployments, then for handling repos where I'm compiling from source (here's talking to you, nginx), etckeeper as a bare minimum for tracking conf changes, or heck, just making sure I'm never too far from my beloved little zsh prompt :D.

Access control could be solved with ssh agent forwarding I'd imagine. Before that's in place, maybe just relying on deploy keys if using Github/etc. or if not simply tweaking one's git server for read-only access for servers in question?

Anyhow, I'm interested in what others think on this. I'm an impatient asshole on my best days (I seriously hate sitting at red lights...), so it's possible I'm a bit too focused on wringing out every extra second in my deploy process at the expense of seeing anything else. Those moments I usually need someone to reach up and slap me into normal person land.

jacobat commented 12 years ago

I think this is an interesting conversation and on a big picture note it makes me think that it would be great if Mina would be two things:

A core that provides the script aggregation along with a set of generic tools and data
An extension mechanism that allows for different strategies based on needs and wants

I'm probably overthinking things and I'm not sure if it's worth the effort, but it would be nice to have a really good toolbox to build my deployment procedure around. It is of course harder to do it this way than just hacking together a single use case, but I think in the long run Mina would benefit greatly from this approach.

That said...

> During the deploy process, the opportunity for issue with current requests ought to be no different than with any other deploy method (in the split seconds you're checking out the new files once they've already been fetched, your OS is still caching disk i/o -- so I'd think the opportunity for a royal problem here approaches zero), with the biggest delay between checkout and restart taking place during the symlink period, which is where mina's approach really shines regardless.

I think you can get around this issue by having two copies of the repository. I don't remember where I read about this pattern, but basically you have a green and a blue copy: when green is current you deploy to blue, and vice versa. I believe that would still give all the benefits of your approach.
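A rough sketch of the blue/green idea, assuming both copies live next to a `current` symlink inside the deploy directory. The function names and the directory layout are assumptions for illustration only:

```shell
# Hypothetical helpers; the blue/green/current layout is an assumption.

# inactive_slot DEPLOY_DIR
# Prints whichever of "blue" or "green" the `current` symlink does NOT point at.
# If `current` does not exist yet, "blue" is chosen for the first deploy.
inactive_slot() {
  case "$(readlink "$1/current" 2>/dev/null)" in
    blue|*/blue) echo green ;;
    *) echo blue ;;
  esac
}

# switch_slot DEPLOY_DIR SLOT
# Repoints the `current` symlink at SLOT (a relative link inside DEPLOY_DIR).
switch_slot() {
  ln -sfn "$2" "$1/current"
}
```

A deploy would ask `inactive_slot` which copy to update, fetch and check out there while the other copy keeps serving requests, and only then call `switch_slot` and restart.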

> I'd also probably stick some additional ignore patterns in $GIT_DIR/info/exclude to make sure there are no issues with the symlinked items in particular. That way there's no worrying about .gitignore differences between the server and, well, everybody else.

This is part of the reason I dislike having the git repository as the working directory. At least I am having trouble imagining all the failure scenarios.

> But I can't really think of an instance where I haven't had git on production servers in quite some time.

I think we're getting sidetracked here ;-) Let's just agree that there are multiple ways of managing servers and if Mina could support different ways of doing that it would really be great.

bluestrike2 commented 12 years ago

What? Multiple ways? Have you taken a look at the linux world? Since the dawn of the prompt (nay, man), it's been a well known fact that we each have our own ways of managing servers and ours are always better than everybody else's! Kidding aside, sorry if it sounded like I was brushing that aside - you're absolutely right and I agree (it just didn't sound like it).

The extension idea is actually really straightforward here from my perspective, given that Mina is really just a series of rake tasks. Anyhow, I'm going to go off and play a bit later with some ideas on this.

rstacruz commented 12 years ago

We've recently implemented a "locally cached Git repository", quite similar to what's described here. See #10 for discussions and implementation.

(Disclaimer: Please excuse me if I missed a few of your points; it's a little late here, and I'll review your comments further later.) Skimming through the discussion, it seems the debate here is whether the deployed release should be a working Git repository, or whether we should keep to the "copying strategy" that Capistrano, Vlad (actually vlad-git), and Mina currently use.

We've done this in a way similar to how Vlad does it, but a little simpler. A bare Git repository clone is kept in #{deploy_to}/scm/, which is considered the "cache".

A deploy will either create this clone, or fetch new objects into it if it exists. Once that succeeds, it clones from this bare repository into the current build directory.

For reference, Vlad's strategy is to create a working repo (rather than a bare one), updating it as needed, and exporting the contents into the deployed release path. This, I think, fails in the case of having non-fast-forward commits, the most common case being amending a commit in your development copy, and force-pushing it into the remote repo -- something that some people do often in hotfixing.

rstacruz commented 12 years ago

For everyone's quick reference, here's a semi-pseudo-code snippet of the updated Git deploy procedure:

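# Creates the `./scm` cache repo, or updates it if it already exists.
# (`#{...}` is Ruby interpolation, filled in before the script runs.)
# A bare clone has an `objects` directory, so that is used as the existence check.
if [ -d "#{deploy_to}/scm/objects" ]; then
  echo "-----> Updating repository with new commits"
  # Subshell so the `cd` does not leak into the clone step below.
  (cd "#{deploy_to}/scm" && git fetch "#{repository!}" "#{branch}:#{branch}" --force)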
else
  echo "-----> Cloning the Git repository"
  git clone "#{repository!}" "#{deploy_to}/scm" --bare
fi

# Gets the files into the current build in `./`.
echo "-----> Using git branch '#{branch}'"
git clone "#{deploy_to}/scm" . --depth 1 --recursive --branch "#{branch}"

I've been battle-testing this by doing crazy things (like switching branches, amending/deleting commits, rebasing) and it seems to be holding up just fine. Please test and scrutinize away. :)
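For anyone who wants to poke at this outside of Mina, here is a rough plain-shell sketch of the same two steps. The function names `update_cache` and `checkout_build` and their arguments are made up for illustration, the Ruby `#{...}` interpolations are replaced with shell parameters, and this is not the code Mina ships:

```shell
# Hypothetical helpers; names and arguments are illustrative, not Mina's API.

# update_cache REPO CACHE_DIR BRANCH
# Creates the bare cache clone, or force-fetches BRANCH into it if it exists.
update_cache() {
  repo=$1 cache=$2 branch=$3
  if [ -d "$cache/objects" ]; then
    echo "-----> Updating repository with new commits"
    # --force lets the cached branch move even after a non-fast-forward push
    # (for example an amended commit that was force-pushed).
    git --git-dir="$cache" fetch --force "$repo" "$branch:$branch"
  else
    echo "-----> Cloning the Git repository"
    git clone --bare "$repo" "$cache"
  fi
}

# checkout_build CACHE_DIR BUILD_DIR BRANCH
# Clones BRANCH from the cache into BUILD_DIR (which must be empty or absent).
checkout_build() {
  cache=$1 build=$2 branch=$3
  echo "-----> Using git branch '$branch'"
  git clone --recursive --branch "$branch" "$cache" "$build"
}
```

The sketch leaves out `--depth 1`, since git warns that the option is ignored when cloning from a plain local path.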

jacobat commented 12 years ago

I think a better option when getting the files into the current build would be to use git archive. This has two advantages: first, it does not create a git repository in the build directory; second, it enables excluding files from the current build:

git archive --remote #{deploy_to}/scm #{branch} | tar -x -f - --exclude=spec --exclude=big-dump.sql

I'm not sure, but I would expect it to be slightly faster as well as it doesn't have to create the git repository - and potentially write big blobs of data to disk.

rstacruz commented 12 years ago

Cool. I just assumed it wouldn't work on bare repos.

I'll look into it. There's the issue of submodules needing --recursive, but there might be a way around that...

jacobat commented 12 years ago

I'm assuming you're using submodules. I haven't heard talk of submodules in the community in a long time though, so I guess that most people are not using them.

rstacruz commented 12 years ago

I don't, except on my Vim scripts. ;) However, I just don't want to assume no one uses them on their projects.

I do think it'd be fair to have a :use_submodules setting, and have Mina go the git archive way if it's false and the git clone --recursive way if it's true. It'd be a substantial 3x speedup after all.
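As a sketch of how such a switch might behave, assuming the cache is the bare repository in `scm`. The function `export_release` and its arguments are hypothetical, and the last argument merely stands in for the proposed :use_submodules setting:

```shell
# Hypothetical helper; not Mina's implementation.
# export_release CACHE_DIR BUILD_DIR BRANCH USE_SUBMODULES
export_release() {
  cache=$1 build=$2 branch=$3 use_submodules=$4
  if [ "$use_submodules" = "true" ]; then
    # Full clone so that submodules can be initialized recursively.
    git clone --recursive --branch "$branch" "$cache" "$build"
  else
    # Plain export: no .git directory ends up in the build.
    mkdir -p "$build"
    git --git-dir="$cache" archive "$branch" | tar -x -f - -C "$build"
  fi
}
```

Calling `export_release "$deploy_to/scm" ./build master false` would take the git archive path, while passing `true` falls back to the recursive clone.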

zenom commented 12 years ago

I can say we do use submodules on one project. It is a mongo model directory that is shared amongst different Python projects. We use Capistrano to deploy it right now. While not as common due to gems, I don't think submodules should be overlooked.

rstacruz commented 12 years ago

Looks pretty good right now, all pending issues stated here have been resolved afaik. Go and try v0.2.1.

mina-deploy / mina

Locally cached git repository #28