History-Preserving Modularization

dabrahams commented 14 years ago

Boost has already been modularized in http://gitorious.org/boost. However, these modules don't bring along the Boost development history.

I think if we want to keep the history, we need to remake each of these repos as a clone of http://gitorious.org/boost/svn, which automatically tracks our SVN repository, and then make the changes required to modularize the repo, so that we can continue pulling changes in from our Git SVN mirror. I think it's important to keep these changes fine-grained: don't move/rename and modify a file in a single commit, so that Git can succeed with later merges. It would be a good idea if each module contained a script (or other record) of the exact steps required to create and update it.
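A minimal sketch of the "move first, modify later" rule, run in a throwaway repository. The file names and layout are made up for illustration and are not Boost's real structure.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)            # throwaway repository; all paths are hypothetical
cd "$repo"
git init -q

mkdir -p boost/foo
printf 'line 1\nline 2\nline 3\n' > boost/foo/bar.hpp
git add -A
git commit -q -m "Import boost/foo/bar.hpp"

# Commit 1: a pure move. The content is untouched, so Git sees an exact rename.
mkdir -p include/boost/foo
git mv boost/foo/bar.hpp include/boost/foo/bar.hpp
git commit -q -m "Move bar.hpp into the modular layout"

# Commit 2: the edit, on its own.
printf 'line 4\n' >> include/boost/foo/bar.hpp
git commit -q -a -m "Modify bar.hpp"

# --follow walks back through the rename to the original import.
git log --follow --oneline -- include/boost/foo/bar.hpp
```

Because the rename commit changes no content, rename detection does not depend on any similarity threshold for that step.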

dabrahams commented 14 years ago

troy remarks:

You're going to have 100-some repositories in your rypplized boost, each of which is >100M in size. I don't think I'm missing anything here.

dabrahams commented 14 years ago

I think I see what you mean, and in the back of my mind I anticipated this problem. I guess one possibility would be to use git-svn with the --ignore-paths option. I wonder if there's also a way to strip out all the irrelevant stuff from the git repo. I'm guessing not :-/

bradking commented 14 years ago

In older versions of boost the modules were not so cleanly separated, right? This means that their development histories were intertwined and only recent versions can be separated.

For full preservation, each repository would have the full history of all boost right up until the modularization point, at which time the beginning of each module would be a commit that moves things into place and removes everything else. This commit would be separate for each module. After that you could merge updated changes from boost svn and Git may be able to follow the renames and resolve the merge (mostly) automatically.
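As a rough illustration of that flow, the sketch below uses a tiny stand-in repository: one commit moves a library into place and removes everything else, and a later upstream change is merged across the rename. The branch names `trunk` and `module-foo` and all paths are assumptions made for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q

# Monolithic layout (hypothetical paths).
mkdir -p libs/foo libs/bar
printf 'foo 1\nfoo 2\nfoo 3\n' > libs/foo/foo.hpp
printf 'bar 1\nbar 2\nbar 3\n' > libs/bar/bar.hpp
git add -A
git commit -q -m "Monolithic state"
git branch -m trunk

# Modularization commit for "foo": move its files into place, remove the rest.
git checkout -q -b module-foo
git rm -q -r libs/bar
mkdir -p include
git mv libs/foo/foo.hpp include/foo.hpp
git commit -q -m "Modularize foo"

# Upstream keeps changing the monolithic layout.
git checkout -q trunk
printf 'foo 4\n' >> libs/foo/foo.hpp
git commit -q -a -m "Upstream change to foo"

# Merging trunk carries the edit across the rename.
git checkout -q module-foo
git merge -q -m "Sync with trunk" trunk
cat include/foo.hpp
```

This only covers the easy case, where upstream touches a file that the module kept.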

Just so I'm clear, Troy's concern is that the pre-modularization history would be duplicated in every module's repository?

This is related to an alternative use case for Git submodules which has been brought up on Git's mailing list many times (in slight variations). Currently the "git submodule" command is meant for a super-repository that refers to sub-repositories that each have completely distinct content. Another valid use case that is not supported right now is when the super-project shares content with the subproject (we have this problem at Kitware because ParaView shares VTK content and does not build without it). Boost modules have this problem with a slight twist: the modules share content historically but not in their modern revisions.

In the shared-content submodule use-case it makes more sense for the objects of the submodule to be stored in the same .git/objects database as the outer project. In the Boost case this would avoid duplicating all the old history in each module's repository while allowing it to appear so conceptually.

I've spent some time investigating how to do this with Git, and it is possible. However, it requires deep understanding of Git and careful manual set up of the work tree (and requires symlinks in some cases). More work is needed in Git proper to provide a nice interface for it. When I get more time to spend on it I plan to propose a solution to Git upstream.

dabrahams commented 14 years ago

git filter-branch is one way to do this. But perhaps even better, we could use the technique described in http://progit.org//2010/03/17/replace.html

So cool!

dabrahams commented 14 years ago

Specifically, see http://progit.org//2010/03/17/replace.html?dsq=41051056#comment-41051056

bradking commented 14 years ago

Yes, for stripping irrelevant stuff from history, filter-branch is the way to go. I'm very familiar with it because I've been using it extensively to do manual cleanup of automatic cvs->git conversion results for permanent one-time conversions. I'm happy to answer any questions about it.
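For readers who have not used it, here is a small sketch of one filter-branch mode, `--subdirectory-filter`, on a made-up two-library repository. It is only an illustration of stripping unrelated history, not the conversion anyone in this thread actually ran.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git versions print a warning before filter-branch runs

repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir -p libs/foo libs/bar               # hypothetical layout
echo 'foo v1' > libs/foo/foo.hpp
echo 'bar v1' > libs/bar/bar.hpp
git add -A
git commit -q -m "Add foo and bar"
echo 'bar v2' > libs/bar/bar.hpp
git commit -q -a -m "Change bar only"
echo 'foo v2' > libs/foo/foo.hpp
git commit -q -a -m "Change foo only"

# Rewrite history so libs/foo becomes the project root.
# Commits that never touched libs/foo are dropped.
git filter-branch --subdirectory-filter libs/foo -- --all > /dev/null

git log --oneline
git ls-tree --name-only HEAD
```

The rewritten commits get new hashes, which is why this suits a one-time conversion better than a repository that keeps syncing with a mirror.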

The "replace" approach looks like a porcelain around Git grafts. I've also used grafts extensively during cvs->git conversion cleanup. Basically you can just put a .git/info/grafts file in your repo. Each line is the hash of a commit followed by the hashes of other commits to pretend are the parents (and ignore the real parents). Note that grafts are meaningful only in the local repository. They can also break push/fetch if a transmitted commit is grafted.

Grafts are certainly a reasonable approach for the modularization-with-history problem. Just be sure that the graft for the first commit in each module points its pretend parent at a commit in the non-modularized history that has the same content for each file. Otherwise Git may not track the renames correctly (it can tolerate some edits with renames but disables this feature when more paths change than some threshold). Unfortunately it is up to each user that wants full history to fetch it from the full repo and add the correct graft locally. Perhaps a script or at least some documentation for each module can specify the proper graft line.
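A sketch of the idea using `git replace --graft`, the replace-based counterpart of a grafts-file line (available in later Git versions than the ones current when this was written). For brevity both histories live in one throwaway repository here; in the real setup the old history would be fetched from the full repo first. Branch names are assumptions.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q

# "Ancient" monolithic history: two commits on a branch named history.
echo 'v1' > foo.hpp
git add -A
git commit -q -m "Ancient commit 1"
echo 'v2' > foo.hpp
git commit -q -a -m "Ancient commit 2"
git branch -m history

# Module history: a new root commit whose files match the tip of history exactly.
git checkout -q --orphan module
git commit -q -m "Module root (same content as history tip)"
echo 'v3' > foo.hpp
git commit -q -a -m "First module-only change"

git rev-list --count module          # 2: the module knows nothing about the past

# Pretend the module root's parent is the tip of the ancient history.
root=$(git rev-list --max-parents=0 module)
git replace --graft "$root" history

git rev-list --count module          # 4: the ancient history now appears attached
```

Like a graft, the replacement is local: it lives under `refs/replace/` and is not transferred by a default clone or fetch.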

dabrahams commented 14 years ago

I was thinking of doing it this way; comments appreciated:

1) Set up a Git mirror of the SVN repo. Fortunately, Troy has already done that and will soon be pushing commits out to http://github.com/ryppl/boost-svn
2) Create a branch for each library
3) In each library branch, delete the irrelevant files and move the remaining ones into the new structure appropriate to the individual library
4) Use that state as a graftable base for the Git module repo
5) To keep in synch, most updates to the mirrored SVN can probably be merged cleanly and automatically into the library branches
6) But how do we get those changes into the Git module repo?

daniel-w commented 14 years ago

As long as the SVN repo is active, no one can really use the modularized git repositories for development anyway, so why is this important? When doing this as a one time thing for the final conversion, we can do it with grafts or possibly just use filter-branch to clean up the history and get rid of unrelated changes. But why do it now? What am I missing?

Also, I'm not sure about (5). Sure, merges MIGHT work, but they will fail when new files are added or old files are deleted. What do we do when that happens? Also, the history will be full of merge commits, one for every sync, which I guess means one for every commit if it's automatic. There doesn't seem to be any point in doing merges here rather than just rebuilding the branch, since the actual modularized repo (which doesn't have any history) would have its branch head reset anyway.
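One concrete way a sync merge can stop, sketched on a made-up repository: upstream edits a file that the library branch deleted, which Git reports as a modify/delete conflict, and a file newly added upstream lands in the library branch even though it belongs to another library. Names and paths are assumptions for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir -p libs/foo libs/bar
echo 'foo v1' > libs/foo/foo.hpp
echo 'bar v1' > libs/bar/bar.hpp
git add -A
git commit -q -m "Monolithic state"
git branch -m trunk

# Library branch for foo: everything else is deleted.
git checkout -q -b module-foo
git rm -q -r libs/bar
git commit -q -m "Modularize foo"

# Upstream then edits a file the library branch deleted and adds a new one.
git checkout -q trunk
echo 'bar v2' > libs/bar/bar.hpp
echo 'baz v1' > libs/bar/baz.hpp
git add -A
git commit -q -m "Upstream work on bar"

git checkout -q module-foo
merge_status=0
git merge -q -m "Sync with trunk" trunk > /dev/null 2>&1 || merge_status=$?

echo "merge exit status: $merge_status"
git ls-files -u        # unmerged entries for libs/bar/bar.hpp
git status --short
```

An automatic sync would need a rule for such paths, for example always re-deleting anything outside the library's own files.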

dabrahams commented 14 years ago

> As long as the SVN repo is active, no one can really use the modularized git repositories for development anyway

Depends what you mean by "development." We can work on the build system, work on the testing system, update CMakeLists.txt files, keep things in synch, etc.

I'm not sure whether the modularized repo would have its branch head reset. But in any case, please make a specific recommendation if you don't like this plan. I can't easily evaluate the consequences of your objections. One of my goals is to not take all of Boost "offline" for a week to do a conversion.

daniel-w commented 14 years ago

> Depends what you mean by "development." We can work on the build system, work on the testing system, update CMakeLists.txt files, keep things in synch, etc.

Right, I meant anything that merges back to the SVN repo.

> I'm not sure whether the modularized repo would have its branch head reset.

It doesn't carry any history, so there's nothing else you can do. There's no commit to merge in, so you'd have to rebuild it from scratch from the modularized source and then graft it on top of the new branch head from the repo that DOES carry history.

> But in any case, please make a specific recommendation if you don't like this plan. I can't easily evaluate the consequences of your objections. One of my goals is to not take all of Boost "offline" for a week to do a conversion.

OK. My point was that maybe we don't have to do anything fancy. Just:

1) Update the SVN mirror from upstream.
2) For each library, rebuild the modularized content and either:
   a) Rebuild the library master head based on the new code, or
   b) Construct a commit from the changes introduced by the new modularization.

When the final conversion happens we can use filter-branch and do a complicated conversion that takes a week, and then just have every branch rebased on top of that.

dabrahams commented 14 years ago

I had never intended to do anything that goes back into SVN.

Everything you wrote after 1) above lacks enough detail to be sure I understand it:

what does "rebuild the modularized content" mean?
what does "rebuild the library master head based on the new code" mean?
what does "construct a commit from the changes introduced by the new modularization" mean?
what kind of complicated filter-branch conversion are you suggesting?
if it takes a week, how will we avoid taking Boost "offline?"

daniel-w commented 14 years ago

Small update. Dave and I talked off-ticket about this, and here's a small drawing to clarify what the original plan was:

1 ---> 2 ---> 3 ---> 4 ---> 5 ---> 6 ---> ...      MASTER BRANCH ON SVN MIRROR
        \      \      \      \      \
         \      \      \      \      \
          A ---> B ---> C ---> D ---> E ---> ...   LIBRARY BRANCH ON SVN MIRROR
          :      :      :      :
          :      :      :      :
          A1 --> B1 --> C1 --> D1                  LIBRARY REPO

(1) is some arbitrary old commit in the repo, perhaps the first commit ever.

(2) is subversion HEAD as of today, the state where we start from.

(A) is the modularized state for HEAD as of today.

(B), (C), etc. are the merges that sync with the updated Subversion state.

(A1), (B1).. are the tree state of (A), (B).. in the new library repository. The dotted lines represent graft relationships.

ericniebler commented 14 years ago

I have asked about our conversion to git on the git mailing list. The thread is here:

http://thread.gmane.org/gmane.comp.version-control.git/150270

The suggestions so far include git-filter-branch, --tree-filter, and svn2git. The latter looks like an interesting suggestion, and is how KDE migrated. Like ours, their migration also included a refactoring into separate repositories. Anybody have any experience with it?

dabrahams commented 14 years ago

> The suggestions so far include git-filter-branch, --tree-filter, and svn2git.

These are in three different categories:

1) git-filter-branch: a git command that writes filtered copies of the commits on a branch in a given repo and resets the branch head
2) --tree-filter: an option to said command
3) svn2git: a tool that converts SVN history to Git history

It seems, whatever else happens, one has to start with #3 or some equivalent.

The only reason I can see to use filter-branch would be to rewrite history so that each library's own repo gets a history that contains only the files owned by that library. However, doing that correctly seems difficult at best and even if we could do it, I don't think the result would be all that useful, because it wouldn't reflect the true nature of pre-modularized boost: there are some tangled dependencies, and occasionally sweeping changes are made by one person across several libraries.

In this thread I've been suggesting that each library's Git history include the un-modularized state and the modularization changes (file moves and deletes that take us from un-modularized to modularized). One way we could do that is to start every library's Git repo from a clone of the SVN mirror, but that would be very inefficient for anyone with multiple boost library repositories on his machine. So instead I think we should have the first commit of every library repo look identical: a snapshot of the latest SVN state (no history). Then, if someone wants to see further back in a given library's history, he can graft on the changes from the final state of the SVN mirror.
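One way to produce such a history-free snapshot, sketched with `git commit-tree` on a stand-in repository: the new commit reuses the tip's tree but has no parents. The commit message and branch name are placeholders.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q

# Stand-in for the SVN mirror's history.
echo 'v1' > foo.hpp
git add -A
git commit -q -m "Old commit 1"
echo 'v2' > foo.hpp
git commit -q -a -m "Old commit 2"

# A parentless commit that reuses the tree of the current tip: same files, no history.
snapshot=$(git commit-tree "HEAD^{tree}" -m "Snapshot of final SVN state")
git branch snapshot "$snapshot"

git rev-list --count HEAD        # 2
git rev-list --count snapshot    # 1
git diff --quiet HEAD snapshot && echo "identical content"
```

For every library repo to start from the same commit hash, that snapshot commit would have to be created once and reused, since the hash also covers author, committer, dates, and message.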

Make sense?

ericniebler commented 14 years ago

I think so, but you and I should get together so I can be sure I'm understanding. IIUC, you'd like to leave ancient (pre-modularized) history in a frozen repository cloned from boost svn. And when you pull down an individual library, you still pull down all of boost, but minus the history. That means pulling down all of boost means pulling down boost >100 times (without history), is that right? I think I'm still confused.

dabrahams commented 14 years ago

You're very close. "You still pull down all of boost" is true in a sense; i.e. the repository would contain a copy of just the latest state of every boost file. However, Git is really efficient at storing things so that probably wouldn't take up much space. Those files wouldn't exist in a typical working copy.

Well, now that you mention it, Boost is quite large, even without the history. I suppose the optimal solution would be to move the graft point forward in time by one commit. So:

The frozen repo adds to the final SVN state a branch from trunk for each library, whose only commit distinct from trunk is that library's modularization step. This branch forms the basis for any graft that might be needed.

Howzat?

ericniebler commented 14 years ago

OK. So pre-modularized boost lives on a server in the sky (github?) and is never downloaded, ever? But users have the option of adding grafts in their local repo pointing to the server in the sky -- in fact, each library would point to its own branch on the server in the sky. Have I got that right?

"Going up to the Server in the Sky. It's where I'm gonna go when I die...." (with apologies to Norman Greenbaum).

I didn't understand the first bit, though. In the scheme as I described it, you said git would only save the pre-modularized boost locally once (because they have the same hash). But that's only if library X and library Y share the same object store on the local machine. If I checked out library X into directory A and library Y into directory B, I'd still get two full copies of boost.
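Separate clones can borrow objects from one local repository through Git's alternates mechanism, which is not something proposed in this thread. A rough sketch with `git clone --reference`; every repository name here is a stand-in.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Stand-ins for the published repositories (hypothetical names).
git init -q history
( cd history && echo 'v1' > foo.hpp && git add -A && git commit -q -m "Ancient history" )
git init -q libX-origin
( cd libX-origin && echo 'x' > x.hpp && git add -A && git commit -q -m "libX root" )

# One local copy of the history; the library clone in directory A borrows objects from it.
git clone -q history history-local
git clone -q --reference "$work/history-local" libX-origin A

# The clone records where to look for borrowed objects.
cat A/.git/objects/info/alternates
```

The borrowing clone depends on the referenced repository staying in place, so deleting or pruning `history-local` can break it.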

dabrahams commented 14 years ago

Whether or not the pre-modularized boost repo gets cloned just depends if anyone is interested. But grafts don't "point at servers," or we would be able to build the grafts into the original library repos. Developers would have the option of fetching from pre-modularized boost (i.e. pulling one or more branches into their local modularized repo) and grafting the initial commit in their modularized repo onto the tip of one of those branches.

You'd only get two full copies of boost if you decided to graft on history in both repos, but nobody will do that. Grafting is just something you'd do for exploring history on a local repository. Nobody will be pushing boost's history into the master repository of an individual library, so nobody will get that history in their local clone automatically.

ericniebler commented 14 years ago

".. graft on history in both repos..." I don't know what two repos you're referring to. I think we had better save this discussion until we're face to face. It's not getting any clearer for me.

dabrahams commented 14 years ago

For everyone else and the sake of posterity: I mean the repos for X and Y that are in directories A and B.

ghost commented 14 years ago

If I'm understanding how things should work, each library's git repository will contain a branch called 'history' (or something similar) which contains all the pre-modularized boost history, and the library repo's master's history will be rewritten to just have the first commit after the pre-modularized boost have a dummy parent. Now in case anyone wants to view the full history, they will then have to fetch the branch from Github, and then use git replace to link the first commit after the pre-modularized boost to the history's HEAD, which eventually shows the linear history.

I just read http://progit.org//2010/03/17/replace.html?dsq=41051056#comment-41051056 Dave, and unfortunately when you clone a repo, it's going to do just that, clone whatever is in the repo. Maybe compressing the repositories would be an option to help with the large histories, but the Linux kernel development team doesn't seem to mind. Besides, I'm not sure if Github supports aggressive compression of the repositories on their end anyway so any gains with repository compression would only be local.

dabrahams commented 14 years ago

> If I'm understanding how things should work, each library's git repository will contain a branch called 'history' (or something similar) which contains all the pre-modularized boost history,

No, that's exactly what we want to avoid as noted above. 100 boost library clones means storing all of boost's history 100x.

> I just read http://progit.org//2010/03/17/replace.html?dsq=41051056#comment-41051056 Dave, and unfortunately when you clone a repo, it's going to do just that, clone whatever is in the repo.

This is not news to me, which is why I wrote that “It would be interesting if Git had a way...”

> Maybe compressing the repositories would be an option to help with the large histories, but the Linux kernel development team doesn't seem to mind. Besides, I'm not sure if Github supports aggressive compression of the repositories on their end anyway so any gains with repository compression would only be local.

Local is the only important consideration, unless we fear exceeding GitHub's storage limits for free repos.

bradking commented 14 years ago

IIRC, Git does not clone refs outside of refs/heads/ and refs/tags/ by default. You can push the 'history' branch to a non-standard ref in the main repo:

git push origin history:refs/ancient/history

Others that clone the main repo can do

git fetch origin refs/ancient/history:refs/ancient/history

to get the objects, and then add the graft.
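The two commands above can be tried end to end with local stand-ins. In this sketch a bare repository plays the main repo and a `file://` URL stands in for the network transport. All repository names are made up.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Stand-in for the main repo.
git init -q --bare origin.git
git --git-dir="$work/origin.git" symbolic-ref HEAD refs/heads/master

# Maintainer side: publish master normally, history under a non-standard ref.
git init -q maintainer
cd maintainer
echo 'old' > foo.hpp
git add -A
git commit -q -m "Ancient history"
git branch -m history
git checkout -q --orphan master
git commit -q -m "Module root"
git remote add origin "$work/origin.git"
git push -q origin master history:refs/ancient/history
cd "$work"

# User side: a plain clone does not bring the ancient ref along.
git clone -q "file://$work/origin.git" user
cd user
before=absent
git rev-parse -q --verify refs/ancient/history > /dev/null && before=present

# Fetch it explicitly when wanted.
git fetch -q origin refs/ancient/history:refs/ancient/history
after=absent
git rev-parse -q --verify refs/ancient/history > /dev/null && after=present

echo "before fetch: $before, after fetch: $after"
```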

dabrahams commented 14 years ago

Oh, that is pretty cool. Thanks, Brad!

ericniebler commented 14 years ago

1) Set up a live clone of boost svn in a git repo (A) (done)

2) Develop a script that modularizes boost. Test locally against repo (A). The script will:

   a) Clone (A) (locally)
   b) Create a branch for each library. For each branch:
      - delete everything
      - add back the modularized state for that library and Troy's CMake stuff
      - commit!
   c) Add a pretty tag to each branch so we can easily graft to it.
   d) Push the 'history' branch to a non-standard ref (git push origin history:refs/ancient/history)
   e) For each branch, use latest state to initialize a fresh repo for each library
   f) Push each repo to [local clone | github]
   g) Add ryppl metadata to each library pointing to the repos of its dependencies.
   h) Commit each library.

3) Create a boost ryppl project

4) Add ryppl metadata pointing to the repositories of boost libraries

Now:

1) 'ryppl install boost' will pull down boost and its libraries.
2) Optional: fetch ancient history branch and add grafts for each library.
3) ???
4) Profit!
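A very rough sketch of the per-library part of such a script, on a stand-in mirror with two made-up libraries. The branch names, tag names, target layout (`include/`), and output directory are all assumptions. The ryppl metadata, CMake files, and push steps are left out because their details are not given here.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# (A): stand-in for the Git mirror of boost svn, with a hypothetical two-library layout.
git init -q A
cd A
for lib in foo bar; do
  mkdir -p "libs/$lib"
  echo "// $lib" > "libs/$lib/$lib.hpp"
done
git add -A
git commit -q -m "Monolithic boost"
git branch -m trunk

for lib in foo bar; do
  # One branch per library, holding only that library's modularized state.
  git checkout -q -b "module-$lib" trunk
  git rm -q -r .                                             # delete everything
  mkdir -p include
  git show "trunk:libs/$lib/$lib.hpp" > "include/$lib.hpp"   # add back the modularized state
  git add -A
  git commit -q -m "Modularize $lib"
  git tag "modularized-$lib"                                 # graft target for later

  # Fresh repo for the library, initialized from the branch's latest state only.
  mkdir -p "$work/repos/$lib"
  git archive "module-$lib" | tar -xf - -C "$work/repos/$lib"
  ( cd "$work/repos/$lib" && git init -q && git add -A && git commit -q -m "Initial modularized state of $lib" )
done

git tag --list
```

Each fresh repo starts with a single root commit, and the matching tag in (A) marks the commit a user would graft that root onto.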

dabrahams commented 14 years ago

Cool! What tool did you use to create the live clone and how are you keeping it in sync?

ericniebler commented 14 years ago

Ha! You misunderstand. This is my TODO list. I was asking for feedback about whether these are the right things to do, and if they're in the right order.

dabrahams commented 14 years ago

OK, but what does “done” mean in

> set up a live clone of boost svn in a git repo (A) (done)

ericniebler commented 14 years ago

troy did that already.

dabrahams commented 14 years ago

Oh, yeah, but it's incomplete IIRC. Only tracks trunk and release, right?

bradking commented 14 years ago

I suggest creating a history repository on github for the "history" branch as a normal head. Then fork that to create each individual library repository. After forking, then move the history branch to refs/ancient/history. Finally, leave only the modularized history in each repo's refs/heads/.

This approach should help github re-use disk space for all the ancient history objects. It will also provide a first-class historical reference repository. However, I'm not sure off the top of my head what other effects on the apparent organization that might have.

ericniebler commented 14 years ago

OK, this sounds good. Thanks Brad.

ghost commented 14 years ago

That's a neat idea, thanks for sharing Brad!

dabrahams commented 14 years ago

True, it's a neat idea, but after some consideration I'm not sure we get much of an advantage by having ancient history in each library's repo. The user is going to have to fetch those commits explicitly and make a graft either way, i.e. the average user will need instructions. I don't think those would be simplified much by not having to reference the ancient history remote.

bradking commented 13 years ago

Agreed. It makes more sense to fetch directly from the ancient history repository when the history is needed. Here is the approach I mentioned during our meeting a couple weeks ago.

A graft is just a line in the local ".git/info/grafts" file with the format

  # A -> B (this line is a comment)
  aaaaaaaa bbbbbbbb

where "aaaaaaaa" stands for the full 40-character hexadecimal SHA-1 of commit A, "bbbbbbbb" stands for the full 40-character hexadecimal SHA-1 of commit B, and the goal is to pretend that B is the parent of A. The real parents of A are ignored.

In our use case, commit A is the root commit of one Boost module, and commit B is a commit in the monolithic ancient history. Somewhere we must provide the graft file entry for A -> B to users of the module. We cannot provide it in the tree object of commit A because the entry must be aware of A's commit hash. Since the entry is useless without a copy of the ancient history objects anyway, I suggest we provide the graft entries somewhere in the ancient history repository.

The set of modules that can be extracted from the monolithic source is finite and known, so the graft entries will be a one-time addition to the ancient history repository after modularization. We can just put the graft entries for all the modules into one file. Extra entries will not hurt, and it gives us a single blob object that can be used as .git/info/grafts for any module's repository. The procedure to fetch and graft history for any module can be just:

 $ git remote add history git://somewhere/boost-history.git
 $ git fetch history
 $ git show history/master:grafts > .git/info/grafts

This assumes that the "master" branch of the history repository has a tree object containing a "grafts" file with appropriate entries. I suggest we construct this as a single commit on the pre-modularized end of history that removes all the files and adds the grafts file and a README. This guarantees that anyone who fetches the master branch from boost-history.git to get the grafts file will get the history too.
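A sketch of both sides with local stand-ins: building the final "grafts" commit in a history repository, then running the three user commands against it. The repository names and the local path used in place of git://somewhere are assumptions. Newer Git versions mark the grafts file as deprecated in favor of `git replace`, so the sketch stops at installing the file and does not rely on how a given Git version interprets it.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Module repo (stand-in): its root commit is commit A of the graft entry.
git init -q module
( cd module && echo 'v1' > foo.hpp && git add -A && git commit -q -m "Module root" )
A=$(git -C module rev-parse HEAD)

# History repo (stand-in): B is the last pre-modularization commit.
git init -q boost-history
cd boost-history
echo 'v1' > foo.hpp
git add -A
git commit -q -m "Ancient history"
git branch -m master
B=$(git rev-parse HEAD)

# Final commit: remove all files and add the grafts file (a README would go here too).
git rm -q -r .
printf '# module root -> last monolithic commit\n%s %s\n' "$A" "$B" > grafts
git add grafts
git commit -q -m "Replace content with grafts file"

# User side, inside a module clone.
cd "$work/module"
git remote add history "$work/boost-history"
git fetch -q history
mkdir -p .git/info
git show history/master:grafts > .git/info/grafts
cat .git/info/grafts
```

Fetching `history/master` brings commit B along because B is the parent of the commit that carries the grafts file.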

dabrahams commented 13 years ago

perfect.
