laughinghan / git-subhistory

Interchangeably merge in and split out subtree history
Performance #3

Open sergeylukin opened 9 years ago

sergeylukin commented 9 years ago

Hi, I like the idea of the subrepo being absorbed into the main repo that you followed in git-subhistory. I have been using git-subtree for a while to manage 1 shared repo among 3 independent ones. Before that I used git-submodule, but I didn't like it at all. My shared repo has 560 commits and the independent repos have around 1-2k each. Since I passed 1k commits I have noticed a performance issue: pushing a shared commit from one of the main repos locally now takes around 10 sec. Fetching updates from the shared repo is actually OK, around 1-2 sec. locally.

How about git-subhistory, is it any better performance-wise? I'm sorry for not checking it myself before asking. I will definitely set up a benchmark myself eventually.

Thanks

laughinghan commented 9 years ago

Thanks for your interest! Are the repos you refer to publicly available anywhere to test?

git-subhistory as currently implemented definitely performs atrociously, even worse than git-subtree. I believe the strategy should be able to perform quite well with some simple caching, though: in particular, its running time should usually be linear only in the number of commits created since the last time you ran git-subhistory, independent of the total number of commits in the repo (hence, it shouldn't experience the problem I think you're describing).

However, I've been toying with an alternative idea (I'll need a new name :angry: ) that's slightly less stateless, using the commit-embedded-in-a-tree format that submodules use, that would have identical performance to ordinary git usage. (I'd be happy to explain further if you want to hear about it. This idea is inspired by my friend jwmerrill's assertion that the only problem with submodules is the tooling.)

Really though, the main problem is that since switching employers, I don't have a practical application to try out my ideas. Hopefully MathQuill will become mature enough to start splitting out subprojects.

Darthholi commented 7 years ago

Dear Laughinghan, I would very much like to hear further explanation of Your idea (if you don't mind me being a person who is newer to Git than You ;) )

laughinghan commented 7 years ago

So basically there's 2 fundamental data model problems with git-subhistory:

My alternative idea (current name: subcommit) would address both of these but is just overall more complicated. Basically, whereas subtree/subhistory have a normal tree embedded in the Main repo at the subproject path, subcommit would be like submodules and have a commit embedded in the Main repo at the subproject path.

Note that arguably this is purely a workflow/tooling change on top of existing submodules, and could in theory be implemented entirely on top of git-submodules and the other basic git commands by writing my own wrappers for all the basic git commands (git-commit/git-merge/git-status/git-diff/git-checkout/git-stash/etc). This is basically just deep integration between submodules and the basic git commands.

This also has the advantage over subhistory in that the path to the subproject could change over time without destroying subproject history (although that's also addressable by sacrificing the "statelessness" of subhistory).

Open questions about subcommit:

Darthholi commented 7 years ago

Hello! Thank You for Your reply - I got myself some nice time to read over Christmas and I am currently studying two projects that build on subtrees - subhistory and https://github.com/ingydotnet/git-subrepo/

I would start from the end of Your post.

Open questions about subcommit:

I am actually surprised by the possibility of actually using submodules, but the con is that lots of wrappers would need to be coded, as you wrote.

My opinion from the user point of view:

How do we manage remotes: doing cd is, I think, the main thing that can be ugly. Imagine many sub-thingies in one repository and the need to cd every time.

2 fundamental problems

Darthholi commented 7 years ago

Please allow me a 'Performance'-related question to understand the whole picture better.

When I make a commit to an uber-project that affects a sub-project, and I have never used subhistory - is it correct that I need to subhistory split? (to push and init the subhistory repo remotely...) Later, when I make more commits to the uber-project (again affecting the sub-project) and I did subhistory split previously, do I need to split again, or should I run assimilate?

(This is actually a problem that is present in the 'subrepo' solution. For bigger repos it gets annoying how time-consuming a full filter-branch can get (imagine subhistory split again and again)...)

laughinghan commented 7 years ago

@Darthholi: gosh, thanks for actually reading that huge tome I wrote! I kind of spit it all out without spending that much time editing it down (though I originally wrote way more and moved all that into #4, haha)

git-subrepo

Wow, subrepo is still active and maintained! You know, I saw it on HN a while ago but its README was soooo loooong and I couldn't (and still can't) find a succinct explanation of what it actually does, like my README's "what does this actually do" section with analogy to how git log path/to/sub/ lets you see the history of just stuff in path/to/sub/. Do you think you can explain what it actually does? In particular, does it use commit objects in the tree in the underlying git object model like submodule does?

Re: subcommit

When pulling upstream subproject changes: I would love to see all the commits ever pushed to the server be available. Because when tracking a bug or trying to go back in history, I need to be able to, for example, run tests on a commit-by-commit basis.

I know, right? People who want to squash can always git merge --squash ASSIMILATE_HEAD anyway. (Or the equivalent for subcommit. That ruins the merge base next time they merge from upstream but that's what you get, you deserve it!)

Imagine many sub-thingies in one repository and the need to cd every time.

Well it's only every time you have to do something with a remote for a subproject, not other stuff with subprojects. I have difficulty imagining why remotes for a whole bunch of subprojects would all change at once, right? But you're right it's not great.

Re: subhistory

Allow me to say, [not mapping commit messages] can actually be also a feature, not a bug.

Well I'm not proposing automatically transforming your commit messages by default, you'd enable it if you choose (and you'd have to configure whether the prefix is like sctp: or [sctp] or whatever). Also, do you really want, in your SCTP repo, for half of the commits to be prefixed with sctp: like it means something? Shouldn't the special marking on commits that came from another project have the other project's name in them, like appending a line saying Split from linux repo. or something?

When I make a commit to an uber-project that affects a sub-project, and I have never used subhistory - is it correct that I need to subhistory split? (to push and init the subhistory repo remotely...)

Yes. Split and push to the new GitHub remote and boom, you've initialized a subproject repo.

Later, when I make more commits to the uber-project (again affecting the sub-project) and I did subhistory split previously, do I need to split again, or should I run assimilate?

No assimilate, that's for taking commits to the subproject that weren't in the uber-project, and merging them into the uber-project.

To push these new uber-project commits that affect sub-project, you would split again. The new SPLIT_HEAD will be a fast-forward from your previous split.

Subhistory shouldn't need to filter-branch all of history though (I mean, it does, but it shouldn't need to), and I can't think of any other reason split must get slower as history gets longer. We should be able to cache a mapping from uber-project commits to sub-project commits, and therefore only need to filter-branch the uber-project commits that are new since the last split.
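The caching idea above can be sketched roughly as follows. This is a toy Python model, not git-subhistory's implementation: commits are plain dictionaries, the names `split`, `sub_tree`, `cache`, and `sub` are hypothetical, and the Sub commit ids are made-up hashes rather than real Git object ids. It only aims to show that, given a Main-to-Sub cache, a second split visits just the commits added since the first one.

```python
import hashlib


def sub_tree(tree, prefix):
    """Return the files under `prefix`, with the prefix stripped."""
    return {path[len(prefix):]: blob
            for path, blob in tree.items() if path.startswith(prefix)}


def split(main, head, prefix, cache, sub):
    """Map Main commit `head` to a Sub commit id, reusing `cache`.

    main:  Main commit id -> (list of parent ids, tree as {path: blob})
    cache: Main commit id -> Sub commit id (None before the subproject exists)
    sub:   Sub commit id  -> (list of Sub parent ids, Sub tree)
    Returns (Sub id for head, number of Main commits visited this run).
    """
    visited = 0
    stack = [head]
    while stack:
        cid = stack[-1]
        if cid in cache:            # already mapped by this or an earlier run
            stack.pop()
            continue
        parents, tree = main[cid]
        todo = [p for p in parents if p not in cache]
        if todo:                    # map the parents first
            stack.extend(todo)
            continue
        stack.pop()
        visited += 1
        files = sub_tree(tree, prefix)
        sub_parents = []
        for p in parents:
            sp = cache[p]
            if sp is not None and sp not in sub_parents:
                sub_parents.append(sp)
        if not files and not sub_parents:
            cache[cid] = None       # subproject does not exist yet
        elif len(sub_parents) == 1 and sub[sub_parents[0]][1] == files:
            cache[cid] = sub_parents[0]   # commit did not touch the subproject
        else:
            key = repr((sub_parents, sorted(files.items())))
            sid = hashlib.sha1(key.encode()).hexdigest()[:8]
            sub[sid] = (sub_parents, files)
            cache[cid] = sid
    return cache[head], visited


main = {
    "m1": ([], {"README": "hi"}),
    "m2": (["m1"], {"README": "hi", "lib/a.c": "v1"}),
    "m3": (["m2"], {"README": "hello", "lib/a.c": "v1"}),
}
cache, sub = {}, {}
first, visited_first = split(main, "m3", "lib/", cache, sub)

# One new Main commit arrives; only it should be visited by the next split.
main["m4"] = (["m3"], {"README": "hello", "lib/a.c": "v2"})
second, visited_second = split(main, "m4", "lib/", cache, sub)
print(visited_first, visited_second)
```

In this model the second Sub commit has the first as its only parent, which is the "fast-forward from your previous split" property described above.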

Fascinating that subrepo has this performance problem, still wish I understood what it actually does!

Synthetic merge commits

Funnily enough, I'm actually feeling pretty good about the algorithm for creating synthetic merge commits that I ended up with in #4, it feels like it might actually be pretty much "optimal" in the sense that in every case where there's an obvious right answer to a human about how to merge the Main tree of the synthetic merge commit, this algorithm would pick it, and it would also never pick any obviously wrong answers, all while maintaining the guarantee that, outside the subproject, the Main tree of the synthetic commit is always identical to that of some real, non-synthetic Main commit.

Oh by the way, I thought of a 3rd fundamental data model problem with git-subhistory: changing the path to the subproject in the main project. If any directory there is ever renamed, to say nothing of the subproject being moved to a different directory, then just like git log new/path/to/sub/ would show history stopping at when the subproject was moved to the new path, git-subhistory split (and therefore everything else) would act as if the commit that moved the subproject to the new path was the initial commit of the subproject. There goes the merge bases and fast-forwarding.

This is actually encouraging to me because the way that occurs to me to deal with this is to sacrifice the statelessness and have a mapping from subproject name to path; as long as no commit changes both name and path at the same time, either can change throughout Main repo's history and git-subhistory will still be able to trace the subproject's history correctly.
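A small sketch of why a name-to-path mapping would help, in Python rather than in git-subhistory's shell. Everything here is an assumption for illustration: the per-commit `paths` dictionary stands in for whatever state would record the mapping, and `sub_tree` is a hypothetical helper.

```python
def sub_tree(tree, path):
    """Return the files under `path`, with the path prefix stripped."""
    return {p[len(path):]: blob for p, blob in tree.items() if p.startswith(path)}


# Three Main commits, oldest first; the second one moves the subproject.
history = [
    {"paths": {"sub": "path/to/sub/"},
     "tree": {"main.txt": "x", "path/to/sub/a": "1"}},
    {"paths": {"sub": "new/path/"},
     "tree": {"main.txt": "x", "new/path/a": "1"}},
    {"paths": {"sub": "new/path/"},
     "tree": {"main.txt": "x", "new/path/a": "2"}},
]

# Stateless: always look at the current path, so history seems to start at the move.
by_fixed_path = [sub_tree(c["tree"], "new/path/") for c in history]

# With the mapping: look up the path recorded for each commit.
by_name = [sub_tree(c["tree"], c["paths"]["sub"]) for c in history]

print(by_fixed_path)
print(by_name)
```

With the fixed path, the first commit yields an empty subproject tree, so the move looks like an initial commit. With the per-commit lookup, the subproject tree is unchanged across the move.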

And if we're sacrificing statelessness, the mapping commit messages problem becomes completely tractable! (I'm thinking git-subhistory allows you to specify arbitrary POSIX-compatible sed scripts to transform commit messages back and forth, with helper options to automatically create such sed scripts for either of the common cases of (a) prefix to first line, or (b) append lines.)
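The two common cases, (a) prefix to the first line and (b) appended lines, could look something like this. The proposal above is sed scripts; this sketch uses plain Python string functions instead, and the function names are hypothetical. The `sctp: ` prefix and the `Split from linux repo.` line are the examples mentioned earlier in this comment.

```python
def add_prefix(message, prefix):
    """Case (a): put a prefix on the first line of a commit message."""
    return prefix + message


def strip_prefix(message, prefix):
    """Inverse of add_prefix; leaves messages without the prefix untouched."""
    return message[len(prefix):] if message.startswith(prefix) else message


def add_trailer(message, trailer):
    """Case (b): append a line, separated from the body by a blank line."""
    return message.rstrip("\n") + "\n\n" + trailer + "\n"


def strip_trailer(message, trailer):
    """Inverse of add_trailer; leaves messages without the trailer untouched."""
    suffix = "\n\n" + trailer + "\n"
    if message.endswith(suffix):
        return message[:-len(suffix)] + "\n"
    return message


main_message = "sctp: fix checksum\n\nLonger explanation.\n"

# Main -> Sub: drop the prefix, mark where the commit came from.
sub_message = add_trailer(strip_prefix(main_message, "sctp: "),
                          "Split from linux repo.")

# Sub -> Main: undo both steps.
restored = add_prefix(strip_trailer(sub_message, "Split from linux repo."),
                      "sctp: ")
print(sub_message)
print(restored == main_message)
```

The point of having a transform in each direction is that mapping a commit out and back should give the original message, so the mapped-back commit can be recognized as the same one.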

Together that means that maybe instead of 3 fundamental data model problems, perhaps git-subhistory has 0!

This is really exciting to me, I really want to work on this now but I still have the problem of not personally having a real use case! Do you, @Darthholi, have a use case? What about you, @sergeylukin? Care to elaborate on what your use case(s) are? I'd love to work on this with you guys, you could even take over ownership and I could just advise with my midcore git knowledge.

Darthholi commented 7 years ago

Hello!

Soo - subrepo is actually doing the same as subhistory, from a new-person-like-me point of view. It is splitting the history, but it does have some 'stateness' by introducing .gitrepo config files... I actually have a feeling that the features/issues you talk about are somehow solved there.

But there is also one common problem for both projects - both run filter-branch over the whole project every time (subrepo has an interesting thread, https://github.com/ingydotnet/git-subrepo/issues/142, that is about the same issues as this thread :P ). That is not acceptable for me, so I have selected the project with less code to make it suit my needs: https://github.com/Darthholi/git-subhistory/tree/MC-Devel - it is currently in the phase of "is my bash writing without bugs?", but you can look at the diff to see :)

Longer answer:

Re: subcommit

Squash/not squash - cool, oki! Cd - subrepo does not need cd :P (but it does not know how to not squash...)

#4 and other things

Isn't it all about "do not commit things outside subhistoried folder and inside subhistoried folder at once"? Anyway the cool thing is that I will need more time to even understand the algorithm ;)

... and usecases and my excited talking

I am happy that you do feel encouraged. Aand maaaybe You will come up with an idea how to use subhistory in Your project....?

Actually I would be most happy if you would look at subrepo (i do suggest the wiki and code), take its inspirations or realize in what it is different. .... And to analyze if it is easier to upgrade subhistory or to upgrade subrepo. Both do the same, the differences are deeper maybe in some things like the merging...

I am a bash noob (look at my branch), but if You would come up with really clever thingies (sed on commit messages is cool) for subhistory (or, hell, even for subrepo, I don't care) so that all our ideas in this thread would work, I think I can code them with Your help. Or at least test them on a 600+ commits/Windows project.

But I still do think that you deserve to be the owner.

If subhistory would get some nice .subhist file, where we would be able to save remotes and the other thingies you talk about, and track files moving, and maybe, just maybe, track more than one folder in one subhist (possibly regexed subfolders), I would be happy.

I do have a use case!

Two already git-ed projects, with some common folders where the code is exactly the same. The common folders are not standalone compiled libraries and never will be (we want them to compile with the main project every time). Also I will not create a special repository for the folders; a subtree (filter-branched) in the main repo is enough. Before git, we were copying the contents. With git I am copying them and committing (essentially creating squash commits both ways). I want to have full commit history, but I don't want to wait 2600 seconds for filter-branch :D Also I want to initialize the sub-something (the history of one of the projects is enough).

Darthholi commented 7 years ago

I have solved my bash problems and added a shortcut for future splits - subhistory/start/$newbranch.

Now I need Your wisdom to foresee if there can be any problems with this strategy. (+Also all the things from the previous post :) ) And I will use my own wisdom to get to know what #4 means.

laughinghan commented 7 years ago

I am a bash noob (look at my branch), but if You would come up with really clever thingies (sed on commit messages is cool) for subhistory (or, hell, even for subrepo, I don't care) so that all our ideas in this thread would work, I think I can code them with Your help. Or at least test them on a 600+ commits/Windows project.

But I still do think that you deserve to be the owner.

Cool! I'll respond here first, then take a deeper look at your branch. Less familiarity with Git and shell scripting is fine, just having a collaborator and user is already supremely helpful.

"is my bash writing without bugs?"

Git actually requires shell scripts to be written in a cross-compatible shell syntax that is mostly a subset of POSIX (which I think is a superset of Bourne, but is much smaller than Bash): https://github.com/git/git/blob/master/Documentation/CodingGuidelines

I should probably mention that in the README or something somewhere.

Isn't it all about "do not commit things outside subhistoried folder and inside subhistoried folder at once"?

That's actually no big deal, git-subhistory has no problem splitting out such a commit and mapping back to it (the commit does have to be in HEAD to be split out and mapped back to in the first place, though).

#4 is actually mostly about the edge case of merge conflicts in synthetic merge commits, which can only happen due to criss-crossed merges:

---A---1---M                       ---1'---M'---o [subproj]
        \ /                               /
         X                               /
        / \                             /
---B---2---o---o [master] [HEAD]   ---2'

Here 1 and 2 are fixes to the subproject that were split out as 1' and 2', which were merged to form M'. Now we want to merge subproj back in, including M', so we map 1' and 2' back to 1 and 2 and want to generate synthetic merge commit M. The problem is, what do we do if A and B have conflicting changes outside the subproject? Current answer: we ignore A and B and use the tree on HEAD, since that's guaranteed not to conflict with itself when we merge. Answer proposed in #4: ignore all but first parent of M, ignore tree outside subproject when merging into HEAD.
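The "current answer" described above can be modelled in a few lines of Python. This is only a sketch under assumptions: trees are flat `{path: blob}` dictionaries, `synthetic_merge_tree` is a hypothetical name, and the file contents are invented. It is not taken from git-subhistory's code and does not cover the alternative proposed in #4.

```python
def synthetic_merge_tree(head_tree, merged_sub_tree, prefix):
    """Build the Main tree for synthetic merge commit M.

    Outside the subproject, take whatever HEAD has (ignoring A and B).
    Inside the subproject, take the tree of the Sub merge commit M'.
    """
    outside = {p: b for p, b in head_tree.items() if not p.startswith(prefix)}
    inside = {prefix + p: b for p, b in merged_sub_tree.items()}
    return {**outside, **inside}


head_tree = {"a-Main-thing": "bar", "path/to/sub/a-Sub-thing": "v1"}
m_prime_tree = {"a-Sub-thing": "v1", "fix-1": "x", "fix-2": "y"}

m_tree = synthetic_merge_tree(head_tree, m_prime_tree, "path/to/sub/")
print(sorted(m_tree))
```

Because everything outside the subproject is copied from HEAD, merging M into HEAD cannot conflict outside the subproject, whatever A and B changed there.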

If subhistory would get some nice .subhist file, where we would be able to save remotes,

git-subhistory doesn't need to save remotes, the branch created with -b (subproj in the README example) is a normal branch, you can push and pull subproj to a normal remote. I've significantly revamped the README to use ASCII art illustrations, does the new README make this clearer do you think?

maybe, just maybe, track more than one folder in one subhist (possibly regexed subfolders)

Hmmm, can you explain further? I'm pretty hesitant about this. If you split out this subhistory, do you get multiple commit histories, one per folder, or are the folders merged somehow?

Actually I would be most happy if you would look at subrepo (i do suggest the wiki and code), take its inspirations or realize in what it is different. .... And to analyze if it is easier to upgrade subhistory or to upgrade subrepo. Both do the same, the differences are deeper maybe in some things like the merging...

So I went ahead and read up on subrepo and I think I kind of understand it now, and it turns out the differences are quite deep, even though they're superficially similar. In fact, I think it would be fair to say that the workflow that subrepo and subhistory strive for is pretty much the same, so it's by design that they're superficially similar, but the technical approach is philosophically the stark opposite.

git-subhistory is all about the data model. By doing all this upfront work of ensuring the data model plays nice with Git's elegant commit graph data model (or if you actually want to understand Git's data model, this and this are how I came to appreciate it), all of Git's normal graph traversal algorithms to find appropriate merge bases and determine fast-forwardness etc all Just Work™.

git-subrepo is all about the workflow. By designing the tool to encompass every interaction the workflow may have with the data model, the tool can lie, cheat, and steal when it comes to the data model and enable the desired workflow by any means necessary. This isn't a criticism, it's a tradeoff, obviously git-subrepo is already much more useful than git-subhistory.

The downside is that it necessitates reinventing a custom merge base algorithm, and using rebase to fix the broken history when pushing or pulling. Rebasing like that elides merge commits, for one thing, although it might not be that often you intentionally want to push a commit history including merges. The custom merge base algorithm, at least as currently implemented, can only find one merge base even if there should be multiple due to criss-cross merges as mentioned above; merging based on that has problems compared to merging using Git's default "recursive" merge strategy: http://blog.plasticscm.com/2012/01/more-on-recursive-merge-strategy.html
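To make "multiple merge bases" concrete, here is a small Python sketch over a toy commit graph. It is not subrepo's algorithm or Git's code; `ancestors` and `merge_bases` are hypothetical helpers that compute the best common ancestors, which is the set `git merge-base --all` reports.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(c for c in common
                  if not any(c != d and c in ancestors(parents, d)
                             for d in common))


# Criss-cross: A2 merges B1 into the A line, B2 merges A1 into the B line.
parents = {
    "R": [],
    "A1": ["R"],
    "B1": ["R"],
    "A2": ["A1", "B1"],
    "B2": ["B1", "A1"],
}
print(merge_bases(parents, "A2", "B2"))
```

Here both `A1` and `B1` are merge bases of `A2` and `B2`. A tool that picks only one of them merges against an incomplete picture, which is the situation the recursive strategy handles by first merging the bases with each other.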

I don't think using my ideas to upgrade git-subrepo can work, but I'm pretty optimistic about forging ahead with our nice clean data model.

By the way, do you have a link to where they're discussing not squashing commits when merging? As far as I can tell they don't have anything like assimilate, they do the rebasing with Sub commits not Main commits, and they would need either something like assimilate or merge -s subtree like git-subtree does in order to merge without squashing.

filterbranch 2600 seconds

Oh my! That's definitely unacceptable.

a shortcut for future splits - subhistory/start/$newbranch

Hmmm, my main concern would be what if the current branch is reset to not a fast-forward, but I'll try to understand your code and get back to you.

Darthholi commented 7 years ago

Hi! I love the new readme! Actually I would need the same readme here because I do not understand in "what do we do if A and B have conflicting changes outside the subproject?" the part about conflicting changes outside of subproject. I need an example. What does it mean 'outside'? It can be in the form of unit test which will be needed anyway for the new code I suppose.

doesn't need to save remotes,

Cool! Clever!

track more than one folder in one subhist

Let's say that my library is defined not only by subfolder but by a name scheme (/subfolder/ -> /sub folder/ *). One superproject just needs to use the naming convention but the library is essentially 2 subfolders. It is a feature that is not essential of course :) I just set my fantasy free :)

subrepo - differences are quite deep

Oki, I am happy that you see it! I saw only the shallow similarity. So now I say that I devote myself to subhistory!

The squashing nonsquashing is here - "regarding the squash/no-squash it might be goo to have that optional on a subrepo basis." - https://github.com/ingydotnet/git-subrepo/issues/142

a shortcut for future splits - subhistory/start/$newbranch

I tried to add some checking if we can do it all faster. If you do see some cases when we cannot please do tell me :)

About my usecase:

Actually now it is getting even more interesting! I need to share code in such a way that in repo A the subfolder has UTF-like encoding in .dfm files and in repo B the subfolder has not-utf encoding (C++ Builder, if you guessed). Everything else is the same. I would love to turn both subdirectories into subhistories and merge A to B and B to A and solve conflicts by hand. (First time lots of conflicts, and then every time anything changes in any .dfm.) ... And then I would actually keep both branches as subhistories (in both repos). Any work done to .cpp and .h files would distribute and any work done to .dfm files would throw a conflict which I would solve by hand (sounds bad but will happen just sometimes). Would Your algorithm for #4 make it possible to keep it like this?

(sry for the absence of commas in this text I seem to have a broken keyboard)

laughinghan commented 7 years ago

I do not understand in "what do we do if A and B have conflicting changes outside the subproject?" the part about conflicting changes outside of subproject. I need an example. What does it mean 'outside'? It can be in the form of unit test which will be needed anyway for the new code I suppose.

No, not unit test, I mean like in the Main repo outside the Sub folder. For example (this is upside-down from the criss-crossed merges example I gave above, HEAD is on top rather than on bottom):

                                                                                                                                                                                                  [HEAD]
[initial commit]                                                                                                                                                                                  [master]
o-------------------------------o-------------------------------o-------------------------------o------------------------------------------------------------------------------┳-------------------o
|                               |\                              |                               |                                                                               \ /---------------/|                                [ASSIMILATE_HEAD]
|                               | \                             |                               |                                                                                X                 |                                |
|                               |  \                            |                               |                                                                               / \----------------|-------------------------------\|
|                               |   \---------------------------|-------------------------------|--------------------------------o--------------------------------o------------┻-------------------|--------------------------------o
Add a Main thing                Add a Sub thing                 Set Main thing to "foo"         Fix Sub somehow                  Set Main thing to "bar"          Fix Sub some other how           Merge branch 'set-thing-to-bar'  Merge Sub fixes
 __________________________      __________________________      __________________________      __________________________       __________________________       __________________________       __________________________       __________________________
|                          |    |                          |    |                          |    |                          |     |                          |     |                          |     |                          |     |                          |
|  Files:                  |    |  Files:                  |    |  Files:                  |    |  Files:                  |     |  Files:                  |     |  Files:                  |     |  Files:                  |     |  Files:                  |
|  + a-Main-thing          |    |    a-Main-thing          |    |  ~ a-Main-thing          |    |    a-Main-thing          |     |  ~ a-Main-thing          |     |    a-Main-thing          |     |  ! a-Main-thing          |     |  ! a-Main-thing          |
|                          |    |  + path/to/sub/          |    |    path/to/sub/          |    |    path/to/sub/          |     |    path/to/sub/          |     |    path/to/sub/          |     |    path/to/sub/          |     |    path/to/sub/          |
|                          |    |  +   a-Sub-thing         |    |      a-Sub-thing         |    |      a-Sub-thing         |     |      a-Sub-thing         |     |      a-Sub-thing         |     |      a-Sub-thing         |     |      a-Sub-thing         |
|                          |    |                          |    |                          |    |  +   fix-Sub-somehow     |     |                          |     |  +   fix-Sub-other       |     |  <   fix-Sub-somehow     |     |  <   fix-Sub-somehow     |
|                          |    |                          |    |                          |    |                          |     |                          |     |                          |     |  >   fix-Sub-further     |     |  >   fix-Sub-further     |
|                          |    |                          |    |                          |    |                          |     |                          |     |                          |     |                          |     |                          |
|__________________________|    |__________________________|    |__________________________|    |__________________________|     |__________________________|     |__________________________|     |__________________________|     |__________________________|

When creating the synthetic commit Merge Sub fixes, how do we resolve the conflict in the file a-Main-thing, which one branch changed to "foo" and the other to "bar"?

Let's say that my library is defined not only by subfolder but by a name scheme (/subfolder/ -> /subfolder/*). One superproject just needs to use the naming convention but the library is essentially 2 subfolders.

I still don't understand—what does the separate repo for the subproject look like?

I tried to add some checking if we can do it all faster. If you do see some cases when we cannot please do tell me :)

Hmm, I see. Do you think you can comment on your PR describing at a high level the changes you made in order to do this?

I need to share code in such a way that in repo A the subfolder has UTF-like encoding in .dfm files and in repo B the subfolder has not-utf encoding

Well obviously you're gonna have to pick just one of those encodings for the shared Git commit history between A and B. But you can add a build step to one or both of them that re-encodes the files into whatever you need it to be, I guess?

git-subhistory should already work for that, other than the filter-branch performance problems.

Darthholi commented 7 years ago

I still don't understand—what does the separate repo for the subproject look like?

Sorry, GitHub made my asterisk character look like formatting. So the idea was just that I might want to make a subhistory not only of one subfolder, but of two subfolders (generalizing from /subfolder/asterisk files included in the subhistory to /subfolder asterisk / asterisk ... in the case of subfolder1 and subfolder2 it would include both). They are not two separate subrepositories, it is just that these two folders do contain the code for one library actually. This usecase exists only because the company project manager refuses to copy the files from the two folders into one. As such I would repeat that it is not that important; it can be done as two separate subhistoried folders.

example

Ok. Thank you for the example. I see that merging these two branches would throw a conflict, so assimilating has a hard time too, but for me there is one misleading thing (and I think that I am unclear in expressing how exactly). Does a split happen? Does the split happen at (just after) "Add sub thing"? If yes, then I do not understand how we can "set the main thing to bar" while working on a sub project, i.e. if we are in the split branch. (If we do a split, then the split branch produced by subhistory does not contain things outside sub.) Where is the root of my misunderstanding?

Darthholi commented 7 years ago

Ok, reminding myself we are talking about synthetic commits. I will try to read it all once again :)

laughinghan commented 7 years ago

They are not two separate subrepositories, it is just that these two folders do contain the code for one library actually.

Unfortunately a git commit history has to have just one root folder, although you could probably symlink the two folders into folders in the subhistory folder.

reminding myself we are talking about synthetic commits

Oh, in that example the only synthetic commit is Merge Sub fixes.

Does the split happen at (just after) "Add sub thing"?

No, there were two separate splits. Imagine that first, when the master branch was on Fix Sub somehow, it was split out and pushed to its own repo (or just included in another project). Then second, when the set-thing-to-bar branch was on Fix Sub some other how, that was also split out and was submitted as a PR and merged into that other repo. Finally set-thing-to-bar was merged into master, and now we want to merge in something from the Sub project, which requires assimilating that merge commit.

laughinghan commented 7 years ago

I'm still grokking your PR, but I was thinking about the idea of caching a separate map between Main commits and Sub commits. git push and git fetch can be used to share any ref under refs/ created by git update-ref (which is how sharing Git Notes works, for example), so we might be able to use that to share the map between Main commits and Sub commits. I might have stumbled on an idea to merge my subcommit ideas into this.

Basically instead of being embedded in the tree, the Sub commit corresponding to the Main commit is in a ref, which we ensure is always pushed to and pulled from the remote whenever the corresponding Main commit is: #8