jedbrown / git-fat

Simple way to handle fat files without committing them to git, supports synchronization using rsync
BSD 2-Clause "Simplified" License

git-fat pull performance #48

Open duanem opened 10 years ago

duanem commented 10 years ago

Our git repository has almost 31K commits spanning ten years of development (yes, we imported from another system. :).

git fat pull can take an extraordinarily long time before it even starts rsync-ing files. On a brand-new clone, git fat pull can take several hours. An already up-to-date repository can take 10-20 minutes just to discover that it has nothing to do.

I'd like to encourage discussion about how we can make this process faster.

Perhaps some method of not inspecting every single commit, or keeping track of where the last successful scan ended. Just suggestions; I haven't dug into the code yet.

jedbrown commented 10 years ago

Does git fat status take a similarly long time? What about git fat status HEAD^..?

duanem commented 10 years ago

Some results

| System | `status` | `status HEAD^..` | `pull` |
| ------ | -------- | ---------------- | ------ |
| Linux  | 14.801s  | 2.949s           | 5.084s |
| mingw  | 3m3.490s | 2m24.370s        | 5m5.907s |

I used the time command. For example: time git fat status. The results shown are the real time from the time report.

Linux: Linux 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux mingw: MINGW32_NT-6.1 1.0.18(0.48/3/2) 2013-10-26 21:23 i686 Msys

While this doesn't show the longer times I have been experiencing, 5 minutes is still painfully long to wait. I have experienced times that are much longer.

jedbrown commented 10 years ago

What about git fat pull HEAD^..?

You can build the pipeline yourself outside of git-fat, add instrumentation to git-fat to profile the threads independently, or profile using "perf" (on Linux). I think most of the time is spent in the pipeline built by referenced_objects, which needs to look at a lot of objects when going through complete history. If you have a large repository, you may find that git gc improves performance significantly. The object size is not included in the index file, so it has to seek into the pack file. This is very slow if you have a spinning disk and large packfiles.
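For reference, a pipeline along these lines can be built by hand and timed on its own. This is a sketch of the general shape, not git-fat's exact code, and the 100-byte cutoff is an illustrative stand-in for the real stub-size check:

```shell
# Enumerate every object reachable from any ref, then ask cat-file for
# each object's type and size; keep only blobs small enough to be
# git-fat stub files. (Run inside a repository.)
git rev-list --objects --all \
  | cut -d' ' -f1 \
  | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize)' \
  | awk '$2 == "blob" && $3 < 100 { print $1 }'
```

Timing each stage of this pipeline separately (e.g. with `time` and `pv`) would show whether the traversal or the size lookup dominates.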

Index files are sorted by object name while packfiles are typically sorted with most interesting objects first. It may be that batching up a bunch of objects, sorting them, finding their offsets, then re-sorting by offset before reading would improve performance. But this would be more of a git-core optimization, and let's not jump to conclusions. Do you have time to do some more detailed profiling of where the time is being spent?
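The first half of that batching idea can be tried cheaply from the command line. This is a hypothetical experiment, not something git-fat does today: feed `cat-file` the object names in sorted order (matching the index ordering) instead of traversal order, and see whether the lookups get faster.

```shell
# Same size query as before, but with object names sorted before the
# batch-check, so index lookups proceed in name order rather than
# traversal order. (Run inside a repository.)
git rev-list --objects HEAD \
  | cut -d' ' -f1 \
  | sort \
  | git cat-file --batch-check='%(objectname) %(objectsize)'
```

Whether this helps at all is exactly the question profiling would answer; the pack-offset re-sort would need support inside git-core.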

I'm not wild about caching this information, so I'd rather try to optimize what is done and the access pattern first.


duanem commented 10 years ago

Our tests have shown that indeed the time is spent in the pipe for referenced_objects. We have little doubt.

I was hoping to encourage investigation of a procedure that would allow a smaller set of commits to be searched rather than the entire repository.

For example, I expect that git doesn't transfer the entire status of the repository when doing a git pull. I expect there is some kind of optimization (I could be completely wrong here. :smile:).

Interestingly, a gc caused 5 minutes to go to 6 minutes. :smile:

jedbrown commented 10 years ago

That was HEAD^.., which limits the scan to the latest commit (well, more than one for a merge). Maybe you want something more nuanced?

git pull can be smarter because it only searches commits and has a DAG available. We have an unstructured bag of objects (the whole point was to avoid being part of Git's DAG).
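To illustrate the difference: with a commit DAG, plain Git can cut the traversal off at a known frontier. The branch names below are hypothetical stand-ins:

```shell
# With a DAG, Git can enumerate only the objects one side is missing:
git rev-list --objects master --not origin/master
# git-fat's store, by contrast, is just a directory of content-addressed
# files with no links between them, so the only way to know what is
# referenced is to examine every candidate object.
```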

duanem commented 10 years ago

That was HEAD^..

I haven't looked at the code specifically in this area. Can you describe what this does to help speed up the process?

For example, we've noticed that git fat checkout is quite fast, but it only concerns itself with the current branch rather than the entire repository. (I understand that the entire repository is appropriate for git fat pull.)

jedbrown commented 10 years ago


Compare git log HEAD^.. to git log. This limiting has nothing to do with git-fat; it's just the way Git parses revision ranges (everything from the first parent of HEAD up to HEAD).
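The comparison is easy to see in any repository:

```shell
git log --oneline            # walks the entire reachable history
git log --oneline HEAD^..    # only first-parent-of-HEAD..HEAD:
                             # one commit on a linear history, the
                             # merged-in commits after a merge
```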