Improve performance when looking up referenced objects.

jmurty commented 10 years ago

Avoid a cat-file subprocess call per fat object blob by doing slightly uglier parsing of "cat-file --batch" that includes object content, instead of "cat-file --batch-check" that doesn't.
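
For readers unfamiliar with the stream format, here is a minimal sketch of what parsing "cat-file --batch" output involves. It is not the code from this PR. The function name `iter_batch_objects` is hypothetical, and the sketch assumes the default output layout documented for `git cat-file --batch`: a `<sha> <type> <size>` header line, then exactly `<size>` bytes of content, then one newline.

```python
import io


def iter_batch_objects(stream):
    """Yield (sha, type, content) tuples from `git cat-file --batch` output.

    `stream` is any binary file-like object (hypothetical helper, not git-fat code).
    """
    while True:
        header = stream.readline()
        if not header:
            break  # end of stream
        fields = header.split()
        if len(fields) != 3:
            # Unknown names are reported as "<name> missing" with no content.
            continue
        sha, objtype, size = fields[0], fields[1], int(fields[2])
        content = stream.read(size)  # content may itself contain newlines
        stream.read(1)               # consume the newline that follows the content
        yield sha.decode(), objtype.decode(), content


if __name__ == "__main__":
    sample = b"aaaa blob 5\nhello\nbbbb missing\ncccc blob 3\na\nb\n"
    for sha, objtype, content in iter_batch_objects(io.BytesIO(sample)):
        print(sha, objtype, content)
```

Because the content of every object passes through the pipe, the cost of this style of parsing grows with total object size, not just object count.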

In my ad-hoc testing on a reasonably sized repository (50k objects from rev-list --objects) this speeds up 'git fat status' by almost 40%.

jedbrown commented 10 years ago

This is an interesting approach, but I'm concerned about what happens for repositories with larger file sizes. For example, I tried running on a repository with 80k objects ranging in size up to a few megabytes (a total of 10 files over 1M) and the original version takes 1.8 seconds while the new version takes 26 seconds. I would rather keep the original's scalability than gain speed only in a special case.

What I think would work better is to use cat-file --batch-check and make a list of only those blobs that match the magic size. Then send the full list to cat-file --batch so that you only generate output for files that match the size criteria.

jmurty commented 10 years ago

I see. This approach is very inefficient as it ends up reading (and mostly ignoring) the contents of all git objects in the system. The local repo I was testing with isn't a good general example as it doesn't contain large objects (they have all been git-fat-ed) and I'm fortunate to be using an SSD drive which makes IO-heavy work relatively painless compared to the subprocess calls.

Leave it with me and I will adapt this PR to take the approach you suggest:

Use cat-file --batch-check initially to make a list of only those blobs that match the magic size, as candidate git-fat object references
Then send the candidate list to cat-file --batch to read the contents of candidates, and handle those that are actual references.
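
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: the helper names (`candidate_shas`, `references_from_batch`, `run_cat_file`, `find_references`) are hypothetical, the magic size and the leading cookie bytes of a reference are passed in as parameters instead of being hard-coded, and object names are fed to git in one shot instead of being streamed.

```python
import io
import subprocess


def candidate_shas(batch_check_lines, magic_len):
    """Pass 1: from `cat-file --batch-check` lines, keep blobs of the magic size."""
    for line in batch_check_lines:
        fields = line.split()
        # Expected "<sha> <type> <size>"; "<name> missing" lines are skipped.
        if len(fields) == 3 and fields[1] == "blob" and int(fields[2]) == magic_len:
            yield fields[0]


def references_from_batch(raw, cookie):
    """Pass 2: from raw `cat-file --batch` output, keep content starting with cookie."""
    stream = io.BytesIO(raw)
    while True:
        header = stream.readline()
        if not header:
            break
        fields = header.split()
        if len(fields) != 3:
            continue
        content = stream.read(int(fields[2]))
        stream.read(1)  # newline after the content
        if content.startswith(cookie):
            yield fields[0].decode(), content


def run_cat_file(mode, shas):
    """Run `git cat-file <mode>` once, feeding one object name per line on stdin."""
    stdin = "".join(sha + "\n" for sha in shas).encode()
    proc = subprocess.run(["git", "cat-file", mode], input=stdin,
                          stdout=subprocess.PIPE, check=True)
    return proc.stdout


def find_references(all_shas, magic_len, cookie):
    """Combine both passes; only size-matching blobs ever have their content read."""
    check_out = run_cat_file("--batch-check", all_shas).decode()
    candidates = list(candidate_shas(check_out.splitlines(), magic_len))
    if not candidates:
        return []
    return list(references_from_batch(run_cat_file("--batch", candidates), cookie))
```

The point of the split is that the first pass only moves headers through the pipe, so large blobs cost nothing beyond a size lookup. Buffering all input and output in memory, as this sketch does, is a simplification; a streaming pipeline would avoid holding the full lists at once.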

I will also test against both a pre- and post-git-fat-ed repository to get a better idea of best/worst case performance effects.

jedbrown commented 10 years ago

Sounds great, thanks!

jmurty commented 10 years ago

Intelligently combining the cat-file --batch-check and --batch commands has fixed the terrible worst-case performance of my prior attempt, and as a bonus is even faster on my test repository with a full order-of-magnitude speed boost.

Hopefully the improvements will also apply to other repositories on other systems.

obazoud commented 10 years ago

Hi folks,

git-fat uses the following algorithm:

p1 | cut_thread | p2 | filter_thread | p3

https://github.com/jedbrown/git-fat/blob/074e89199f880146c0402a8c24e1136bf2bf0414/git-fat#L291

Why not use git ls-tree? Something like this:

p4 | filter_thread | p3

where p4 = git ls-tree --full-tree -l -r HEAD

On the performance side:

p1 and cut_thread are removed because git ls-tree already prints type (blob) and size
p4 has better performance than p1

I checked out a big repo like git (https://github.com/git/git):

 % time git rev-list --objects HEAD 2>&1 > /dev/null    
git rev-list --objects HEAD 2>&1 > /dev/null  1,34s user 0,03s system 99% cpu 1,365 total

 % time git ls-tree --full-tree -l -r HEAD 2>&1 > /dev/null
git ls-tree --full-tree -l -r HEAD 2>&1 > /dev/null  0,01s user 0,00s system 97% cpu 0,017 total

it's pretty fast :)

Sample output of git ls-tree

100644 blob 62cb23dfd37743e4985655998ccabd56db160233   11236    xdiff/xutils.c
100644 blob 4646ce575251b07053f20285be99422d6576603e    1844    xdiff/xutils.h
100644 blob 61e6df0fdce6dfaf38da7af996d7fe801db8f00c    6215    zlib.c
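
To illustrate how type and size could be read straight from such lines, here is a small sketch of a parser. The function name `parse_ls_tree_line` is hypothetical. It assumes the `git ls-tree -l` layout of `<mode> <type> <sha> <size>`, a tab, then the path, and it also accepts lines where the tab has been rendered as spaces, as in the sample above. Note that this listing covers only the tree of the named commit.

```python
def parse_ls_tree_line(line):
    """Split one `git ls-tree -l` line into (mode, type, sha, size, path).

    Hypothetical helper. `size` is None when git prints "-" (non-blob entries).
    """
    line = line.rstrip("\n")
    meta, tab, path = line.partition("\t")
    if tab:
        # Normal case: metadata and path are separated by a single tab.
        mode, objtype, sha, size = meta.split()
    else:
        # Fallback for text where the tab was converted to spaces.
        mode, objtype, sha, size, path = line.split(None, 4)
    return mode, objtype, sha, (None if size == "-" else int(size)), path


if __name__ == "__main__":
    sample = "100644 blob 4646ce575251b07053f20285be99422d6576603e    1844    xdiff/xutils.h"
    print(parse_ls_tree_line(sample))
```

Paths containing unusual characters are quoted by git unless `-z` is used, which this sketch does not handle.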

Thoughts?

Olivier.

jedbrown commented 10 years ago

@obazoud Those two commands do very different things. git ls-tree shows the objects in the current tree while git rev-list --objects shows everything in the history.

jmurty commented 10 years ago

Thanks for the merge @jedbrown

jedbrown / git-fat

Improve performance when looking up referenced objects. #37