[WIP] Improve performance of buildIndex

jfontan commented 6 years ago

A big chunk of the time in a clone is spent building the pack index. The current code uses the decoder to create the index, but a dedicated index-building method could be used instead.

I am testing a two-pass system. In the first pass, the hashes of non-delta objects and the hierarchy of each object's children are generated. In the second pass, the hashes of delta objects are computed and the index is generated. Here are the flame graphs of the index building process with the previous code and the new test.

The test was done by cloning the git://github.com/numpy/numpy.git repository from a local server.

Previous code (11.90s):

buildindex_old

New code (9.52s):

buildindex-new

A lot of time is still spent reading, decompressing, and computing the CRC of data from the pack file, because some objects are read twice. A significant amount of time is also spent in PatchDelta growing its slice.

I am changing the code to add a cache, which may reduce how often objects are read twice, and to make the slices in PatchDelta bigger so they don't need to be grown. This issue will be updated with new findings.
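To show why the slice does not have to grow, here is a sketch of applying a git delta where the output is allocated once, using the target size stored in the delta header. `patchDelta` and `readSize` are written from the git delta format for this example; they are not the PatchDelta code in this repository, and a real implementation should also bound the target size before allocating.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidDelta = errors.New("invalid delta")

// readSize decodes one little-endian base-128 size from a delta header.
func readSize(delta []byte) (size uint, rest []byte, err error) {
	var shift uint
	for i, b := range delta {
		size |= uint(b&0x7f) << shift
		shift += 7
		if b&0x80 == 0 {
			return size, delta[i+1:], nil
		}
	}
	return 0, nil, errInvalidDelta
}

// patchDelta applies a git delta to src. The header carries the target
// size, so dst is allocated once and append never has to grow it.
func patchDelta(src, delta []byte) ([]byte, error) {
	srcSize, delta, err := readSize(delta)
	if err != nil || srcSize != uint(len(src)) {
		return nil, errInvalidDelta
	}
	targetSize, delta, err := readSize(delta)
	if err != nil {
		return nil, errInvalidDelta
	}
	dst := make([]byte, 0, targetSize) // single allocation
	for len(delta) > 0 {
		cmd := delta[0]
		delta = delta[1:]
		switch {
		case cmd&0x80 != 0: // copy a range from src
			var off, size uint
			for i := uint(0); i < 4; i++ { // optional offset bytes
				if cmd&(1<<i) != 0 {
					if len(delta) == 0 {
						return nil, errInvalidDelta
					}
					off |= uint(delta[0]) << (8 * i)
					delta = delta[1:]
				}
			}
			for i := uint(0); i < 3; i++ { // optional size bytes
				if cmd&(0x10<<i) != 0 {
					if len(delta) == 0 {
						return nil, errInvalidDelta
					}
					size |= uint(delta[0]) << (8 * i)
					delta = delta[1:]
				}
			}
			if size == 0 {
				size = 0x10000
			}
			if off+size > uint(len(src)) {
				return nil, errInvalidDelta
			}
			dst = append(dst, src[off:off+size]...)
		case cmd != 0: // insert the next cmd bytes literally
			n := int(cmd)
			if n > len(delta) {
				return nil, errInvalidDelta
			}
			dst = append(dst, delta[:n]...)
			delta = delta[n:]
		default: // cmd == 0 is reserved
			return nil, errInvalidDelta
		}
	}
	if uint(len(dst)) != targetSize {
		return nil, errInvalidDelta
	}
	return dst, nil
}

func main() {
	src := []byte("hello world")
	// Header: source size 11, target size 14. Then copy 6 bytes from
	// offset 0, insert "go ", copy 5 bytes from offset 6.
	delta := []byte{0x0b, 0x0e, 0x90, 0x06, 0x03, 'g', 'o', ' ', 0x91, 0x06, 0x05}
	out, err := patchDelta(src, delta)
	fmt.Printf("%q %v cap=%d\n", out, err, cap(out))
}
```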

ajnavarro commented 6 years ago

Related: https://github.com/src-d/go-git/issues/719 but not the same.

jfontan commented 6 years ago

Adding a cache made the second pass read and decompress fewer objects and made it faster (7 seconds). Overall memory consumption increased 3x.

buildindex-cache
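For illustration only, a cache of this kind could be a byte-bounded LRU keyed by pack offset, which makes the speed versus memory trade-off explicit through a single limit. `baseCache` and its methods are hypothetical names for this sketch and are not the cache used in this change.

```go
package main

import (
	"container/list"
	"fmt"
)

// cacheItem is one cached object, keyed by its offset in the pack file.
type cacheItem struct {
	offset int64
	data   []byte
}

// baseCache is a hypothetical LRU cache bounded by total bytes stored.
type baseCache struct {
	maxBytes int
	curBytes int
	order    *list.List // front = most recently used
	items    map[int64]*list.Element
}

func newBaseCache(maxBytes int) *baseCache {
	return &baseCache{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    make(map[int64]*list.Element),
	}
}

// Get returns the cached content and marks it as recently used.
func (c *baseCache) Get(offset int64) ([]byte, bool) {
	el, ok := c.items[offset]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*cacheItem).data, true
}

// Put stores content and evicts least recently used entries until the
// cache fits within maxBytes. Objects larger than the limit are skipped.
func (c *baseCache) Put(offset int64, data []byte) {
	if len(data) > c.maxBytes {
		return
	}
	if el, ok := c.items[offset]; ok {
		it := el.Value.(*cacheItem)
		c.curBytes += len(data) - len(it.data)
		it.data = data
		c.order.MoveToFront(el)
	} else {
		c.items[offset] = c.order.PushFront(&cacheItem{offset: offset, data: data})
		c.curBytes += len(data)
	}
	for c.curBytes > c.maxBytes {
		it := c.order.Remove(c.order.Back()).(*cacheItem)
		delete(c.items, it.offset)
		c.curBytes -= len(it.data)
	}
}

func main() {
	c := newBaseCache(10)
	c.Put(1, []byte("aaaa"))
	c.Put(2, []byte("bbbb"))
	c.Get(1)                 // offset 1 becomes the most recently used
	c.Put(3, []byte("cccc")) // exceeds 10 bytes, evicts offset 2
	_, ok := c.Get(2)
	fmt.Println("offset 2 cached:", ok, "bytes used:", c.curBytes)
}
```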

jfontan commented 6 years ago

The algorithm was changed to create the dependency tree in the first pass and resolve the objects whose resolved parents were already in the cache. The second pass resolves all the delta trees in order, keeping only the needed bases in memory. This brings memory consumption back to previous levels and means each object is read at most twice. For small and medium repos it is slower than the current method (12.25s), but it is faster on bigger repositories (around 30% for a 500 MB repo).

buildindex-tree

src-d / go-git

[WIP] Improve performance of buildIndex #748