src-d / go-git

Project has been moved to: https://github.com/go-git/go-git
Apache License 2.0

plumbing: packfile, apply small object reading optimization also for delta objects #1121

Closed filipnavara closed 5 years ago

filipnavara commented 5 years ago

In #963 I addressed an issue with reading many small objects from a pack file. I originally opted not to apply the optimization to delta objects. It turns out that when traversing a commit history, the tree objects are often small and delta-encoded. This change extends the original optimization to handle that case too. Tested on the https://github.com/rails/rails repository, it improves complete history traversal time by ~20%.
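To make the idea concrete, here is a minimal sketch of the kind of size gate such an optimization implies. The threshold value and all names are assumptions for illustration, not go-git's actual API: a delta object qualifies for the eager decode-and-cache path only when both the stored delta and its expanded result are small.

```go
package main

import "fmt"

// smallThreshold is an assumed cut-off in bytes; go-git's real value may differ.
const smallThreshold = 16 * 1024

// eligibleForSmallPath reports whether an object is small enough to decode
// eagerly and keep in the cache. For delta objects, both the delta itself
// and its expansion must be small.
func eligibleForSmallPath(isDelta bool, storedSize, expandedSize int64) bool {
	if !isDelta {
		return storedSize <= smallThreshold
	}
	return storedSize <= smallThreshold && expandedSize <= smallThreshold
}

func main() {
	fmt.Println(eligibleForSmallPath(true, 200, 1500))  // small delta-encoded tree: true
	fmt.Println(eligibleForSmallPath(true, 200, 1<<24)) // tiny delta, huge result: false
}
```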

jfontan commented 5 years ago

I'm investigating a decrease in speed when using these changes in gitbase.

When iterating all trees with a cold cache it's faster with these changes (35s before vs 25s after). If we keep the cache, in a second run the previous go-git code is faster (16s vs 25s with these changes). It seems the cache is not being filled correctly: in one profile Packfile.objectAtOffset takes 3.13s, and with these changes it takes 11.08s, most of it spent recreating objects that should already be in the cache.

filipnavara commented 5 years ago

There are a couple of flaws I didn't realize when coding this.

Firstly, it slightly regresses the scenario where someone accesses small delta objects only to check the header. Now it falls back to decoding the content, which can be comparatively expensive. This notably regresses packfile.GetByType iterations and, by proxy, high-level methods like CommitObjects. Not only will it read more data from the file, it can end up loading enough to cause cache spills.

Secondly, I assumed that if the delta size is small and the expanded size is small, then it should not be a big issue to load the base object and just apply the delta. The flaw is that the delta object can reference a huge base object but reuse only a small part of it.
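The flaw is visible right in the delta header. A git pack delta starts with two variable-length sizes, the base object's size and then the result size (7 bits per byte, little-endian, high bit as continuation flag), so a tiny delta can name a huge base while producing a small result. A self-contained decoder for that encoding:

```go
package main

import "fmt"

// decodeDeltaSize reads one of the variable-length sizes at the start of a
// git pack delta: 7 payload bits per byte, least significant group first,
// the high bit marks continuation. Returns the size and bytes consumed.
func decodeDeltaSize(buf []byte) (size int64, n int) {
	var shift uint
	for n < len(buf) {
		b := buf[n]
		n++
		size |= int64(b&0x7f) << shift
		if b&0x80 == 0 {
			return size, n
		}
		shift += 7
	}
	return 0, 0 // truncated header
}

func main() {
	// A 4-byte header claiming: base object is 100000 bytes, result is 50.
	// The delta itself is tiny even though the base is huge.
	hdr := []byte{0xA0, 0x8D, 0x06, 0x32}
	base, n := decodeDeltaSize(hdr)
	result, _ := decodeDeltaSize(hdr[n:])
	fmt.Println(base, result) // 100000 50
}
```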

I have ideas on how to fix both of these issues without introducing new regressions. It will require significant reworking of the logic and a lot of testing, though.

jfontan commented 5 years ago

After more investigation I've found that the iterator is not using the cache to retrieve objects, so delta objects are regenerated even when they are already in the cache. After enabling it, performance improves, but it triggers a bug I am investigating.

filipnavara commented 5 years ago

After more investigation I've found that the iterator is not using the cache to retrieve objects so delta objects are regenerated even when they are already in the cache.

I forgot to fix the "cold" code path in the iterator to try the cache.

Note that if offsetToType is kept around (i.e. packfiles are reused) it will hit the cache properly, because it will always find the offset in this map and then either skip over the object or load it through GetByOffset, which checks the cache.
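A rough sketch of that warm path (simplified types and hypothetical names, not go-git's actual structures): once an offset is known, every fetch goes through a getter that consults the cache first, so a delta is resolved at most once.

```go
package main

import "fmt"

// packfile is a toy model: offsetToType survives across iterations, and
// getByOffset consults the cache before doing any decoding work.
type packfile struct {
	offsetToType map[int64]string
	cache        map[int64][]byte
	decodes      int // counts the expensive decode (inflate + delta apply)
}

func (p *packfile) getByOffset(off int64) []byte {
	if obj, ok := p.cache[off]; ok {
		return obj // cache hit: nothing is re-decoded
	}
	p.decodes++
	obj := []byte{byte(off)} // stand-in for the decoded object
	p.cache[off] = obj
	return obj
}

func main() {
	p := &packfile{
		offsetToType: map[int64]string{42: "tree"},
		cache:        map[int64][]byte{},
	}
	p.getByOffset(42)
	p.getByOffset(42)
	fmt.Println(p.decodes) // decoded only once: 1
}
```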

filipnavara commented 5 years ago

After enabling it the performance increases but triggers a bug I am investigating.

I see... I will look into it tomorrow :-)

Some of the tests don't handle errors properly, so the underlying error ("invalid delta") crashes them hard. For some reason, already-undeltified data gets passed to PatchDelta as input.

filipnavara commented 5 years ago

I think I figured out what's wrong: getObjectType in the iterator code can seek, so the next getNextObject call reads wrong data.

jfontan commented 5 years ago

Thanks!

@mcuadros Now this patch results in better performance in all my tests.