Closed by filipnavara 5 years ago
I'm investigating a decrease in speed when using these changes in gitbase.
When iterating all trees with a cold cache it's faster with these changes (35s vs 25s). We keep the cache, and in a second run the previous go-git code is faster (16s vs 25s). It seems the cache is not being filled correctly: in one profile `Packfile.objectAtOffset` takes 3.13s, while with these changes it takes 11.08s, most of it spent recreating objects that should already be in the cache.
There are a couple of flaws I didn't realize while coding this.
Firstly, it slightly regresses the scenario where someone accesses small delta objects only to check the header. Now it falls back to decoding the content, which can be comparatively expensive. This notably regresses `packfile.GetByType` iterations and, by proxy, high-level methods like `CommitObjects`. Not only will it read more data from the file, it can end up loading enough to cause cache spills.
Secondly, I assumed that if the delta size is small and the expanded size is small, then it should not be a big issue to load the base object and just apply the delta. The flaw is that a delta object can reference a huge base object but reuse only a small part of it.
I have ideas on how to fix both issues without introducing new regressions, but it will require significant reworking of the logic and a lot of testing.
After more investigation I've found that the iterator is not using the cache to retrieve objects so delta objects are regenerated even when they are already in the cache. After enabling it the performance increases but triggers a bug I am investigating.
> After more investigation I've found that the iterator is not using the cache to retrieve objects so delta objects are regenerated even when they are already in the cache.
I forgot to fix the "cold" code path in the iterator to try the cache. Note that if `offsetToType` is kept around (i.e. packfiles are reused) it will hit the cache properly, because it will always find the offset in this map and then either skip over the object or load it through `GetByOffset`, which checks the cache.
> After enabling it the performance increases but triggers a bug I am investigating.
I see... I will look into it tomorrow :-)
Some of the tests don't handle errors properly, so the underlying error ("invalid delta") crashes them hard. For some reason, already-undeltified data gets passed to `PatchDelta` as input.
I think I figured out what's wrong... `getObjectType` in the iterator code can seek, and thus the next `getNextObject` call reads wrong data.
Thanks!
@mcuadros Now this patch results in better performance in all my tests.
In #963 I addressed an issue with reading many small objects from a pack file. I originally opted not to apply the optimization to delta objects. It turns out that when traversing a commit history, the tree objects are often small and delta-encoded. This extends the original optimization to handle that case too. Tested on the https://github.com/rails/rails repository, it improves the complete history traversal time by ~20%.