newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)
Other
8.52k stars 708 forks source link

filter-repo: parse blobs after commits #609

Closed vapaamies closed 1 month ago

vapaamies commented 1 month ago

Performance gain on blob callbacks, now they applied only to filtered commits.

newren commented 1 month ago

Performance gain on blob callbacks, now they applied only to filtered commits.

You didn't test this to verify the performance improvements you claimed, did you?

This doesn't change the order that blobs & commits are emitted by fast-export, it merely changes the order in which we check whether the current line of output from fast-export is a blob/tag/commit/etc. But changing the order in which we check what the current line represents does not change what the current line represents. Since fast-export will always output blobs before the commits that use them, and we parse the information in the order it appears in the stream, we continue parsing blobs before commits with your patch.

Further, the idea of parsing blobs after commits fundamentally makes little sense. Commits have to refer to the blobs used in the commit, otherwise they cannot be written. There's no way for us to have blobs listed later and handle it, unless we completely rearchitected the system to do some kind of deferred resolution. That'd be a pretty big change to the code, likely increase memory pressure and computation overhead, and likely cost more than what you're attempting to save. Especially since...

Even if we created some kind of deferred resolution mechanism so we could parse commits before blobs, doing that would have little performance benefit, because it still forces fast-export to process the blobs we'll end up not using (they'll still be inflated and written to the stream, and parsed by us), and we'll still have to parse it, and we'll likely still write it to the fast-import stream (it just won't be used since the commit will filter out the path), and fast-import will still need to deflate it. I suspect that unless you have an expensive blob callback, its operation is probably in the noise compared to these other steps that you haven't removed.

If you want a way to avoid filtering blobs that are irrelevant (e.g. because you are also using --path directives to trim your repository down), then you have a couple options: