newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Have --analyze provide frequency of changes to individual paths as well #392

Open pmartindev opened 2 years ago

pmartindev commented 2 years ago

When using filter-repo --analyze on problematic repos, especially those with a deep history, I often find that the problematic blobs are not necessarily the largest ones, but medium-sized blobs committed by an automated system, or a common binary that users frequently rewrite. Is there currently an easy way to output the frequency of blobs committed? If not, would there be interest in having a blob-frequency.txt with the blob name and count?
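(For illustration, one literal reading of "frequency of blobs committed" can be sketched with plain Git plumbing: count how often each blob SHA appears across the trees of all commits. This is only a rough sketch and is slow on deep histories; the head -20 cut-off is arbitrary.)

```bash
# Count blob occurrences across every commit's tree; a blob carried
# unchanged through 1000 commits is counted 1000 times.
git rev-list --all |
while read -r commit; do
  git ls-tree -r "$commit"
done |
awk '{print $3}' |   # ls-tree columns: mode, type, sha, path
sort | uniq -c | sort -rn | head -20
```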

newren commented 1 year ago

I don't understand what good that would do. If you have any blob repeated a trillion times, Git will only store one copy of it, so removing highly duplicated blobs doesn't shrink the size of history (well, it does, but only by the size of a single compressed copy).
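(This deduplication is easy to verify directly: identical content always hashes to the same object ID, so Git stores a single copy no matter how many commits or paths reference it. A minimal demonstration:)

```bash
echo 'same payload' > a.bin
cp a.bin b.bin
git hash-object a.bin b.bin   # prints the identical SHA twice: one object, stored once
```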

What can matter is lots of different but similar blobs of medium size, but then you need a way to find those "similar" blobs. The best ways I know of to do that are: (1) filename (e.g. someone stores a medium-sized blob in a given file, then keeps tweaking that file throughout history), (2) directory (e.g. storing a bunch of medium-sized blobs together), (3) extension (e.g. lots of image files or pdfs or presentations or whatever). filter-repo already has facilities for those, though.
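(Concretely, those three groupings are already covered by reports that --analyze writes to .git/filter-repo/analysis/:)

```bash
git filter-repo --analyze
less .git/filter-repo/analysis/path-all-sizes.txt         # (1) per-path sizes
less .git/filter-repo/analysis/directories-all-sizes.txt  # (2) per-directory sizes
less .git/filter-repo/analysis/extensions-all-sizes.txt   # (3) per-extension sizes
```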

The only place I can think of where highly duplicated medium-sized blobs could cause problems is in git bombs, where the concern is not history size but checkout size. However, that's a case where all the blobs are part of the same commit, just checked out at many different paths. But you specifically brought up a deep repository history, which isn't at all a requirement for git bombs, which makes me think it's not relevant for your case.

Perhaps you could clarify a bit more what you are seeing?

pmartindev commented 1 year ago

Thanks for commenting @newren! Yes, in particular, I am referring to your scenario 1, where a blob with a particular filename is rewritten in history many times. I see this often when working with monorepos, where developers unfamiliar with proper Git paradigms will commit the same binary throughout history, either manually or via an automated process. Because it's only a small-to-medium-sized blob, it's not easy to discern from the analysis outputs that it's a good candidate for repository cleanup (unless a user is specifically looking for filename repetition).

This is where I think a blobs-by-count.txt or something similar would be beneficial. It would essentially just be a grep -c of each of the blobs in blob-shas-and-paths.txt, but I think it would add value to highlight this to users of the tool.

Hopefully that clears it up a bit! I don't think this would be a computationally expensive file to generate, and the file itself would be smaller than blob-shas-and-paths.txt, which I know can tend to be 100 MB+ on very large repositories 😅. I'm happy to contribute to this as well 😄

newren commented 1 year ago

Sorry, I'm still confused. My scenario 1 was someone committing a blob (let's say its hash abbreviates to deadbeef01) at some path (let's say subdir/dir2/somefile.ext), then updating those contents periodically, meaning the blob changes. So, subdir/dir2/somefile.ext has an abbreviated hash of deadbeef01 at first, then after the next update it has an abbreviated hash of deadbeef02, then deadbeef03, etc. Checking for duplication by blob_id would give you a count of 1.

I also don't understand what you mean by a grep -c of each of the blobs in the blob-shas-and-paths.txt file. Trying to take a guess, since I'm not sure what you're getting at: is it possible you essentially want blob-shas-and-paths.txt sorted by the lines with the most commas? (When one blob maps to multiple files, the paths where the blob appeared are shown in a comma-separated list.) Or something else? And if that is what you mean, does it really indicate stuff that needs to be cleaned up? Even git.git, which has virtually no binaries, has more than 1500 blobs that appear at multiple paths, 35 that appear at three or more paths, and one that appears at 32 paths. (And no, that highly duplicated blob isn't a binary; it's a small-ish text file.)
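(A rough sketch of that "sort by number of commas" reading, assuming no path itself contains a comma; header lines in the file will also surface, so this is only an approximate sort:)

```bash
# With a comma separator, NF approximates 1 + the number of extra paths
# per blob, so the most widely duplicated blobs float to the top.
awk -F',' '{print NF, $0}' .git/filter-repo/analysis/blob-shas-and-paths.txt |
  sort -rn | head
```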

JLuszawski commented 1 year ago

If I understand @pmartindev right, he wants to know how many times a blob was changed. Since path-all-sizes.txt is already sorted by size, all you need to add is a simple grep -c, as you mentioned, and you can tailor it to your needs. It can be as simple as:

( for f in $(awk '{print $4}' < path-all-sizes.txt ); do echo -n $f " "; grep -c $f blob-shas-and-paths.txt ; done; ) | grep -v ' 1$'

or, if you also want the occupied size:

( for f in $(awk '{print $4 "|" $2}' < path-all-sizes.txt ); do echo -n "${f%|*} | ${f#*|} | "; grep -c ${f%|*} blob-shas-and-paths.txt ; done; ) | grep -v ' 1$'

I know it's not elegant (it doesn't allow spaces in filenames, etc.), but it's simple CLI, and anyone can make their own variations of it.
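(A slightly hardened variant of the same idea, still a sketch with the same space-in-filename limitation: quoting the expansion and using fixed-string matching avoids miscounts when a path contains regex metacharacters such as . or +.)

```bash
awk '{print $4}' < path-all-sizes.txt |
while IFS= read -r f; do
  printf '%s ' "$f"
  grep -cF -- "$f" blob-shas-and-paths.txt   # -F: match the path literally, not as a regex
done | grep -v ' 1$'
```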

newren commented 1 year ago

Ah, a list of paths with a count of changes to the content stored at that path. Thanks for the explanation. Maybe in a file named "frequency-of-changes-per-path.txt"? I'd be fine with something like that.

(As a side note: the above awk/grep pipelines have a potential shortcoming in that they only count the number of unique blobs stored at a given path, which means that if someone reverts to an older version, that change wouldn't be counted. At an extreme, if there were a weird history where people repeatedly reverted the contents of some path back and forth between A and B thousands of times, the frequency count from the above awk/grep pipelines would only be two, because only two unique blobs were ever stored at that path, even though there were thousands of changes to it. Not sure if that matters, but I thought I'd point it out for when you go to implement it.)
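(As a spot-check for that distinction, counting the commits that actually touched a given path, reverts included, is straightforward with plain Git; the path below is the hypothetical one from earlier in the thread:)

```bash
# --full-history keeps every commit that modified the path, so a
# back-and-forth revert between A and B is counted on every flip.
git rev-list --count --full-history HEAD -- subdir/dir2/somefile.ext
```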

pmartindev commented 1 year ago

Thanks for the explanation @JLuszawski. That is exactly what I was trying to convey 😄 Regarding your comment @newren: the scenarios I'm looking to count are those where the unpacked/packed size changes, the intent being to identify paths that frequently change and consume a lot of storage. Given this, would it be more appropriate to add a "number of changes" column to the path-all-sizes.txt file, instead of creating an entirely new file?

newren commented 1 year ago

That's a thought I had too. However, I kind of like the fact that the six files matching *-all-sizes.txt and *-deleted-sizes.txt all have the same format; I'd kind of like to avoid tweaking just one of them. But I'd be fine with duplicating the unpacked/packed sizes in the new file that has the frequency counts.
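(Purely for illustration, the new file could then mirror the existing report layout while adding the count; every name and number below is invented:)

```
=== All paths by frequency of change ===
Format: change count, unpacked size, packed size, path name
   1234    52428800     4815162    build/artifacts/app.bin
     87     1048576      102400    tools/generated/schema.json
```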