kilobyte / compsize

btrfs: find compression type/ratio on a file or set of files

Feature request: compsize --find #46

Open Forza-tng opened 2 years ago

Forza-tng commented 2 years ago

I'd like an option to use compsize to find files with extents matching some criteria and list them.

compsize --find could have the following matches:

It should be possible to combine several matches.

The output can be a table with the path/filename + the chosen matches.

The possibility to sort the output would be great, or an output format that can be piped to sort.

My initial use case is to find highly fragmented files so that I can defragment them manually. I'd also like to analyse my files and their extents to decide whether I should take some action on them.
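Until something like --find exists, the fragmentation use case can be roughly approximated by running compsize once per file and sorting on the extent count from its summary line. The following is only a sketch under assumptions: `parse_summary`, `extent_count` and `rank_by_extents` are hypothetical helper names, the regular expression is based solely on the summary lines quoted in this thread, and extent count is only a rough proxy for fragmentation (compressed files have many small extents by design).

```python
import re
import subprocess

# Matches the summary line as quoted in this thread, e.g.
# "Processed 11696 files, 5763 regular extents (21001 refs), 1134 inline."
SUMMARY_RE = re.compile(
    r"Processed (\d+) files?, (\d+) regular extents \((\d+) refs\), (\d+) inline\."
)


def parse_summary(text):
    """Return (files, extents, refs, inline) from compsize output, or None."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def extent_count(path):
    """Run compsize on a single path and return its regular extent count.

    Hypothetical helper: assumes compsize is installed and that the caller
    has the privileges it needs. Returns None if no summary line is found.
    """
    result = subprocess.run(
        ["compsize", path], capture_output=True, text=True, check=False
    )
    summary = parse_summary(result.stdout)
    return summary[1] if summary else None


def rank_by_extents(paths):
    """Return (extent_count, path) pairs, most extents first."""
    counted = [(extent_count(path), path) for path in paths]
    return sorted((pair for pair in counted if pair[0] is not None), reverse=True)
```

Calling `rank_by_extents` on a list of file paths gives a sortable list, at the cost of starting one compsize process per file. That cost is one reason a built-in option would be preferable.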

kilobyte commented 2 years ago

Any of the numbers other than shared extents are easy to get; shared extents would require two passes as we don't know yet about other files that are yet to be processed.

On the other hand, designing and implementing a reasonable interface is not trivial.

Forza-tng commented 2 years ago

> Any of the numbers other than shared extents are easy to get; shared extents would require two passes as we don't know yet about other files that are yet to be processed.
>
> On the other hand, designing and implementing a reasonable interface is not trivial.

Don't we get shared extents today, at least within the search path? (the referenced vs actual usage)

# compsize home/
Processed 11696 files, 5763 regular extents (21001 refs), 1134 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       56%       89M         158M         471M
none       100%       58M          58M         175M
zstd        30%       30M          99M         296M

Zygo commented 2 years ago

"Referenced" counts the number of times each reference is seen in the files named on the command line. It doesn't tell you whether the underlying extents are shared.

It doesn't handle hard links, and merely repeating a file name gets all of its references "shared":

# compsize foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M         9.6M       
none       100%      9.6M         9.6M         9.6M       
# compsize foo foo
Processed 2 files, 1 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M          19M       
none       100%      9.6M         9.6M          19M       

An extent will appear to be unshared if you didn't provide files containing all of the references on the command line:

# cp --reflink=always foo bar
# compsize foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M         9.6M       
none       100%      9.6M         9.6M         9.6M       
# compsize foo bar
Processed 2 files, 1 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M          19M       
none       100%      9.6M         9.6M          19M       
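
The behaviour in the two examples above can be reproduced with a toy accounting model. This is a sketch of the idea only, not compsize's actual code: `summarize` is a hypothetical function, the extent IDs and sizes are made up, and every reference is assumed to cover its whole, uncompressed extent.

```python
MIB = 1024 * 1024


def summarize(files):
    """Summarize a list of files, each a list of (extent_id, length) references."""
    seen = {}       # extent_id -> length, one entry per unique extent
    referenced = 0  # every reference counts, even a repeated one
    refs = 0
    for file_refs in files:
        for extent_id, length in file_refs:
            seen.setdefault(extent_id, length)
            referenced += length
            refs += 1
    return {
        "extents": len(seen),
        "refs": refs,
        "disk_usage": sum(seen.values()),
        "referenced": referenced,
    }


foo = [("extent-A", 10 * MIB)]  # made-up single-extent file
bar = list(foo)                 # stands in for a reflink copy of foo

print(summarize([foo]))       # one file on its own
print(summarize([foo, foo]))  # same file named twice
print(summarize([foo, bar]))  # file plus its reflink copy
```

In this model, naming `foo` twice and naming `foo` together with `bar` give identical summaries, and `foo` on its own looks unshared. The totals depend only on which references were scanned, so they say nothing about references that were not passed in.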

If we want to know whether an extent is shared in general, we have to look at the backrefs and reference counter for the extent in the extent tree (if it's >1 then we know immediately it is shared), then work backwards up the subvol tree to see if there are multiple roots in its ancestry (we can stop as soon as we find a second parent). The first step is an easy TREE_SEARCH on the extent tree. The second step is expensive. The choices are to use the LOGICAL_INO ioctl, which will find every reference, so it's more work than needed to calculate shared/not-shared; or to read the block device directly, which adds significant extra complexity and some race conditions that will need to be handled somehow (accept lower accuracy or run a retry loop).

That said, running LOGICAL_INO on each unique extent would do the job; it will just take more time and IO than compsize would normally use.
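
The two-step check described above can be sketched as an in-memory model. This is illustrative only: it does not call TREE_SEARCH or LOGICAL_INO, and `TreeBlock`, `Extent` and `is_shared` are hypothetical names standing in for what would be read from the extent tree and the subvolume trees.

```python
from dataclasses import dataclass, field


@dataclass
class TreeBlock:
    """A metadata block in the toy model; parents are the blocks pointing at it."""
    name: str
    parents: list = field(default_factory=list)


@dataclass
class Extent:
    refcount: int    # reference counter recorded for the data extent
    leaf: TreeBlock  # leaf holding the reference when refcount == 1


def is_shared(extent):
    # Step 1: more than one reference means shared, with no further work.
    if extent.refcount > 1:
        return True
    # Step 2: walk up from the leaf. A block with a second parent is
    # reachable from more than one root, so we can stop right there.
    block = extent.leaf
    while block.parents:
        if len(block.parents) > 1:
            return True
        block = block.parents[0]
    return False


subvol = TreeBlock("subvol root")
leaf = TreeBlock("leaf", parents=[subvol])
print(is_shared(Extent(refcount=1, leaf=leaf)))  # single root above the leaf

snapshot = TreeBlock("snapshot root")
leaf.parents.append(snapshot)                    # leaf now has two parents
print(is_shared(Extent(refcount=1, leaf=leaf)))
```

The early exits are the point of the sketch: the check never needs the full list of references, which is why LOGICAL_INO does more work than a shared/not-shared answer strictly requires.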