uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License

count reflink disk usage only once in du #3906

Open xiota opened 1 year ago

xiota commented 1 year ago

uu-du reports little to no additional disk usage for hard links, but double-counts (or triple-counts, or more) the disk space used by reflinks. It would be nice if reflinked data were counted only once.

oyiekeallen commented 1 year ago

I am looking to help out, could you explain this a little more?

xiota commented 1 year ago

The following may help illustrate:

$ dd if=/dev/urandom of=file1 bs=4M count=5
5+0 records in
5+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.0768516 s, 273 MB/s

$ uu-du
20480   .

$ cp --reflink=auto file1 file2

$ uu-du
40960   .

$ dd if=/dev/urandom of=file3 bs=4M count=5
5+0 records in
5+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.0816431 s, 257 MB/s

$ uu-du
61440   .

$ ln file3 file4

$ uu-du
61440   .

$ filefrag -v file1
Filesystem type is: 58465342
File size of file1 is 20971520 (5120 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5119:   77046897..  77052016:   5120:             last,shared,eof
file1: 1 extent found

$ filefrag -v file2
Filesystem type is: 58465342
File size of file2 is 20971520 (5120 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5119:   77046897..  77052016:   5120:             last,shared,eof
file2: 1 extent found

$ filefrag -v file3
Filesystem type is: 58465342
File size of file3 is 20971520 (5120 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5119:   77052017..  77057136:   5120:             last,eof
file3: 1 extent found

$ filefrag -v file4
Filesystem type is: 58465342
File size of file4 is 20971520 (5120 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5119:   77052017..  77057136:   5120:             last,eof
file4: 1 extent found

$ if rmlint --is-reflink file1 file2 ; then echo files are reflinked ; fi
files are reflinked

$ dd if=/dev/urandom of=file5 bs=4M count=5 ; filefrag -v file5
5+0 records in
5+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.0820128 s, 256 MB/s
Filesystem type is: 58465342
File size of file5 is 20971520 (5120 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5119:          0..         0:      0:             last,unknown_loc,delalloc,eof
file5: 1 extent found

When file2, a reflink to file1, is created, uu-du reports that disk usage doubles. But when file4, a hard link to file3, is created, uu-du does not report any additional disk usage.
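The difference between the two cases can be illustrated with a minimal sketch of inode-based deduplication. This is not the uutils implementation; the function name `du_blocks` is made up for illustration. It counts each (device, inode) pair once, which catches hard links. A reflinked copy has its own inode, so a check like this cannot recognise it.

```python
import os

def du_blocks(paths):
    """Sum allocated 512-byte blocks, counting each (device, inode) pair once."""
    seen = set()
    total = 0
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            # Another hard link to an inode that was already counted.
            continue
        seen.add(key)
        total += st.st_blocks
    return total
```

With file3 and file4 from the transcript above, `du_blocks(["file3", "file4"])` would count the data once, while file1 and file2 would each contribute their full block count.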

filefrag reports the same physical_offset for the two reflinked files (file1 and file2), and likewise for the two hard-linked files (file3 and file4).

rmlint may be useful for reference because it has an option to check whether two files are reflinked.

Care needs to be taken when the physical_offset is 0..0 (write cache hasn't been flushed to disk).

On CoW filesystems (e.g. Btrfs), some extents of a file may be shared while others aren't. This case can be deferred.
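To experiment with the extent data shown above, the `filefrag -v` output can be parsed. This is only a rough sketch: the `Extent` type and `parse_filefrag` function are hypothetical names, the text format of filefrag is not a stable interface, and a real implementation would more likely query extents directly (filefrag itself uses the FIEMAP ioctl).

```python
import re
from typing import NamedTuple

class Extent(NamedTuple):
    logical_start: int
    logical_end: int
    physical_start: int
    physical_end: int
    length: int
    flags: frozenset

# Matches rows such as:
#    0:        0..    5119:   77046897..  77052016:   5120:             last,shared,eof
# The "expected" column is optional and ignored.
EXTENT_RE = re.compile(
    r"^\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+):\s*(?:(\d+):)?\s*(\S*)\s*$"
)

def parse_filefrag(output):
    """Return the extents found in the text printed by `filefrag -v`."""
    extents = []
    for line in output.splitlines():
        m = EXTENT_RE.match(line)
        if m is None:
            continue  # header, summary, or unrelated line
        ls, le, ps, pe, length = (int(m.group(i)) for i in range(1, 6))
        flags = frozenset(f for f in m.group(7).split(",") if f)
        extents.append(Extent(ls, le, ps, pe, length, flags))
    return extents
```

An extent whose flags contain `delalloc` (as for file5 above) has no physical location yet and needs separate handling; `shared` marks extents referenced by more than one file.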

xiota commented 1 year ago

Some pseudocode for how this could be implemented. I don't know Rust, so this will look sort of like Python:

def total_extent_blocks(files):
    # create list of (physical_start, physical_end, length), then sort
    offset_list = []
    for f in files:
        offset_list.append((f.physical_offset_start, f.physical_offset_end,
                            f.logical_offset_end - f.logical_offset_start + 1))
    offset_list.sort()

    # instead of creating the list then sorting, may be more efficient to insert into sorted position

    total_size = 0
    counted_end = -1  # last physical block that has already been counted
    for start, end, length in offset_list:
        # write cache hasn't been flushed to disk, so there is no physical location yet
        if start == 0 and end == 0:
            total_size += length
            continue

        if end <= counted_end:
            # extent duplicates, or lies inside, extents already counted
            continue
        if start <= counted_end:
            # extent overlaps extents already counted: count only the new tail
            start = counted_end + 1

        total_size += end - start + 1
        counted_end = end

    return total_size

oyiekeallen commented 1 year ago

Nice, thank you. Will take a look now

cre4ture commented 6 months ago

Hello, is there any update on this topic? @oyiekeallen @xiota I'm looking for a task for my very first Rust contribution, and this sounds interesting to me. @oyiekeallen, as there has been no update for more than a year, can I assume that you are no longer working on it?

sylvestre commented 6 months ago

No update for a long time, so you can clearly go ahead :)

cre4ture commented 6 months ago

@sylvestre thanks for your confirmation :-)

I had actually already started hacking around in the source code for this topic. Now I'm starting to clean everything up and test it properly for the pull request.

I've got a few questions regarding the exact desired behaviour and the testing infrastructure:

/mnt/btrfstest/c

$ du --total largefile1.dd largefile1_ln_hardlink.dd largefile1_cp_reflink_always.dd largefile1_cp_reflink_always_partially_modifed.dd
30720   largefile1.dd
30720   largefile1_cp_reflink_always.dd
30720   largefile1_cp_reflink_always_partially_modifed.dd
92160   total


- [ ] For an automated test based on the full executable and a real filesystem, we need a btrfs filesystem on the host. My machine only has ext4, so for my tests I had to create my own btrfs on a loop device and force the test to use a path on that filesystem. But this requires manual steps. How can we avoid them? Shall we enable the test only on hosts whose default filesystem is btrfs? Or should I introduce a new testing method where the filesystem is mocked?
- [ ] Is there a framework for logging debugging information in uutils? Are developers supposed to implement logging for the utilities or not? I have some, but it is implemented in a hacky way.
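One possible way to decide at run time whether a reflink test can run is to attempt a clone in the target directory and skip the test if it fails. The sketch below is only an illustration of that idea, not part of the uutils test suite. `supports_reflink` is a hypothetical helper, and it assumes Linux, where `FICLONE` is the clone ioctl defined in `<linux/fs.h>`.

```python
import fcntl
import os
import tempfile

FICLONE = 0x40049409  # Linux ioctl request number for FICLONE (assumes Linux)

def supports_reflink(directory):
    """Return True if cloning a small file succeeds inside `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        src_path = os.path.join(tmp, "src")
        dst_path = os.path.join(tmp, "dst")
        with open(src_path, "wb") as src:
            src.write(b"x" * 4096)
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            try:
                # Ask the filesystem to make dst share src's extents.
                fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            except OSError:
                return False  # filesystem (or kernel) cannot clone here
    return True
```

A test could call this on its working directory and skip itself when the result is False. It does not remove the need for a reflink-capable filesystem somewhere in CI.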

sylvestre commented 6 months ago

Easy: match what GNU is doing :)

cre4ture commented 6 months ago

Thanks for the quick response. :-)

But it doesn't help me much. I think "extent-aware du" would actually be a new feature for GNU du as well. I checked the GNU du source code and found extent- or fiemap-related code only in the cp tool.

Can you confirm this?

EDIT: a two-year-old post: https://superuser.com/questions/1645380/where-is-shared-du-reflink-tools

xiota commented 6 months ago

What I had in mind when I opened this issue is that shared extents would count only once... because that's how much disk space is actually used. So files with 75% extents shared by other files would count only for the 25% that are new.
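The per-file accounting described here can be sketched as follows. This is an illustration under assumptions, not a proposed implementation: the names `new_blocks_per_file` and `_add_range` are made up, extents are given as inclusive physical block ranges, and extents without a physical location (0..0) are assumed to be filtered out beforehand.

```python
def _add_range(counted, start, end):
    """Merge [start, end] into `counted` (a list of disjoint ranges).

    Returns the number of blocks in the range that were not covered before.
    """
    new = end - start + 1
    lo, hi = start, end
    kept = []
    for s, e in counted:
        if e < start or s > end:
            kept.append((s, e))  # no overlap, keep unchanged
        else:
            new -= min(e, end) - max(s, start) + 1  # part already counted
            lo, hi = min(lo, s), max(hi, e)         # absorb into merged range
    kept.append((lo, hi))
    counted[:] = kept
    return new

def new_blocks_per_file(files):
    """Map each file name to the blocks not already counted for an earlier file.

    `files` maps a name to a list of (physical_start, physical_end) ranges.
    """
    counted = []
    result = {}
    for name, extents in files.items():
        result[name] = sum(_add_range(counted, s, e) for s, e in extents)
    return result
```

For a file of 100 blocks where 75 blocks are shared with an earlier file, only the 25 new blocks are attributed to it. With the extents from the filefrag output earlier in the thread, file1 and file3 would each be charged 5120 blocks and file2 and file4 none.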

I don't know Rust well enough to do more than write some pseudocode for a potential algorithm.

As far as I know, GNU du doesn't implement this. (At least, not in the version available on my system.) If the objective is to follow GNU du, then implementing this should probably be deferred.

cre4ture commented 6 months ago

Let's see, maybe I'll then try to implement this in GNU du as well, so that we are sure to have the same implementation :-D

cre4ture commented 6 months ago

@xiota May I ask you to create an issue with the same content as this one for GNU du? Here is the link: https://github.com/coreutils/coreutils/issues

sylvestre commented 6 months ago

Sorry, but I would prefer that we focus our energy on reaching parity with GNU first, before adding such features.