Use filefrag -sv to fingerprint COW Reflink extents

jrw commented 2 years ago

I would like to suggest an enhancement to the filefrag utility. Here's the scenario I'm envisioning:

I'm using btrfs and would like to determine if two files are identical COW reflinks of one another. So, I use

FILE1_extents=$(filefrag -sv FILE1)
FILE2_extents=$(filefrag -sv FILE2)

Each output looks like:

Filesystem type is: 9123683e
File size of FILE1 is 325869576 (79559 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:  178486121.. 178518888:  32768:             shared
   1:    32768..   65535:  178527488.. 178560255:  32768:  178518889: shared
   2:    65536..   79558:  178646376.. 178660398:  14023:  178560256: last,shared,eof
FILE1: 3 extents found

If I could remove the mentions of FILE1 on the 2nd line and last line, then I could directly compare the values of $FILE1_extents and $FILE2_extents. I believe that if the two files are on the same file system, then the two values will be identical IFF the data for the two files are identical reflinks of each other.
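The comparison described above can be sketched in Python without changing filefrag. This is a minimal sketch, assuming the output layout shown in the listing; the helper name `extent_fingerprint` is hypothetical. It only drops the two lines that carry the file name, and it is meaningful only for two files on the same file system.

```python
import re


def extent_fingerprint(filefrag_output: str) -> str:
    """Return `filefrag -sv` output with the two file-name lines removed."""
    kept = []
    for line in filefrag_output.splitlines():
        # Second line of the listing: "File size of NAME is ..."
        if line.startswith("File size of "):
            continue
        # Last line of the listing: "NAME: N extents found"
        if re.search(r": \d+ extents? found$", line):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept)


# The listing from this issue; only the file name is specific to FILE1.
SAMPLE = """\
Filesystem type is: 9123683e
File size of FILE1 is 325869576 (79559 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:  178486121.. 178518888:  32768:             shared
   1:    32768..   65535:  178527488.. 178560255:  32768:  178518889: shared
   2:    65536..   79558:  178646376.. 178660398:  14023:  178560256: last,shared,eof
FILE1: 3 extents found
"""

# The same extents reported under another name give the same fingerprint.
print(extent_fingerprint(SAMPLE) == extent_fingerprint(SAMPLE.replace("FILE1", "FILE2")))
```

In practice the two strings would come from running `filefrag -sv` on each file, as in the shell assignments above.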

Here is a possible enhancement which would make that reliable:

Add the device number to the output
Omit the file name when exactly one file is specified on the command line, similar to GNU grep -h

I have created pull request #87 for this enhancement.

Note: the case where there is no data block (e.g. inlined data) also has to be handled. That would be a more complicated change to the filefrag code, requiring adding an option. I handle this by grepping for inline|unknown_loc|delalloc to know which files cannot be reflinked.
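The grep-based check can be written as a small Python filter. This is a sketch under the assumption that the flags are the last colon-separated field of each extent line, as in the listing above; the function names are hypothetical, and the inline example line is made up for illustration.

```python
# Flags that, per the note above, mean the extent has no comparable data block.
UNSHAREABLE_FLAGS = {"inline", "unknown_loc", "delalloc"}


def extent_flags(filefrag_output: str) -> set:
    """Collect every flag that appears on any extent line."""
    flags = set()
    for line in filefrag_output.splitlines():
        fields = line.split(":")
        # Extent lines start with a numeric index and contain "start..end" ranges.
        if fields[0].strip().isdigit() and ".." in line:
            flags.update(f.strip() for f in fields[-1].split(",") if f.strip())
    return flags


def has_comparable_extents(filefrag_output: str) -> bool:
    """False if any extent is inline, delayed-allocation, or of unknown location."""
    return not (extent_flags(filefrag_output) & UNSHAREABLE_FLAGS)


# First line is taken from the listing above; the second is a made-up inline extent.
shared_line = "   0:        0..   32767:  178486121.. 178518888:  32768:             shared"
inline_line = "   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof"

print(has_comparable_extents(shared_line))
print(has_comparable_extents(inline_line))
```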

tasket commented 1 year ago

Since I've started adding filefrag -v output to my own backup utility which uses snapshots (in this case, reflink copies) of disk images to find changes, I think it would be nice to have a plainer or stricter output mode. But I would vote for something like XML or JSON instead of tweaking the current format.

BTW, the lvm utilities have both --noheadings and --separator options that make parsing easier, even if JSON output would be better.

Here's an example of using filefrag's FIEMAP output to find differences between two reflinked files. It skips over the headings, filters out cols 1,7,8 and removes any non-digit chars before piping that into sort and uniq. After that, you can simply use the two leftmost "logical" columns as a list of ranges where the files differ.
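A minimal Python sketch of the approach just described, not the commenter's original script: it keeps only the logical range, physical range, and length of each extent (dropping the index, expected, and flags columns), then reports the logical ranges of extents that appear in only one of the two files. The names `parse_extents` and `differing_ranges` are hypothetical, and the relocated extent in `OUT_B` uses invented physical values.

```python
import re

# index: logical_start..logical_end: physical_start..physical_end: length:
EXTENT_RE = re.compile(
    r"^\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+):"
)


def parse_extents(filefrag_output: str) -> set:
    """Return a set of (log_start, log_end, phys_start, phys_end, length) tuples."""
    extents = set()
    for line in filefrag_output.splitlines():
        match = EXTENT_RE.match(line)
        if match:  # header and summary lines do not match and are skipped
            extents.add(tuple(int(group) for group in match.groups()))
    return extents


def differing_ranges(out_a: str, out_b: str) -> list:
    """Logical (start, end) block ranges whose extents are not common to both files."""
    only_in_one = parse_extents(out_a) ^ parse_extents(out_b)
    return sorted({(ext[0], ext[1]) for ext in only_in_one})


# Extent lines from the listing in this issue.
OUT_A = """\
   0:        0..   32767:  178486121.. 178518888:  32768:             shared
   1:    32768..   65535:  178527488.. 178560255:  32768:  178518889: shared
   2:    65536..   79558:  178646376.. 178660398:  14023:  178560256: last,shared,eof
"""
# Same file after a hypothetical rewrite of the middle extent (invented addresses).
OUT_B = OUT_A.replace("178527488.. 178560255", "200000000.. 200032767")

print(differing_ranges(OUT_A, OUT_B))
```

Unlike a sort and uniq pipeline, the set difference here does not depend on column widths, but it relies on the same assumption: the "physical" numbers are consistent between the two files.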

tytso commented 1 year ago

So a couple of things about the device number. Currently FIEMAP does not return the device number. Fetching the device number from stat is going to be highly misleading, since (a) btrfs can support multiple devices, and (b) btrfs can support multiple subvolumes. What btrfs returns in stat is not necessarily the physical device when multiple subvolumes are involved. And when multiple devices are involved and a file can span them, returning a single "device" number is going to be misleading or wrong.

I have no objections to adding a mode where filefrag returns the contents of the returned struct fiemap_extent from the FIEMAP ioctl in some kind of easily parsable format, whether that's CSV, TSV, or JSON. But I don't believe in trying to make up block device numbers by using stat and then returning that as the physical location for all of the fiemap extents, because while that might work for ext4, it most certainly will not always be correct for btrfs. If you want the physical block device for each fiemap extent, please contact the btrfs kernel developers and get them to extend the FIEMAP ioctl. There are reserved fields in the struct fiemap_extent that could potentially be used for the block device number. But that needs to go into the upstream Linux kernel first. And then if filefrag returns information which is "wrong", it will be because it returns exactly what the kernel has returned in struct fiemap_extent, and once again, I will refer complainers to the btrfs kernel developers.

tasket commented 1 year ago

@tytso As per my comment in #84, some use cases (like the one in this issue) don't require actual physical addresses and device refs. They only need some consistent addressing scheme, even if it's virtual, to show commonalities. (The situation would be similar for an XFS filesystem that is on a raid layer.)

So my comments here are not a complaint, but more of a recognition that @jrw can still accomplish their task using filefrag with the virtual Btrfs numbers in the "physical" columns.

As for the stat device numbers, this will be different for two files that are on different filesystems.

jrw commented 1 year ago

@tytso The idea I'm looking for with a device number is something to disambiguate files on different filesystems. I don't know what that might be for btrfs. My original thought was devno would be appropriate, but I can see that's wrong. The idea behind removing the filename (in the case of only one filename) is to allow the output to be used as is as an identifier for a file's "extent identity" so that id could be compared between files to determine if they are reflinks of each other with exactly matching extents. It appears that the output of filefrag -sv is identical between two reflinked files except for the filename itself.

tytso commented 1 year ago

If you're trying to figure out whether two files belong to the same file system, why not just fetch st_dev via "stat -c %d ", as opposed to asking for a change in filefrag? Again, this may be misleading in some circumstances, because of tricks and games that btrfs is playing with st_dev and the fact that there is not a device specifier in FIEMAP. But these are issues you need to take up with the btrfs developers; they are not under my control.
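The same check is available from Python through `os.stat`. This is a minimal sketch with a hypothetical helper name; as noted in the comment above, equal or unequal `st_dev` values may be misleading on btrfs, where subvolumes of one file system can report different device numbers.

```python
import os


def same_st_dev(path_a: str, path_b: str) -> bool:
    """True if both paths report the same st_dev (what `stat -c %d` prints)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Trivial self-comparison; real use would pass the two files being compared.
print(same_st_dev(".", "."))
```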

I don't believe you can count on filefrag -sv being identical between two reflinked files in all cases, because btrfs supports a file system that spans multiple devices. How do you know whether block 42 is on device A or device B? It might work "most" of the time, but again, I don't control btrfs, so please take it up with the btrfs developers.

jrw commented 1 year ago

I am currently using stat to get the st_dev and I will continue to do so if there's nothing that can be done accurately at the level of filefrag. Also, I hear your point that (maybe?) the filefrag information is not (cannot be) totally accurate due to block numbering across multiple devices.

However, I think that the actual change I'm requesting (remove the filename from the output when there is only one file) would not be hard or out-of-line for filefrag to implement. But, if the identification of identical reflink files cannot really be achieved with filefrag output (due to the points you've mentioned), then maybe it's not really a priority.

In any case, I did want to make you aware of this attempt to use the information from filefrag -sv to identify identical reflink files. Also, it would obviously be nice/useful to have some kind of utility directly from the btrfs/xfs developers which could be used to identify these identical reflink files. I don't know how to implement it there (so I couldn't make a PR), but I can raise an issue with them. Thanks for your time!

jrw commented 1 year ago

Submitted https://bugzilla.kernel.org/show_bug.cgi?id=217068 to btrfs devs (hopefully).

tytso commented 1 year ago

It's not that the information printed by filefrag -v is inaccurate; it's just that it is incomplete if your goal is to determine whether two reflinked files are pointing at the same set of blocks. You're trying to use filefrag in a way that it wasn't originally intended; for that matter, the FIEMAP ioctl wasn't intended for this use.

My complaint about the patch is that changing filefrag -v to print the st_dev is (a) no different from your using stat(1) to get the st_dev, and (b) might mislead people into thinking that this is actually a reliable way of determining that two files are identical.

I won't object to a patch which adds an option for filefrag -v to print information in a machine-parsable format (although I wonder if using perl and h2ph to directly call the fiemap ioctl might be an easier path forward for you). It's still not going to be reliable for all btrfs setups, though. If you know that you aren't using any btrfs subvolumes and multiple devices for your btrfs file system, it'll probably work fine --- but if you share it on stackexchange or some such, and someone uses your tool in a way that you don't expect, then it's not doing them a service.

jrw commented 1 year ago

See my new PR https://github.com/tytso/e2fsprogs/pull/132 which omits printing st_dev.

tasket commented 1 year ago

I think there's been a misapprehension about the Btrfs data from FIEMAP. It represents an internally consistent virtual address space, regardless of whether or not the fs contains multiple physical devices. If the misapprehension is on my part, I'd really like to know why for instance the dev numbers are necessary in order to be accurate.

the FIEMAP ioctl wasn't intended for this use

Honestly, it's a low-level OS feature designed to provide information. If it has enough block address info to determine fragmentation, it also has enough to determine if two files have the same composition.

tasket commented 1 year ago

Here's the tell for me: No one is objecting to the quality of data (for this use case) when it's XFS on top of RAID. No one is saying only specific types of block layers will produce consistent FIEMAP block addresses.

tasket commented 1 year ago

The question for this use case is whether an extent has a unique block address across multiple devices. You are implying in Btrfs it doesn't, or that the ioctl somehow leaves out a part of the address.

It's not that the information printed by filefrag -v is inaccurate; it's just that it is incomplete if your goal is to determine whether two reflinked files are pointing at the same set of blocks.

I believe this to be untrue. FIEMAP data describes the data locations of the file within the context of whatever block layer the fs is using. In Btrfs' case an internal raid-like block layer is used and the need for device numbers is obviated. You won't acknowledge this.

Also, you single out Btrfs and don't warn against "incomplete" data when using filefrag on top of mdraid. Seems like a blanket caveat against anything raid-like would be in order.

If you know that you aren't using any btrfs subvolumes and multiple devices for your btrfs file system, it'll probably work fine --- but

This appears to be spillover of the Linux developer controversy about Btrfs subvolume inode numbers, and implying something that isn't true. Extents aren't inodes and Btrfs is not hocus pocus. If Btrfs devs like @osandov say that multiple devices are accounted for in their virtual block addresses then I tend to believe them.

At this point I just need to know if filefrag -v will continue to output all of the FIEMAP ioctl results for file data going into the future. If there's no commitment on that point then I'll create a wyng-backup issue to eventually replace filefrag.

tytso commented 1 year ago

In the XFS raid case, the numbers reported by FIEMAP are the logical block numbers for the raid device --- that is, they are suitable for use by things like the GRUB bootloader installation program to open /dev/mdXX, seek to a particular offset, and write the bootloader to the correct location on disk.

In the case of BTRFS, there is no /dev/mdXX RAID device, because btrfs subsumes the RAID layer. And so if there is a need to directly write to the file system using a physical block number, you need to know which device to open, and which physical block offset it refers to. Finally, FIEMAP does not define some vague, mystical "virtual block numbers". It explicitly talks about physical offsets:

struct fiemap_extent {
        __u64 fe_logical;  /* logical offset in bytes for the start of
                            * the extent from the beginning of the file */
        __u64 fe_physical; /* physical offset in bytes for the start
                            * of the extent from the beginning of the disk */
        __u64 fe_length;   /* length in bytes for this extent */
        __u64 fe_reserved64[2];
        __u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
        __u32 fe_reserved[3];
};

So, riddle me this. If btrfs supports multiple disks, and fe_physical is the physical offset in bytes for the start of the extent from the beginning of the disk, exactly how does btrfs fill in the fe_physical field? Remember, in btrfs disks can be added and removed, so there is no /dev/mdXX RAID device that you can open to get a physical offset with respect to the RAID device (which is how it works when you use XFS on top of MD RAID, or on top of LVM).
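To make the field layout concrete, here is a minimal Python sketch that decodes raw struct fiemap_extent records following the definition quoted above. It does not issue the FIEMAP ioctl itself; the buffer is assumed to hold records already returned by the kernel, and the names `Extent` and `decode_extents` are hypothetical. Note that the struct has no field that identifies a device.

```python
import struct
from typing import List, NamedTuple

# struct fiemap_extent: fe_logical, fe_physical, fe_length (u64 each),
# fe_reserved64[2] (u64), fe_flags (u32), fe_reserved[3] (u32).
# "=" selects native byte order with standard sizes and no padding: 56 bytes.
FIEMAP_EXTENT = struct.Struct("=QQQ2QI3I")

FIEMAP_EXTENT_SHARED = 0x00002000  # flag value from linux/fiemap.h


class Extent(NamedTuple):
    logical: int   # byte offset within the file
    physical: int  # byte offset as reported by the file system
    length: int    # extent length in bytes
    flags: int     # FIEMAP_EXTENT_* bits


def decode_extents(buf: bytes, count: int) -> List[Extent]:
    """Decode `count` consecutive fiemap_extent records from `buf`."""
    extents = []
    for i in range(count):
        fields = FIEMAP_EXTENT.unpack_from(buf, i * FIEMAP_EXTENT.size)
        # fields[3:5] and fields[6:9] are the reserved members and are ignored.
        extents.append(Extent(fields[0], fields[1], fields[2], fields[5]))
    return extents


# Synthetic record built from extent 0 of the filefrag listing in this issue
# (block numbers multiplied by the 4096-byte block size).
raw = FIEMAP_EXTENT.pack(0, 178486121 * 4096, 32768 * 4096, 0, 0,
                         FIEMAP_EXTENT_SHARED, 0, 0, 0)
print(decode_extents(raw, 1))
```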

tytso commented 1 year ago

By the way, I suggest you meditate on [1] and read these two articles [2][3]. BTRFS has functionality which is like RAID, but it is not traditional RAID the way LVM and MD block devices are structured. In particular, I suggest you take a look at how btrfs chunk IDs are used and named (via UUID) and how they map to btrfs stripes, which have their own device and physical offsets. There is no such thing as a virtual lba in btrfs; I'm quite convinced you don't know what you are talking about.

At this point, I suggest that you ask that a BTRFS developer, such as Josef Bacik or Chris Mason, talk to me directly. I generally see them at least once or twice a year, at events such as the Linux File Systems, Storage, and MM workshop, as well as the Linux Plumbers events, and we chat about file system issues when we meet.

[1] https://btrfs.wiki.kernel.org/index.php/Data_Structures
[2] https://www.oracle.com/technical-resources/articles/it-infrastructure/admin-advanced-btrfs.html
[3] https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/

tasket commented 1 year ago

FIEMAP does not define some vague, mystical "virtual block numbers". It explicitly talks about physical offsets:

All this says is that the working definition of "physical" changes depending on the context. Insisting on strict interpretation of those field labels means the "physical" XFS data is also "wrong" without mdraid; Btrfs has internal raid so the "physical" data is "wrong" without Btrfs. RAID is shown as a formal Btrfs concept numerous times on the Data Structures page you linked.

For my purposes, only a unique (physical or virtual) address for each extent is required. You questioned how Btrfs could even calculate a logical address and I linked to an example from someone who claims to be a Btrfs developer in kernel.org issues (which is not exactly stackexchange).

From the on-disk format reference (emphasis mine):

Btrfs makes a distinction between logical and physical addresses. Logical addresses are used in the filesystem structures, while physical addresses are simply byte offsets on a disk. One logical address may correspond to physical addresses on any number of disks, depending on RAID settings. The chunk tree is used to convert from logical addresses to physical addresses

Citing data structures:

CHUNK_TREE (3)

The chunk tree holds all DEV_ITEMs and CHUNK_ITEMs, making it possible to determine the device(s) and physical address(es) corresponding to a given logical address. It is therefore crucial for access to the contents of the filesystem.
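The logical-to-physical translation described in these excerpts can be illustrated with a toy model. This is a sketch only: the `Chunk` and `Stripe` types, the `resolve` function, and all addresses are hypothetical, and it models just the simple case where every stripe of a chunk holds a full copy (as in single or mirrored profiles), not striped or parity layouts.

```python
from typing import List, NamedTuple, Tuple


class Stripe(NamedTuple):
    devid: int   # which device holds this copy
    offset: int  # physical byte offset of the chunk on that device


class Chunk(NamedTuple):
    logical_start: int  # first logical address covered by the chunk
    length: int         # number of logical bytes covered
    stripes: List[Stripe]


def resolve(chunks: List[Chunk], logical: int) -> List[Tuple[int, int]]:
    """Map one logical address to every (devid, physical offset) holding it."""
    for chunk in chunks:
        if chunk.logical_start <= logical < chunk.logical_start + chunk.length:
            delta = logical - chunk.logical_start
            return [(s.devid, s.offset + delta) for s in chunk.stripes]
    raise KeyError(f"logical address {logical} is not covered by any chunk")


GiB = 1 << 30
MiB = 1 << 20
# One mirrored chunk with invented addresses: the same logical range lives on
# device 1 and device 2 at different physical offsets.
toy_chunks = [Chunk(1 * GiB, 1 * GiB, [Stripe(1, 20 * MiB), Stripe(2, 276 * MiB)])]

print(resolve(toy_chunks, 1 * GiB + 4096))
```

In this model a single logical address identifies the data no matter how many devices hold it, while each physical offset is meaningful only together with its device ID.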

EXTENT_DATA (6c)

(inode id, 6c, offset in file) TODO

The contents of a file.
Off(hex)  Size  Type  Description
0         8     UINT  generation
8         8     UINT  (n) size of decoded extent
10        1     UINT  compression (0=none, 1=zlib, 2=LZO)
11        1     UINT  encryption (0=none)
12        2     UINT  other encoding (0=none)
14        1     UINT  type (0=inline, 1=regular, 2=prealloc)
15

If the extent is inline, the remaining item bytes are the data bytes (n bytes in case no compression/encryption/other encoding is used).

Otherwise, the structure continues:
Off(hex)  Size  Type  Description
15        8     UINT  (ea) logical address of extent. If this is zero, the extent is sparse and consists of all zeroes.
1d        8     UINT  (es) size of extent
25        8     UINT  (o) offset within the extent
2d        8     UINT  (s) logical number of bytes in file
35

ea and es must exactly match an EXTENT_ITEM.
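The two tables translate directly into a decoder. This is a minimal sketch, assuming the hexadecimal offsets listed above and the little-endian encoding of the Btrfs on-disk format; the function name and dictionary keys are hypothetical, and the test item is synthetic.

```python
import struct

# Offsets 0x00-0x14: generation, decoded size, compression, encryption,
# other encoding, type. "<" means little-endian with no padding (0x15 bytes).
HEADER = struct.Struct("<QQBBHB")
# Offsets 0x15-0x34: ea, es, o, s (present only when the extent is not inline).
REGULAR = struct.Struct("<QQQQ")

EXTENT_TYPES = {0: "inline", 1: "regular", 2: "prealloc"}


def decode_extent_data(item: bytes) -> dict:
    """Decode an EXTENT_DATA item following the layout in the tables above."""
    generation, size, compression, encryption, other, kind = HEADER.unpack_from(item, 0)
    result = {
        "generation": generation,
        "decoded_size": size,
        "compression": compression,
        "encryption": encryption,
        "other_encoding": other,
        "type": EXTENT_TYPES.get(kind, "unknown"),
    }
    if kind == 0:
        # Inline extent: the remaining bytes are the file data itself.
        result["inline_data"] = item[HEADER.size:]
    else:
        ea, es, o, s = REGULAR.unpack_from(item, HEADER.size)
        result.update(logical_address=ea, extent_size=es, offset=o, num_bytes=s)
    return result


# Synthetic regular extent with an invented logical address.
item = HEADER.pack(7, 4096, 0, 0, 0, 1) + REGULAR.pack(731079151616, 8192, 0, 4096)
print(decode_extent_data(item)["logical_address"])
```

The `logical_address` field is the (ea) value from the table, which is a logical address and carries no device number.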

This corroborates what osandov stated about Btrfs using a logical address system to span multiple devices, as well as the fact that the provided code displays them. Of course, this program is mainly concerned with adding device numbers to the picture, so it's overkill for my application. There is nothing to suggest that logical addresses are localized instead of global, or that they are not unique keys.

Meanwhile, I'm reading repeatedly in this issue about the needs of boot loaders, de-fraggers and the "need to directly write to the file system using a physical block number"... none of which concern the use case of determining data identity. Of course I am fine with Btrfs devs chiming in.

tytso / e2fsprogs

Use filefrag -sv to fingerprint COW Reflink extents #88