markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
794 stars 78 forks source link

no check for C file attribute #251

Open brainchild0 opened 3 years ago

brainchild0 commented 3 years ago

From casual use, it appears that the application gives no special handling to files with a no copy on write (C) file attribute set. However, when the application eventually attempts de-duplication, it produce warning messages for such files.

Perhaps checking the attribute beforehand, and skipping files a set value, is a good enhancement.

lorddoskias commented 3 years ago

First this issue is specific to btrfs since the other supported filesystem (XFS) doesn't seem to be using the nocow flag. The actual issue is that the remap file range ioctl doesn't support calling it on files with different +C states. Namely if one of the files is NOCOW and the other is not -> you get an error. But if you have 2 files with +C then dedupe actually works.

brainchild0 commented 3 years ago

It seems to me that the desired behavior is not to produce de-duplication except with pairs (or larger sets) of files that all have the C attribute clear. Then, it is sufficient and desirable to ignore all files that have the C attribute set.

lorddoskias commented 3 years ago

Well,

It seems to me that the desired behavior is not to produce de-duplication except with pairs (or larger sets) of files that all have the C attribute clear. Then, it is sufficient and desirable to ignore all files that have the C attribute set.

Well, that is your desired behavior which might not necessary be that of other users of duperemove. Also dedupe will be successful if both files have +C. The fact that if both files have +C flag and yet you can perform a dedupe action on btrfs is an implementation artifact of btrfs. So I'd rather not work around it in duperemove.

brainchild0 commented 3 years ago

Well, that is your desired behavior which might not necessary be that of other users of duperemove.

My intention was to express what is desired universally, though I now realize that you hold a different opinion. The behavior I described is what I derive from my understanding of the meaning of the flag, though such understanding may be limited. If Btrfs supports CoW for two files with the flag set, then certainly my understanding is incomplete.

For the sake of a practical example, when I ran an operation, the swap file was scanned. Avoiding this particular action is obviously preferred.

brainchild0 commented 3 years ago

For reference, the following appears in the man page for chattr:

A file with the 'C' attribute set will not be subject to copy-on-write updates. This flag is only supported on file systems which perform copy-on-write. (Note: For btrfs, the 'C' flag should be set on new or empty files. If it is set on a file which already has data blocks, it is undefined when the blocks assigned to the file will be fully stable. If the 'C' flag is set on a directory, it will have no effect on the directory, but new files created in that directory will have the No_COW attribute set.)

My interpretation of the description is that the flag intends to represent a file that has never been subjected to CoW. By inference, then, such a file should not be later subjected to de-duplication, because such would have the same unwanted effect.

It is mysterious why Btrfs allows de-duplication on such files. Perhaps its doing so is a bug or flaw. Otherwise, perhaps it is assumed that an application would not perform such an operation except under special circumstances that warrant it.

lorddoskias commented 3 years ago

nocow doesn't preclude dedup. Because nocow allows references on extents to be incremented, as such if you want to dedup 2 files which have +C that change would constitute a single reference count bump of the extent, so no CoW change would happen but only metadata would be changed. And this is a valid scenario.

brainchild0 commented 3 years ago

What is the difference in final result between making a CoW copy of file versus making a full (non-CoW) copy then de-duplicating the two files?

lorddoskias commented 3 years ago

Dedup actually doesn't touch the data portions of a file. The way it works is you send it a range to remap, the kernel performs a byte comparison on the given range and if the range matches it simply changes the metadata of the filesystem. This is enabled by the fact that extents are stored separately from a file. As such dedup doesn't really constitute a CoW operation (not for the data at least, metadata is always CoW but this applies for ordinary ops with +C as well).

And to answer your direct question - there is no difference for the user between CoW vs no CoW.

brainchild0 commented 3 years ago

The most recent comments appear to confirm my previous understanding.

I am led to reiterate the earlier observation.

The current behavior of the application is to generate a result that is forbidden by the constraints given in the definition of the C attribute. Violating this constraint may cause various detrimental effects. It is necessary to consider that a correct implementation of the application respects this constraint, different from the current behavior.

This observation leaves open the question why Btrfs accepts a call for two files that both share this attribute. Resolving this question upstream would help generate a more reliable result for this discussion.

lorddoskias commented 3 years ago

Feel free to start a discussion on the upstream btrfs mailing list but until then I see no problem with the current behavior.

brainchild0 commented 3 years ago

I agree with this assessment. Please keep the issue open. I have posted the question upstream.

kilobyte commented 3 years ago

The C flag is misnamed — it should be called "cow once" rather than "no cow".

What it really does is:

+C files can be snapshotted which causes reflinks, thus there's no reason to disallow reflinking a different way.

brainchild0 commented 3 years ago

Does the application currently attempt to find ways to de-duplicate across files with the attribute versus without it? If so, then the two sets of files might be handled separately, for optimization.