openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

COW cp (--reflink) doesn't work across different datasets: Invalid cross-device link #15345

Open darkbasic opened 1 year ago

darkbasic commented 1 year ago

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version
Kernel Version 6.6.0-rc4
Architecture amd64
OpenZFS Version git branch zfs-2.2-release + 6.6 compatibility patches

Describe the problem you're observing

Reflinking doesn't work across different datasets. Since https://lore.kernel.org/linux-btrfs/cover.1645194730.git.josef@toxicpanda.com/T/#mf251325026fe2e15ed5119856bf654ba4f0d298b btrfs allows reflinking across different subvolumes, so it should be possible to achieve something similar on Linux with ZFS. Not being able to reflink across different datasets vastly reduces the utility of reflinking.

Describe how to reproduce the problem

cp -a --reflink=always /path/to/first/dataset/file /path/to/second/dataset/

Include any warning/errors/backtraces from the system logs

[niko@arch-phoenix ~]$ cp --reflink=always ~/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz .
cp: failed to clone './chromium-117.0.5938.132.tar.xz' from '/home/niko/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz': Invalid cross-device link

P.S. I have been told that --reflink=auto should be able to clone blocks across different datasets, but this isn't the case:

[niko@arch-phoenix ~]$ time cp ~/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz . && rm chromium-117.0.5938.132.tar.xz

real    0m3.136s
user    0m0.000s
sys 0m2.852s
[niko@arch-phoenix .cache]$ time cp yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz . && rm chromium-117.0.5938.132.tar.xz

real    0m0.127s
user    0m0.000s
sys 0m0.127s

It took 3 seconds compared to 0.1 seconds when the dataset was the same, suggesting that reflinking didn't work across datasets.

rincebrain commented 1 year ago

See here.

darkbasic commented 1 year ago

@rincebrain interesting, but somehow it doesn't work according to the time results.

robn commented 1 year ago

--reflink=always calls ioctl(FICLONE). Linux inspects this call before passing it to the filesystem, and will reject it if source and destination files are not on a filesystem with the same superblock (and before 5.18, the same mountpoint).

This is not an OpenZFS bug as such, because if Linux would pass the call down, we would quite happily service it. Working around Linux's check here is extremely difficult, if it's even possible.
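To make the failure mode concrete, here is a minimal sketch (not OpenZFS or coreutils code; `FICLONE` is the Linux ioctl number, and the helper name is hypothetical) of the call `cp --reflink=always` makes, with the kernel's cross-superblock rejection surfacing as `EXDEV`:

```python
import errno
import fcntl

# Linux FICLONE ioctl number: _IOW(0x94, 9, int)
FICLONE = 0x40049409

def try_reflink(src, dst):
    """Attempt a block clone of src into dst via ioctl(FICLONE).

    Returns True on success, False when the kernel or filesystem
    refuses: EXDEV is the "Invalid cross-device link" error from the
    report above; EOPNOTSUPP/ENOTTY/EINVAL mean no clone support.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError as e:
            if e.errno in (errno.EXDEV, errno.EOPNOTSUPP,
                           errno.ENOTTY, errno.EINVAL):
                return False
            raise
```

Note the Linux-side check happens before the filesystem ever sees the request, which is why the error is identical regardless of what OpenZFS could do.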

(the Btrfs example is not really relevant; OpenZFS and Btrfs have a fundamentally different construction and purpose).

--reflink=auto calls the copy_file_range() syscall, which in this case means "make a new file with the same contents as this existing one and I don't care how you do it". Often OpenZFS can service this with a clone, but not always (for many good reasons). If it can't, it'll fall back to a regular content copy.

The call time is not a very good indicator of whether a clone or a copy was done. To tell if it was cloned you currently have to dig around with zdb. But in any case, that's acceptable behaviour, because --reflink=auto (via copy_file_range()) allows it.

Yes, this sucks. We'll keep working on it but it's complicated.

rincebrain commented 1 year ago

btrfs doesn't trip this because its subvolumes show up under the same "mountpoint", which is why this check doesn't bite them.

darkbasic commented 1 year ago

The problem is not only with different datasets, but with snapshots as well:

[niko@arch-phoenix ~]$ cp --reflink=always /home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz .
cp: failed to clone './linux-mainline.tar.gz' from '/home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz': Invalid cross-device link
[niko@arch-phoenix ~]$ mount | grep home
rpool/home on /home type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home@zrepl_20231004_074556_000 on /home/.zfs/snapshot/zrepl_20231004_074556_000 type zfs (ro,relatime,xattr,posixacl,casesensitive)

Accessing a snapshot creates a separate mountpoint, which triggers the same issue.
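The kernel's pre-check roughly boils down to comparing superblock identity, which is observable from userspace as `st_dev` (a small illustration; the function name is ours, and the commented paths are examples from the report above):

```python
import os

def same_superblock(a, b):
    """Linux rejects FICLONE with EXDEV when the two files live on
    different superblocks; st_dev exposes that identity to userspace."""
    return os.stat(a).st_dev == os.stat(b).st_dev

# For the snapshot case above, something like
#   same_superblock("/home/niko/devel/linux-mainline.tar.gz",
#                   "/home/.zfs/snapshot/.../linux-mainline.tar.gz")
# would return False, hence "Invalid cross-device link".
```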

robn commented 1 year ago

Filesystems and snapshots are different datasets. Or, put another way, they're different mounts, and so different superblocks, and so Linux rejects the request.

I understand that this is frustrating, but no amount of pointing it out is going to make a quick fix happen. I'm aware of four possible solutions (or shapes of solutions):

* Linux lifting the restriction
* Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like `cp`)
* Adding `zfs clonefile` or similar command to do clones directly inside OpenZFS
* Significantly modify OpenZFS to use the same superblock for all mounts

I've been quietly exploring all of these options for a few weeks now. They are all difficult and/or complicated, for different reasons, and I also have very little time available to look at it. If you've got some other idea, I'm happy to hear it.

darkbasic commented 1 year ago

My only suggestion is to reach out to Kent Overstreet @koverstreet and ask what his plans for bcachefs are regarding this. Lifting the Linux restriction would obviously be the best course of action, and working with a (soon-to-be) mainline filesystem would be much easier.

darkbasic commented 1 year ago

The reason why --reflink=auto (which in turn calls the copy_file_range() syscall) might not create clones is encryption. We can't clone encrypted blocks across datasets because the key material is partially bound to the source dataset (actually its encryption root). https://github.com/openzfs/zfs/pull/14705 has a start on this.

robn commented 1 year ago

If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context. That's why it didn't work in your case. It can and does create clones in many other situations.

In any case, there's no bug here. These limitations are well understood and will be worked on as time and interest allows.

darkbasic commented 1 year ago

If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context.

That was not my intention and I'm sorry if it felt that way. I'm just trying my best to juggle the relevant information across the two threads so that anybody who stumbles upon either of them will understand what's going on.

That's why it didn't work in your case

I thought that was clear enough, but I've edited my past message to replace "does" with "might".

oromenahar commented 1 year ago

Linux lifting the restriction

I think getting a patch into the Linux kernel to lift the cross-device link restriction could be pretty hard. There is no use case in the kernel and no in-kernel filesystem driver which needs this. I don't think the Linux kernel will accept, for example, a new ioctl flag FICLONEC for cross-device links without any in-kernel use case, and I think the existing flags won't be changed either, because cross-reflinking over different datasets doesn't make sense. I don't know for sure, but I guess so.

darkbasic commented 1 year ago

Unfortunately even bcachefs won't be a suitable candidate, because it already supports reflinking across different subvolumes.

jittygitty commented 1 year ago

@darkbasic Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.

lvd2 commented 1 year ago

cross reflinking over different datasets doesn't make sense. Don't know for sure but I quess so.

From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.

darkbasic commented 1 year ago

Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.

I did write to him and I linked this issue, but he simply replied that bcachefs can already reflink between subvolumes. I guess he's pretty busy with his own stuff.

From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.

Definitely.

TerraTech commented 1 year ago

Personally, I'd be OK with zfs clonefile handling it internally; the barrier to a functional implementation would then not be such a substantial roadblock. I'd surmise this would work as a cross-platform solution as well.

FWIW - I'm just glad that this now exists (think large VM base/fluid type images), even if there are some barriers preventing it from full (envisioned) functionality.

jittygitty commented 1 year ago

@darkbasic Yea, I read some of the Linux kernel dev emails and noticed many jumping on Kent as he was trying to get bcachefs upstream, seemed very "tense and stressful" for him. (skipping linking painful emails)

Interesting quote from Kent: "Right now what I'm hearing, in particular from Redhat, is that they want it upstream in order to commit more resources. Which, I know, is not what kernel people want to hear, but it's the chicken-and-the-egg situation I'm in." (source https://lore.kernel.org/lkml/20230706173819.36c67pf42ba4gmv4@moria.home.lan/ )

Anyway, was really great to see we got a Halloween "present" and looks like Bcachefs was finally merged into the 6.7 kernel by Linus at the end of October! So now Linux has OCFS2, Btrfs, XFS, Bcachefs, and ZFS all supporting reflinks. Will be interesting to do feature/performance benchmark comparisons as well as seeing how the various filesystems do when put on a zfs ZVOL. I wish that ZVOLs had gotten a bit more love, but I digress...

Have any parts of ZFS gotten "rusty" at all over the years?? ( https://lwn.net/Articles/934692/ )

kimono-koans commented 11 months ago

+1

FWIW I have a use case re: httm. I implemented FICLONE for httm specifically to make use of this feature with respect to ZFS.

Personally, I'd be ok with zfs clonefile, handling it internally and the barrier to a functional implementation would not have such a substantial roadblock. I'd surmise this would work as a cross-platform solution as well.

+1

Not that I deserve an opinion re: implementation, but I think a new subcommand/arg zfs clonefile and a library function make more sense the greater the variance from default functionality. If ZFS could allow clones not simply from snapshots to live datasets (something btrfs does), but across datasets (rpool/srv to rpool/program) or to a sub-dataset (rpool/srv to rpool/srv/program), and Linux probably never will, then it makes sense to me to add a new subcommand.

EchterAgo commented 11 months ago

@darkbasic I'm seeing the same on Ubuntu with coreutils 9.1 installed: --reflink=auto does not attempt to call copy_file_range. I am able to make it successfully reflink across datasets by specifying --sparse=never --reflink=auto.

scineram commented 11 months ago

Coreutils should have a --reflink=zfs option that would simply call the ZFS internal clone function and have ZFS do its own internal checks, across datasets or within an encrypted clone family.

EchterAgo commented 11 months ago

For some reason the --sparse=auto detection in coreutils 9.1 fails for me, resulting in cp always trying a sparse copy unless I specify --sparse=never. When doing sparse copies cp does not use copy_file_range. I'm currently trying to build coreutils 9.4 to see if things changed.

EchterAgo commented 11 months ago

From trying to copy a 1GiB file:

lseek(3, 0, SEEK_DATA)                  = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
lseek(3, 0, SEEK_HOLE)                  = 1073741824
lseek(3, 0, SEEK_SET)                   = 0

Seems to me that would indicate there is no hole in the source file, so why does cp still treat it as such?
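Those lseek() calls can be reproduced directly (a small probe, assuming the filesystem supports SEEK_HOLE; the function name is ours): if the first hole is at EOF, the file contains no holes.

```python
import os

def has_holes(path):
    """Mirror the strace above: SEEK_HOLE returns the offset of the
    first hole, which for a fully-allocated file is the file size."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        try:
            first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
        except OSError:
            return False  # filesystem can't report holes
        return first_hole < size
    finally:
        os.close(fd)
```

For the 1 GiB file traced above, SEEK_HOLE returning 1073741824 (the file size) means no holes, which is what makes cp's sparse treatment surprising.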

EchterAgo commented 11 months ago

Newer coreutils versions also added a --debug switch for cp which might be helpful for diagnosing this: https://github.com/coreutils/coreutils/commit/d899f9e3320bb2a4727ca894163f6b104118c973

EchterAgo commented 11 months ago

Ah https://github.com/coreutils/coreutils/commit/879d2180d6b58e7a83312681fbce9a1e841c2ae4

But https://github.com/coreutils/coreutils/commit/4f92de58220226e4a2ddf3475bacee2bae7f0e1d

darkbasic commented 11 months ago

I am able to make it successfully reflink accross datasets by specifying --sparse=never --reflink=auto

You don't use native encryption, right? Otherwise I'm a little bit puzzled because it's not supposed to work with encryption.

P.S. You should disable block cloning altogether until things get figured out because of corruption reports on Gentoo.

EchterAgo commented 11 months ago

I am able to make it successfully reflink accross datasets by specifying --sparse=never --reflink=auto

You don't use native encryption, right? Otherwise I'm a little bit puzzled because it's not supposed to work with encryption.

P.S. You should disable block cloning altogether until things get figured out because of corruption reports on Gentoo.

No encryption but LZ4 compression. I mean reflinking works if I pass the right options, it is the sparse detection that seems to fail in coreutils 9.1. I created the source file using dd if=/dev/urandom of=/path/file bs=1M count=1024, so it is definitely not a sparse file, yet cp treats it as such.
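For reference, that reproduction can be scripted end to end (paths are examples; --reflink=auto falls back to a plain copy on filesystems without clone support, so this is safe to run anywhere):

```shell
# Create a 16 MiB file of random data (definitely no holes), then copy
# it with the workaround options that made cloning work under coreutils 9.1.
dd if=/dev/urandom of=/tmp/reflink-src bs=1M count=16 status=none
cp --sparse=never --reflink=auto /tmp/reflink-src /tmp/reflink-copy
cmp /tmp/reflink-src /tmp/reflink-copy && echo "contents identical"
```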

I know about the data corruption issue, thanks.

strugee commented 11 months ago

I know about the data corruption issue, thanks.

Does anyone have an issue number for those of us who don't?

darkbasic commented 11 months ago

Here it is: https://github.com/openzfs/zfs/issues/15526

mschiff commented 11 months ago
* Linux lifting the restriction
* Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like `cp`)
* Adding `zfs clonefile` or similar command to do clones directly inside OpenZFS
* Significantly modify OpenZFS to use the same superblock for all mounts

I think zfs clonefile would be a good start. I have one more idea:
