darkbasic opened this issue 1 year ago
@rincebrain interesting, but somehow it doesn't work according to the timing results.
`--reflink=always` calls `ioctl(FICLONE)`. Linux inspects this call before passing it to the filesystem, and will reject it if the source and destination files are not on a filesystem with the same superblock (and, before 5.18, the same mountpoint).
This is not an OpenZFS bug as such, because if Linux would pass the call down, we would quite happily service it. Working around Linux's check here is extremely difficult, if it's even possible.
(The Btrfs example is not really relevant; OpenZFS and Btrfs have a fundamentally different construction and purpose.)
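For the curious, this is roughly the call `cp` issues under the hood; a minimal sketch with placeholder paths:

```c
/* Minimal sketch of what `cp --reflink=always` does; paths are placeholders. */
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
    int src = open("/tank/a/file", O_RDONLY);
    int dst = open("/tank/b/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0)
        return 1;

    /* If src and dst live on different superblocks, the VFS rejects
     * this with EXDEV before the filesystem ever sees it. */
    if (ioctl(dst, FICLONE, src) < 0)
        fprintf(stderr, "FICLONE failed: %s\n", strerror(errno));
    return 0;
}
```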
`--reflink=auto` calls the `copy_file_range()` syscall, which in this case means "make a new file with the same contents as this existing one, and I don't care how you do it". Often OpenZFS can service this with a clone, but not always (for many good reasons). If it can't, it'll fall back to a regular content copy.
The call time is not a very good indicator of whether a clone or a copy was done. To tell if it was cloned you currently have to dig around with `zdb`. But in any case, that's acceptable behaviour for `--reflink=auto` (`copy_file_range()` allows it).
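A minimal sketch of that path (assuming glibc ≥ 2.27; paths are placeholders). Whether the kernel services it with a clone or a plain copy is invisible to the caller:

```c
#define _GNU_SOURCE     /* for copy_file_range() */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int src = open("/tank/a/file", O_RDONLY);
    int dst = open("/tank/b/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (src < 0 || dst < 0 || fstat(src, &st) < 0)
        return 1;

    off_t left = st.st_size;
    while (left > 0) {
        /* NULL offsets: use and advance both fds' file offsets.
         * The filesystem may clone, offload, or plain-copy here. */
        ssize_t n = copy_file_range(src, NULL, dst, NULL, left, 0);
        if (n <= 0)
            return 1;   /* real callers fall back to read()/write() */
        left -= n;
    }
    return 0;
}
```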
Yes, this sucks. We'll keep working on it, but it's complicated.
btrfs doesn't trip this because it presents subvolumes as the same "mountpoint", which is why this check doesn't bite them.
The problem is not only with different datasets, but with snapshots as well:

```
[niko@arch-phoenix ~]$ cp --reflink=always /home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz .
cp: failed to clone './linux-mainline.tar.gz' from '/home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz': Invalid cross-device link
[niko@arch-phoenix ~]$ mount | grep home
rpool/home on /home type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home@zrepl_20231004_074556_000 on /home/.zfs/snapshot/zrepl_20231004_074556_000 type zfs (ro,relatime,xattr,posixacl,casesensitive)
```

Accessing a snapshot creates a different mountpoint, which triggers the same issue.
Filesystems and snapshots are different datasets. Or, put another way, they're different mounts, and so different superblocks, and so Linux rejects the request.
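You can see the two superblocks directly: `stat`'s `%d` format prints the device number, and differing device numbers are exactly what the cross-device check keys on (a sketch using the paths from the example above):

```
$ stat -c '%d  %n' /home/niko/devel/linux-mainline.tar.gz \
    /home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz
```

The two device numbers will differ, so any `FICLONE` between them fails with EXDEV.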
I understand that this is frustrating, but no amount of pointing it out is going to make a quick fix happen. I'm aware of four possible solutions (or shapes of solutions):

* Linux lifting the restriction
* Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like `cp`)
* Adding `zfs clonefile` or similar command to do clones directly inside OpenZFS
* Significantly modifying OpenZFS to use the same superblock for all mounts

I've been quietly exploring all of these options for a few weeks now. They are all difficult and/or complicated, for different reasons, and I also have very little time available to look at it. If you've got some other idea, I'm happy to hear it.
My only suggestion is to reach out to Kent Overstreet @koverstreet and ask him what his plans for bcachefs are regarding this. Lifting the Linux restriction would obviously be the best course of action, and working with a (soon-to-be) mainline filesystem would be much easier.
The reason why `--reflink=auto` (which in turn calls the `copy_file_range()` syscall) might not create clones is encryption. We can't clone encrypted blocks across datasets, because the key material is partially bound to the source dataset (actually its encryption root). https://github.com/openzfs/zfs/pull/14705 has a start on this.
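As I understand it, comparing the `encryptionroot` property is the relevant check for whether two datasets could even share key material under that PR's approach (dataset names here are placeholders):

```
$ zfs get -o name,value encryptionroot rpool/home rpool/data
```

If the two values differ, cloning between them is off the table regardless.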
If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context. That's why it didn't work in your case. It can and does create clones in many other situations.
In any case, there's no bug here. These limitations are well understood and will be worked on as time and interest allows.
> If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context.
That was not my intention and I'm sorry if it felt that way. I'm just trying my best to juggle the relevant information across the two threads so that anybody who stumbles upon either of them will understand what's going on.
> That's why it didn't work in your case

I thought that was clear enough, but I've edited my past message to change "does" to "might".
> Linux lifting the restriction
I think getting a patch into the Linux kernel to lift the cross-device link restriction would be pretty hard. There is no use case in the kernel and no kernel (fs) driver which needs this. I don't think the Linux kernel will accept, for example, a new ioctl flag `FICLONEC` for cross-device links without any use case inside the kernel, and I think the existing flags won't be changed either, because cross-reflinking over different datasets doesn't make sense. I don't know for sure, but I guess so.
Unfortunately, even bcachefs won't be a suitable candidate, because it already supports reflinking across different subvolumes.
@darkbasic Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.
> cross-reflinking over different datasets doesn't make sense. I don't know for sure, but I guess so.
From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.
> Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.
I did write to him and I linked this issue, but he simply replied that bcachefs can already reflink between subvolumes. I guess he's pretty busy with his own stuff.
> From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.
Definitely.
Personally, I'd be OK with `zfs clonefile` handling it internally; the barrier to a functional implementation would then not be such a substantial roadblock. I'd surmise this would work as a cross-platform solution as well.
FWIW, I'm just glad that this now exists (think large VM base/fluid-type images), even if there are some barriers preventing it from reaching its full (envisioned) functionality.
@darkbasic Yea, I read some of the Linux kernel dev emails and noticed many jumping on Kent as he was trying to get bcachefs upstream; it seemed very "tense and stressful" for him. (skipping linking painful emails)
Interesting quote from Kent: "Right now what I'm hearing, in particular from Redhat, is that they want it upstream in order to commit more resources. Which, I know, is not what kernel people want to hear, but it's the chicken-and-the-egg situation I'm in." (source https://lore.kernel.org/lkml/20230706173819.36c67pf42ba4gmv4@moria.home.lan/ )
Anyway, it was really great to see we got a Halloween "present": it looks like Bcachefs was finally merged into the 6.7 kernel by Linus at the end of October! So now Linux has OCFS2, Btrfs, XFS, Bcachefs, and ZFS all supporting reflinks. It will be interesting to do feature/performance benchmark comparisons, as well as to see how the various filesystems do when put on a ZFS ZVOL. I wish that ZVOLs had gotten a bit more love, but I digress...
Have any parts of ZFS gotten "rusty" at all over the years?? ( https://lwn.net/Articles/934692/ )
+1
FWIW, I have a use case re: httm. I implemented `FICLONE` for httm specifically to make use of this feature with respect to ZFS.
> Personally, I'd be OK with `zfs clonefile` handling it internally; the barrier to a functional implementation would then not be such a substantial roadblock. I'd surmise this would work as a cross-platform solution as well.
+1
Not that I deserve an opinion re: implementation, but I think a new subcommand/arg `zfs clonefile` and library function make more sense the greater the variance from default functionality. If ZFS could allow clones not simply from snapshots to live datasets (something btrfs does), but across datasets (`rpool/srv` to `rpool/program`) or to a sub-dataset (`rpool/srv` to `rpool/srv/program`), which Linux probably never will, then it makes sense to me to add a new subcommand; a hypothetical invocation is sketched below.
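To make that concrete, something like the following; the subcommand name and syntax are entirely made up, and nothing like it exists today (assuming `rpool/srv` and `rpool/program` are mounted at `/srv` and `/program`):

```
# hypothetical syntax -- no such subcommand exists yet
$ zfs clonefile /srv/program.tar /program/program.tar        # across datasets
$ zfs clonefile /srv/program.tar /srv/program/program.tar    # into a sub-dataset
```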
@darkbasic I'm seeing the same on Ubuntu with coreutils 9.1 installed: `--reflink=auto` does not attempt to call `copy_file_range`. I am able to make it successfully reflink across datasets by specifying `--sparse=never --reflink=auto`.
Coreutils should have a `--reflink=zfs` option that would simply call the ZFS internal clone function and have ZFS do its own internal checks, across datasets or within an encrypted clone family.
For some reason the `--sparse=auto` detection in coreutils 9.1 fails for me, resulting in `cp` always trying a sparse copy unless I specify `--sparse=never`. When doing sparse copies, `cp` does not use `copy_file_range`. I'm currently trying to build coreutils 9.4 to see if things have changed.
From trying to copy a 1 GiB file:

```
lseek(3, 0, SEEK_DATA) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
lseek(3, 0, SEEK_HOLE) = 1073741824
lseek(3, 0, SEEK_SET) = 0
```

Seems to me that would indicate there is no hole in the source file, so why does `cp` still treat it as sparse?
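For reference, the same probe in a few lines of C (path is a placeholder) shows why the trace indicates a fully allocated file:

```c
#define _GNU_SOURCE     /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/path/file", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    off_t data = lseek(fd, 0, SEEK_DATA);  /* first data byte */
    off_t hole = lseek(fd, 0, SEEK_HOLE);  /* first hole after offset 0 */

    /* In the trace above: data at 0, hole at 1073741824 == st_size,
     * i.e. the 1 GiB file contains no holes at all. */
    printf("%s\n", (data == 0 && hole == st.st_size) ? "no holes" : "sparse");
    return 0;
}
```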
Newer coreutils versions also added a `--debug` switch for `cp` which might be helpful for diagnosing this: https://github.com/coreutils/coreutils/commit/d899f9e3320bb2a4727ca894163f6b104118c973
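For example (the exact output wording varies between coreutils versions, but it should report whether reflink, copy offload, and sparse detection were used):

```
$ cp --debug --reflink=auto src.file dst.file
```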
> I am able to make it successfully reflink across datasets by specifying `--sparse=never --reflink=auto`
You don't use native encryption, right? Otherwise I'm a little bit puzzled, because it's not supposed to work with encryption.
P.S. You should disable block cloning altogether until things get figured out, because of the corruption reports on Gentoo.
> > I am able to make it successfully reflink across datasets by specifying `--sparse=never --reflink=auto`
>
> You don't use native encryption, right? Otherwise I'm a little bit puzzled, because it's not supposed to work with encryption.
>
> P.S. You should disable block cloning altogether until things get figured out, because of the corruption reports on Gentoo.
No encryption, but LZ4 compression. I mean, reflinking works if I pass the right options; it is the sparse detection that seems to fail in coreutils 9.1. I created the source file using `dd if=/dev/urandom of=/path/file bs=1M count=1024`, so it is definitely not a sparse file, yet `cp` treats it as such.
I know about the data corruption issue, thanks.
> I know about the data corruption issue, thanks.
Does anyone have an issue number for those of us who don't?
Here it is: https://github.com/openzfs/zfs/issues/15526
> * Linux lifting the restriction
> * Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like `cp`)
> * Adding `zfs clonefile` or similar command to do clones directly inside OpenZFS
> * Significantly modify OpenZFS to use the same superblock for all mounts
I think `zfs clonefile` would be a good start. I have one more idea:

`zfs-clonefile` being its own binary which might be symlinked as `cp` and `mv`, and might either mimic the behavior of the linked command when called with `--reflink`, or wrap `cp`/`mv` and then "capture" the relevant syscalls and turn them into ZFS-internal calls that will create clones successfully. But I am not sure this would even be possible.
**System information**

**Describe the problem you're observing**

Reflinking doesn't work across different datasets. Since https://lore.kernel.org/linux-btrfs/cover.1645194730.git.josef@toxicpanda.com/T/#mf251325026fe2e15ed5119856bf654ba4f0d298b, btrfs allows reflinking across different subvolumes, so it should be possible to achieve something similar in Linux with ZFS. Not being able to reflink across different datasets vastly reduces the utility of reflinking.

**Describe how to reproduce the problem**

```
cp -a --reflink=always /path/to/first/dataset/file /path/to/second/dataset/
```

**Include any warning/errors/backtraces from the system logs**

P.S. I have been told that `--reflink=auto` should be able to clone blocks across different datasets, but this isn't the case: it took 3 seconds, compared to 0.1 seconds when the dataset was the same, suggesting that reflinking didn't work across datasets.