openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zvol snapshot inconsistent with source #15875

Closed by aayushshah15 6 months ago

aayushshah15 commented 6 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 22.04 |
| Kernel Version | 5.15.0-89-generic |
| Architecture | amd64 |
| OpenZFS Version | 2.1.5 |

Describe the problem you're observing

We're observing that a zvol snapshot is not consistent with its source zvol immediately after the snapshot was taken, without any modifications being made to the source zvol.

At a high level, we're creating clones of a base zvol (which contains an ext4-formatted Ubuntu filesystem) to pass off to Firecracker microVMs. Sometimes the microVM detects that its dpkg database is in a corrupt state, and the error points to the /var/lib/dpkg/info/format file being empty.

Manually inspecting the contents of our base zvol (which is unmounted immediately after it is hydrated with our Ubuntu rootfs) confirms that /var/lib/dpkg/info/format is not empty, whereas the same file in a clone of the snapshot is empty.

Describe how to reproduce the problem

Here is a snippet from an Ansible playbook that seems to reliably reproduce the issue:

    - name: Extract the tar file to the zvol
      ansible.builtin.shell:
        cmd: "tar -xf {{ destination_directory }}/{{ image_name.split(':')[1] }}.tar -C /mnt/rootfs-{{ image_tag }}"

    - name: Assert that the zvol contains "1" in the /var/lib/dpkg/info/format file.
      ansible.builtin.shell:
        cmd: "grep -q 1 /mnt/rootfs-{{ image_tag }}/var/lib/dpkg/info/format"
      register: format_file
      until: format_file.rc == 0

    - name: Create a snapshot of the zvol
      ansible.builtin.shell:
        cmd: "zfs snapshot bspool/rootfs-{{ image_tag }}@latest"

    - name: Clone the snapshot
      ansible.builtin.shell:
        cmd: "zfs clone bspool/rootfs-{{ image_tag }}@latest bspool/rootfs-{{ image_tag }}-latest"

    - name: Mount the cloned zvol
      ansible.builtin.shell:
        cmd: "mount -t ext4 /dev/zvol/bspool/rootfs-{{ image_tag }}-latest /mnt/rootfs-{{ image_tag }}-latest"

    - name: Assert that snapshot contains "1" in the /var/lib/dpkg/info/format file.
      ansible.builtin.shell:
        cmd: "grep -q 1 /mnt/rootfs-{{ image_tag }}-latest/var/lib/dpkg/info/format"
      register: format_file
      until: format_file.rc == 0

In this script, the first assertion (second step) succeeds but the second assertion (the last step) doesn't. Are we misunderstanding something here?

rincebrain commented 6 months ago

I'd probably suggest taking another snapshot of the origin a minute or two later, then using zstream dump on an incremental send from the snapshot you cloned to the new one to see what changed. Without more data, my guess would be some weird interaction where the filesystem has a dirty journal at the moment of the snapshot and clears it on mount.
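The diagnostic described above could be sketched roughly as follows. This is a minimal sketch rather than a tested procedure: the function name `inspect_changes` and the second snapshot name `later` are hypothetical, and it assumes the cloned snapshot is `@latest` as in the playbook. It takes a second snapshot of the origin, then pipes an incremental send between the two snapshots into `zstream dump -v` so the per-record metadata can be read.

```shell
#!/bin/sh
# Sketch only: the names below (inspect_changes, "later") are illustrative
# assumptions, not something taken from the original report.

# inspect_changes DATASET FROM_SNAP TO_SNAP
# Takes a new snapshot TO_SNAP of DATASET, then dumps the incremental stream
# from FROM_SNAP to TO_SNAP so the changed records can be inspected.
inspect_changes() {
    dataset=$1
    from=$2
    to=$3

    # Snapshot the origin again, some time after the first snapshot.
    zfs snapshot "${dataset}@${to}" || return 1

    # -i produces an incremental stream between the two snapshots;
    # zstream dump -v prints metadata for each record in that stream.
    zfs send -i "${dataset}@${from}" "${dataset}@${to}" | zstream dump -v
}

# Example (hypothetical dataset name):
#   inspect_changes bspool/rootfs-mytag latest later
```

If the origin really was untouched after the first snapshot, the incremental stream should contain few or no write records; any that do appear indicate which offsets of the zvol changed.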

Something like a zpool sync before the snapshot might be a hacky workaround for your use case atm. Depends what's happening.

aayushshah15 commented 6 months ago

> Something like a zpool sync before the snapshot might be a hacky workaround for your use case atm. Depends what's happening.

Running zpool sync before the clone doesn't seem to help, and since we're using zvols, the output of zstream dump won't be consumable. My sense is that this is a common enough use case that we're likely holding something wrong, as opposed to hitting a real bug. Would appreciate any other pointers.

rincebrain commented 6 months ago

The point of suggesting zstream dump was that it's an easier-to-explain way of doing "look at which parts of the zvol object changed, then go look at what those structures on the filesystem in the zvol contain".

You could also accomplish that by diffing very verbose zdb output, if you really wanted to, but that's going to be literal MB of output for any nontrivially sized zvol.

aayushshah15 commented 6 months ago

Update: we no longer see the inconsistency as long as we unmount the zvol before snapshotting it, so there's likely some (undocumented?) interaction here that was causing the stated behavior.

amotin commented 6 months ago

@aayushshah15 Since you put another file system (ext4) on top of the zvol, it likely has its own write caches, whose contents are not yet visible to the zvol when you snapshot it. You should flush those caches before snapshotting. Ideally, unmount it. Otherwise, on the next mount from the snapshot you may see it in an inconsistent state, as if the system had crashed at that moment.
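The ordering described above (quiesce the ext4 filesystem first, snapshot second) could look roughly like the following. This is a hedged sketch: the function names are hypothetical, and the `fsfreeze` variant is an assumption not mentioned in this thread, offered only for cases where the filesystem has to stay mounted. Only the unmount-then-snapshot order was reported here as resolving the problem.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions, not part of ZFS.

# snapshot_unmounted DATASET MOUNTPOINT SNAPNAME
# Unmounts the ext4 filesystem so its dirty data and journal reach the zvol,
# then snapshots the zvol.
snapshot_unmounted() {
    dataset=$1
    mountpoint=$2
    snap=$3

    umount "$mountpoint" || return 1
    zfs snapshot "${dataset}@${snap}"
}

# snapshot_frozen DATASET MOUNTPOINT SNAPNAME
# Assumed alternative (not from the thread): freeze the mounted filesystem
# with fsfreeze so pending writes are flushed and new ones are blocked,
# snapshot the zvol, then unfreeze.
snapshot_frozen() {
    dataset=$1
    mountpoint=$2
    snap=$3

    fsfreeze --freeze "$mountpoint" || return 1
    zfs snapshot "${dataset}@${snap}"
    rc=$?
    # Always unfreeze, even if the snapshot failed.
    fsfreeze --unfreeze "$mountpoint" || return 1
    return "$rc"
}

# Example (hypothetical dataset and mount point):
#   snapshot_unmounted bspool/rootfs-mytag /mnt/rootfs-mytag latest
```

In the playbook from the original report, the equivalent change would be an unmount step between the first assertion and the `zfs snapshot` task.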

aayushshah15 commented 6 months ago

That makes sense and lines up with what we're seeing, thanks @rincebrain and @amotin. I'll close this issue.