openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Pool reports ZFS-8000-8A after zfs send but then clears #16147

gaykitty opened this issue 2 months ago

gaykitty commented 2 months ago

System information

Type Version/Name
Distribution Name NixOS
Distribution Version 23.11
Kernel Version 6.7.10
Architecture x86_64
OpenZFS Version 2.2.3-1

Describe the problem you're observing

When trying to send particular snapshots (as determined by syncoid), the sending pool starts reporting ZFS-8000-8A with an empty file list. The error then clears on its own after 5-30 seconds and multiple runs of zpool status -v. The output of zfs send is also invalid.

Describe how to reproduce the problem

Happens every time; I don't know how the pool got into this state.
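
For reference, the trigger is roughly the following (the snapshot names are the ones from the zstreamdump run further down; the polling loop is just to watch the error clear):

    # incremental send of the affected range; it fails partway through
    zfs send -I 'data/postgres@autosnap_2024-04-30_00:00:02_daily' \
        'data/postgres@autosnap_2024-05-01_01:00:13_hourly' > /dev/null

    # right afterwards the pool reports ZFS-8000-8A with an empty file
    # list; polling shows the error clear on its own after 5-30 seconds
    until zpool status -v data | grep -q 'No known data errors'; do
        sleep 5
    done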

Include any warning/errors/backtraces from the system logs

zpool status -v output after zfs send:

  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:14 with 0 errors on Tue Apr 30 21:01:52 2024
config:

    NAME                          STATE     READ WRITE CKSUM
    data                          ONLINE       0     0     0
     mirror-0                    ONLINE       0     0     0
       scsi-0HC_Volume_30264525  ONLINE       0     0     0
       scsi-0HC_Volume_31118548  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

zpool status output after 5-30 seconds and multiple runs of zpool status -v:

  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 00:03:14 with 0 errors on Tue Apr 30 21:01:52 2024
config:

    NAME                          STATE     READ WRITE CKSUM
    data                          ONLINE       0     0     0
     mirror-0                    ONLINE       0     0     0
       scsi-0HC_Volume_30264525  ONLINE       0     0     0
       scsi-0HC_Volume_31118548  ONLINE       0     0     0

errors: No known data errors

Output of zfs send -I 'data/postgres@autosnap_2024-04-30_00:00:02_daily' 'data/postgres@autosnap_2024-05-01_01:00:13_hourly' | zstreamdump:

BEGIN record
    hdrtype = 2
    features = 0
    magic = 2f5bacbac
    creation_time = 0
    type = 0
    flags = 0x0
    toguid = 0
    fromguid = 0
    toname = data/postgres@autosnap_2024-05-01_01:00:13_hourly
    payloadlen = 0
END checksum = 4e3d8c4e3/13074827678/2629d7a733f5/34974e936a868
BEGIN record
    hdrtype = 1
    features = 4
    magic = 2f5bacbac
    creation_time = 66304295
    type = 2
    flags = 0xc
    toguid = 952f620822cc13b3
    fromguid = 18d0c7688a7b650d
    toname = data/postgres@autosnap_2024-04-30_01:00:05_hourly
    payloadlen = 0
warning: cannot send 'data/postgres@autosnap_2024-04-30_01:00:05_hourly': Input/output error
BEGIN record
    hdrtype = 1
    features = 4
    magic = 2f5bacbac
    creation_time = 663050a2
    type = 2
    flags = 0xc
    toguid = 8e478c81e645c5c2
    fromguid = 952f620822cc13b3
    toname = data/postgres@autosnap_2024-04-30_02:00:02_hourly
    payloadlen = 0
invalid checksum
Incorrect checksum in record header.
Expected checksum = f9770daf2/9c3131f152a/35dc04be10236/d38c2d9e9ccd71
SUMMARY:
    Total DRR_BEGIN records = 3 (0 bytes)
    Total DRR_END records = 1 (0 bytes)
    Total DRR_OBJECT records = 0 (0 bytes)
    Total DRR_FREEOBJECTS records = 0 (0 bytes)
    Total DRR_WRITE records = 0 (0 bytes)
    Total DRR_WRITE_BYREF records = 0 (0 bytes)
    Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
    Total DRR_FREE records = 0 (0 bytes)
    Total DRR_SPILL records = 0 (0 bytes)
    Total records = 4
    Total payload size = 0 (0x0)
    Total header overhead = 1248 (0x4e0)
    Total stream length = 1560 (0x618)
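
For what it's worth, capturing the stream to a file first makes the corruption easy to catch before anything is received. A sketch (same snapshot range as above; the path is arbitrary):

    # save the stream, then verify it instead of piping straight into
    # zfs recv; zfs send itself also exits non-zero on the I/O error
    zfs send -I 'data/postgres@autosnap_2024-04-30_00:00:02_daily' \
        'data/postgres@autosnap_2024-05-01_01:00:13_hourly' > /tmp/pg.zstream
    if zstreamdump < /tmp/pg.zstream 2>&1 | grep -q 'Incorrect checksum'; then
        echo "stream is corrupt; do not feed it to zfs recv"
    fi

In a pipeline like syncoid's, set -o pipefail would likewise make the failed send abort the transfer rather than handing zfs recv a truncated stream.
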
rincebrain commented 2 months ago

Can you share the output of zfs get all on data/postgres? Otherwise it's going to be rather difficult to try to reproduce what's going on here.

gaykitty commented 2 months ago

@rincebrain here ya go:

NAME           PROPERTY              VALUE                            SOURCE
data/postgres  type                  filesystem                       -
data/postgres  creation              Mon Apr 10 23:15 2023            -
data/postgres  used                  221M                             -
data/postgres  available             25.5G                            -
data/postgres  referenced            117M                             -
data/postgres  compressratio         1.71x                            -
data/postgres  mounted               yes                              -
data/postgres  quota                 none                             default
data/postgres  reservation           none                             default
data/postgres  recordsize            8K                               local
data/postgres  mountpoint            /var/lib/postgresql/             local
data/postgres  sharenfs              off                              default
data/postgres  checksum              on                               default
data/postgres  compression           on                               default
data/postgres  atime                 on                               default
data/postgres  devices               on                               default
data/postgres  exec                  on                               default
data/postgres  setuid                on                               default
data/postgres  readonly              off                              default
data/postgres  zoned                 off                              default
data/postgres  snapdir               hidden                           default
data/postgres  aclmode               discard                          default
data/postgres  aclinherit            restricted                       default
data/postgres  createtxg             174                              -
data/postgres  canmount              on                               local
data/postgres  xattr                 on                               default
data/postgres  copies                1                                default
data/postgres  version               5                                -
data/postgres  utf8only              off                              -
data/postgres  normalization         none                             -
data/postgres  casesensitivity       sensitive                        -
data/postgres  vscan                 off                              default
data/postgres  nbmand                off                              default
data/postgres  sharesmb              off                              default
data/postgres  refquota              none                             default
data/postgres  refreservation        none                             default
data/postgres  guid                  970173609522556085               -
data/postgres  primarycache          metadata                         local
data/postgres  secondarycache        all                              default
data/postgres  usedbysnapshots       103M                             -
data/postgres  usedbydataset         117M                             -
data/postgres  usedbychildren        0B                               -
data/postgres  usedbyrefreservation  0B                               -
data/postgres  logbias               throughput                       local
data/postgres  objsetid              274                              -
data/postgres  dedup                 off                              default
data/postgres  mlslabel              none                             default
data/postgres  sync                  standard                         default
data/postgres  dnodesize             legacy                           default
data/postgres  refcompressratio      2.02x                            -
data/postgres  written               1.64M                            -
data/postgres  logicalused           351M                             -
data/postgres  logicalreferenced     225M                             -
data/postgres  volmode               default                          default
data/postgres  filesystem_limit      none                             default
data/postgres  snapshot_limit        none                             default
data/postgres  filesystem_count      none                             default
data/postgres  snapshot_count        none                             default
data/postgres  snapdev               hidden                           default
data/postgres  acltype               off                              default
data/postgres  context               none                             default
data/postgres  fscontext             none                             default
data/postgres  defcontext            none                             default
data/postgres  rootcontext           none                             default
data/postgres  relatime              on                               default
data/postgres  redundant_metadata    all                              default
data/postgres  overlay               on                               default
data/postgres  encryption            aes-256-gcm                      -
data/postgres  keylocation           file:///var/secrets/zfs.key      local
data/postgres  keyformat             raw                              -
data/postgres  pbkdf2iters           0                                default
data/postgres  encryptionroot        data/postgres                    -
data/postgres  keystatus             available                        -
data/postgres  special_small_blocks  0                                default
data/postgres  snapshots_changed     Wed May  1 15:00:06 2024         -
data/postgres  nixos:shutdown-time   Mon Mar 18 12:20:20 AM EDT 2024  inherited from data
rincebrain commented 2 months ago

My assumption would be that something is erroring on decryption during send, since it's not using -w (a raw send), and because it's transient it's vanishing almost immediately, but the error is still breaking your send stream.

zpool events -v might know what it was that it burped on, though probably not why. (My guess would be something strange like it's hitting some metadata object where DVA[0] fails to decrypt but DVA[1] does and somehow native encryption is screwing up error handling so trying DVA[1] doesn't stop it from returning an error? Not sure, I've not seen this reproduce locally, but people have occasionally reported strange hiccups from native encryption that made me wonder if something strange was going on even after 2163cde450d0898b5f7bac16afb4e238485411ff.)
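
Something along these lines should at least pin it to an objset/object; the field names are the usual ereport payload ones, and which event class you'll see is a guess on my part:

    # dump full event payloads and keep the fields that identify what
    # the failed read was on; with native encryption I'd expect
    # ereport.fs.zfs.authentication rather than .checksum
    zpool events -v data | grep -E 'class|zio_objset|zio_object|zio_err'

If the failing read is on this dataset, the objsetid of 274 from your zfs get all output should show up as zio_objset = 0x112.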

gaykitty commented 2 months ago

I was just poking at my system and noticed that a NixOS update had downgraded the kernel version to 6.6.29, but I hadn't noticed and hadn't rebooted. So I rebooted my system. After the reboot my pool showed errors like before. After two scrubs the errors cleared, and doing the zfs send from before no longer reproduces the bug.
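
In case anyone else lands in the same state, the recovery sequence here was roughly this (it took two scrub passes before the error log emptied):

    # scrub twice, checking the error list between passes;
    # -w waits for the scrub to finish (OpenZFS 2.0+)
    zpool scrub -w data && zpool status -v data
    zpool scrub -w data && zpool status -v data

    # then re-test the previously failing incremental send
    zfs send -I 'data/postgres@autosnap_2024-04-30_00:00:02_daily' \
        'data/postgres@autosnap_2024-05-01_01:00:13_hourly' | zstreamdump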