openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Question about "refcount mismatch" #7204

Closed mabod closed 6 years ago

mabod commented 6 years ago

System information

Type                  Version/Name
Distribution Name     Manjaro
Distribution Version  rolling release
Linux Kernel          4.15.4
Architecture          amd64
ZFS Version           0.7.5-1
SPL Version           0.7.5-1

I found a handful of issues here related to "refcount mismatch", but reading them did not help me understand the urgency of a message like space map refcount mismatch: expected 240 != actual 232

Describe the problem you're observing

My pool zstore has no issues according to zpool status:

27# zpool status zstore
  pool: zstore
 state: ONLINE
  scan: scrub repaired 0B in 4h19m with 0 errors on Sun Feb 11 21:13:50 2018
config:

    NAME                 STATE     READ WRITE CKSUM
    zstore               ONLINE       0     0     0
      mirror-0           ONLINE       0     0     0
        WD-WCC4E5HF3P4S  ONLINE       0     0     0
        WD-WCC4E1SSP28F  ONLINE       0     0     0
      mirror-1           ONLINE       0     0     0
        WD-WCC4E1SSP6NC  ONLINE       0     0     0
        WD-WCC7K7EK9VC4  ONLINE       0     0     0

errors: No known data errors

But zdb -u reports a refcount mismatch:

Uberblock:
    magic = 0000000000bab10c
    version = 5000
    txg = 161284
    guid_sum = 7197309111221853808
    timestamp = 1519192594 UTC = Wed Feb 21 06:56:34 2018
    mmp_magic = 00000000a11cea11
    mmp_delay = 0

space map refcount mismatch: expected 240 != actual 232

I executed zdb -mc zstore in the hope that this would fix the issue, but it doesn't help. The output at the end looks like this:

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading space map for vdev 1 of 2, metaslab 115 of 116 ...
2.93T completed (3378MB/s) estimated time remaining: 0hr 00min 08sec          
    No leaks (block sum matches space maps exactly)

    bp count:         4527224
    ganged count:           0
    bp logical:    3372952921600      avg: 745037
    bp physical:   3245456411136      avg: 716875     compression:   1.04
    bp allocated:  3247705292800      avg: 717372     compression:   1.04
    bp deduped:             0    ref>1:      0   deduplication:   1.00
    SPA allocated: 3247705292800     used: 40.74%

    additional, non-pointer bps of type 0:      51561
    Dittoed blocks on same vdev: 127212

space map refcount mismatch: expected 240 != actual 232

So my questions are:

  1. Is this pool in trouble?

  2. How am I supposed to deal with this? Any cure?

Thanks Matthias

behlendorf commented 6 years ago

Is this pool in trouble?

It depends. This may be harmless if you're running zdb on an active imported pool. To determine for certain if there's an issue make sure you export the pool then run zdb on it and see if it detects the same problem.
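
Concretely, the check might look something like this (just a sketch; zstore stands in for your pool name, and you may need to move the cache file aside so zdb does not read the cached configuration):

    zpool export zstore                                  # take the pool offline
    mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak     # optional: keep zdb from using the cache file
    zdb -ue zstore                                       # -e operates on the exported pool, -u prints the uberblock
    zpool import zstore                                  # bring the pool back online when done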

How am I supposed to deal with this? Any cure?

If you still observe the issue it means that the spacemaps describing the free space in your pool have somehow been damaged. Currently the only guaranteed cure for this is to use zfs send/recv to copy the contents of the existing pool to a newly created pool on different devices.

mabod commented 6 years ago

export the pool then run zdb on it

I don't know what you mean. When I export the pool and run zdb -u zstore, I get:

zdb: can't open 'zstore': No such file or directory

and zdb -ue zstore says:

zdb: can't open 'zstore': File exists

If I move /etc/zfs/zpool.cache aside and try zdb -ue zstore again, it prints the uberblock but still reports the same mismatch:

Uberblock:
    magic = 0000000000bab10c
    version = 5000
    txg = 162321
    guid_sum = 7197309111221853808
    timestamp = 1519278560 UTC = Thu Feb 22 06:49:20 2018
    mmp_magic = 00000000a11cea11
    mmp_delay = 0

space map refcount mismatch: expected 240 != actual 232
mabod commented 6 years ago

I somehow succeeded in getting zdb output from the exported pool. I moved /etc/zfs/zpool.cache aside and ran zdb -eC zstore; it reports the same space map refcount mismatch:

1# zdb -eC zstore

MOS Configuration:
        version: 5000
        name: 'zstore'
        state: 1
        txg: 162390
        pool_guid: 14393956364711311496
        errata: 0
        hostid: 4294967295
        hostname: 'rakete'
        com.delphix:has_per_vdev_zaps
        vdev_children: 2
        vdev_tree:
            type: 'root'
            id: 0
            guid: 14393956364711311496
            children[0]:
                type: 'mirror'
                id: 0
                guid: 6083005762724255950
                metaslab_array: 259
                metaslab_shift: 35
                ashift: 12
                asize: 4000771997696
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 129
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 2998250447430390224
                    path: '/dev/disk/by-vdev/WD-WCC4E5HF3P4S-part1'
                    devid: 'ata-WDC_WD40EZRZ-00WN9B0_WD-WCC4E5HF3P4S-part1'
                    phys_path: 'pci-0000:00:17.0-ata-2'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 130
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 8762276478183427200
                    path: '/dev/disk/by-vdev/WD-WCC4E1SSP28F-part1'
                    devid: 'ata-WDC_WD40EZRZ-00WN9B0_WD-WCC4E1SSP28F-part1'
                    phys_path: 'pci-0000:00:17.0-ata-3'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 131
            children[1]:
                type: 'mirror'
                id: 1
                guid: 3605884340580067442
                metaslab_array: 256
                metaslab_shift: 35
                ashift: 12
                asize: 4000771997696
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 132
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 15809553155304475366
                    path: '/dev/disk/by-vdev/WD-WCC4E1SSP6NC-part1'
                    devid: 'ata-WDC_WD40EZRZ-00WN9B0_WD-WCC4E1SSP6NC-part1'
                    phys_path: 'pci-0000:00:17.0-ata-4'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 133
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 10884614783416580978
                    path: '/dev/disk/by-vdev/WD-WCC7K7EK9VC4-part1'
                    devid: 'ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K7EK9VC4-part1'
                    phys_path: 'pci-0000:00:17.0-ata-1'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 134
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
space map refcount mismatch: expected 240 != actual 232

So you say this pool is in trouble? What next?

mabod commented 6 years ago

This whole thing is making me nervous. And obviously I am not alone; there are more issues raised about the same thing. For example, issue #3094 has been open since Feb. 2015, but nowhere is there an explanation of why this is happening and whether it is dangerous. Possible solutions have also been discussed since 2015; see issue #3111, which is still open.

I am reading all this and cannot get my arms around it. If this space map refcount mismatch is really a problem, and @behlendorf and others suggest it can be, then why have issues been open for it since 2015 without any final decision on what to do about it? On the other hand, if the space map refcount mismatch is not a problem, why aren't the developers saying so, closing all these open issues, and stopping zdb from reporting it in the first place? I am confused about how this issue is handled.

The proposed solution

copy the contents of the existing pool to a newly created pool on different devices.

sounds odd to me. This is not the zero-maintenance experience I was looking for when moving to ZFS.

I want to understand the root cause first before I destroy the pool. I am afraid the issue will happen again shortly after the new pool is in place. In fact, my pool here is just a few weeks old, built with new disks. If it takes ZFS just a few weeks to mess up a pool, it is not worth creating a new one. And issue #6985 shows evidence that it happens even with newly created pools.

I can only think of 2 root causes:

  1. Either this is a hardware issue, and then I need to get new disks, or

  2. this is a ZFS issue and ZFS messed it up. If that is the case, then it is a bug in ZFS.

So what is it?

MyPod-zz commented 6 years ago

This isn't a hardware issue but rather an issue with ZFS. What follows is just my understanding of the matter, which may or may not be correct. There isn't a clear reproducer that can be used to understand why this issue happens, and that's why there's not been much progress in fixing it. There are no tools to perform offline scanning and correction of issues like this that could rebuild the space maps, so the only "fix" is to create a new pool and move all the data to it through zfs send.
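
A rough outline of that migration (only a sketch; newpool and the disk names are placeholders, and details such as incremental streams, properties and mountpoints depend on the setup):

    zfs snapshot -r zstore@migrate                        # recursive snapshot of the old pool
    zpool create newpool mirror <disk1> <disk2> mirror <disk3> <disk4>
    zfs send -R zstore@migrate | zfs recv -F newpool      # replicate datasets, snapshots and properties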

mabod commented 6 years ago

Ok. So it does not look like this will get fixed soon. More than 3 years old and no hope. What a negative surprise! The super-duper, zero-maintenance ZFS filesystem has a nasty bug which can corrupt your data, and nobody is able to fix it. To me, this is a nightmare which even exceeds the btrfs stories I have heard. I can hardly believe it. Sorry if that sounds rude or polemic, but I am currently going through a deep depression ;-)

Ok. Enough whining. Back to the facts and one of my initial questions:

How dangerous is this bug for the data? What must happen so that it causes data corruption? Can somebody please do a risk assessment for this bug?

ebsebs commented 6 years ago

Hit the same problem after playing with zfs on Gentoo (4.14.21/x86-64 kernel, 0.7.6 zfs/spl)

# zpool status
  pool: z
 state: ONLINE
  scan: resilvered 2.93M in 0h0m with 0 errors on Fri Feb 23 21:00:19 2018
config:

    NAME                                   STATE     READ WRITE CKSUM
    z                                      ONLINE       0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0
      mirror-1                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_xxxxxxxx  ONLINE       0     0     0

Trivial to reproduce:

mabod commented 6 years ago

Interesting, @ebsebs! I just realized that my pool - RAID10 - was also imported once with one drive missing. Maybe that introduced the refcount mismatch for me too.

ebsebs commented 6 years ago

I'm really curious what the devs think about this - either I'm doing something weird (which is possible, I've only spent a few days trying zfs out) or I don't understand how this can be called production ready...

mabod commented 6 years ago

I don't understand how this can be called production ready.

This is exactly my point. And what I find really odd is that this bug has been known for several years and I cannot find any risk assessment from any ZFS developer. They have a skeleton in the closet and nobody is talking about it.

sempervictus commented 6 years ago

Good chance it stores more data than any other disk-level FS, including at well-known OS/device/cloud vendors, for the everyday nonsense people keep on their devices and unwittingly clone to these systems on a regular basis. It's more production ready than anything else meeting the capacity and functional requirements it has... Your options for alternatives can be described as limited or laughable, and I personally suggest you look at the commit logs for EXT4 (which does not come close to the capabilities of ZFS) in the Linux kernel, and get a spoon for your humble pie before badmouthing this body of work.

That said, this is annoying, and there have been quite a few public discussions on this (skeletons in closets are hidden, for contrast) - see my own comments in issues and IRC - but it's not a fatal flaw. The space map referenced in the error you're seeing is the free space map: blocks not used, or marked as free during operations (a block which was deleted and is not referenced by any snapshots). This isn't a data error, it's a "how much space do I have available?" error. There are some other implications to this which you can find in the issues you referenced, but in practice this doesn't really kill production-grade systems, for a few reasons. First off, production engineers tend to have release/deployment tiers which detect operational faults even when functional testing in CI does not. Secondly, a few blocks of free space matter less to the production consumer than you'd think - cost is calculated at the $/GB ratio, so 6 blocks of mismatch are not relevant (unless something actually bad happens). Lastly, they manage their pools at the data layer, not the block layer - fragmentation, snapshots, disk failure: all of that eventually makes pools "less than optimal", and their contents can be streamed to other pools, which implicitly defrags and optimizes the data layout. Sort of like pouring a mixture from one glass to another to better distribute the contents.

ZFS can be used for personal systems, home servers, etc. I do, and have for years. It is designed for the datacenter however (which is where we use it to back OpenStack clouds with all their fun and varied needs), and that includes datacenter engineers. The community is making it more user accessible by the day, and adding more interfaces friendly to the educated end-user, but at its core ZFS (like any system with critical responsibilities) is not a fire-and-forget solution which solves all the problems of a domain for you. There are suggestions around for how this can be addressed; if you've the bandwidth to tackle them, it would be awesome to close out all the issues listed.

ebsebs commented 6 years ago

@sempervictus, I know what production means, thank you very much; I work for a company in Mountain View. I've read all previous discussions I could find, and your take on this problem directly contradicts what @behlendorf said elsewhere, namely that this bug can actually cause data loss. I personally couldn't care less if this refcount inconsistency only caused some free space loss or "less than optimal" data layout.

sempervictus commented 6 years ago

It could, in theory, if you manage to map a block as having been freed when it isn't. If this were a serious problem experienced in the field, in prod, it would be addressed by the companies who sell ZFS commercially. Since that's not the case - they're still making money, and have not been sued into the dirt for losing data - one may conclude that it's a nuisance, not a critical bug. We have run petabytes of data through ZoL and BSD, and would be complaining a lot more if it was being eaten. If you have the time, a free space verification tool along the lines of the offline scrub PR would be useful.

mabod commented 6 years ago

@sempervictus: Thank you for this clarification. My concerns regarding this bug stem from the fact that only very few experts have actually assessed the risk associated with it. One of those few was @behlendorf, who wrote in #3094:

if a spacemap were somehow wrong it could result in permanent damage to a file. For example, if the spacemap indicated a block was free when it wasn't then a new write could overwrite existing data in a snapshot.

He actually wrote this in a direct reply to you. For me this contradicts what you are stating here:

this is annoying ... but ... it's not a fatal flaw ... This isn't a data error

I hope you can understand that this is confusing for amateurs like me. Therefore I would very much appreciate it if experts like you, behlendorf, and others came to a final, mutually agreed risk assessment of this issue. That was my intention when I opened this issue.

sempervictus commented 6 years ago

Sorry for the confusion; this is a problem which can, mathematically, result in such a condition. My commentary is intended to point out that practical collisions - a block marked as free when it is not, then allocated to other data - are not common. Situations where you have allocation pressure coupled with these issues could get bad, I imagine. I could see a nearly full pool, doing file erasures and writing other data while having mirror members removed this way, actually cause an issue - if the removal can result in a mismatch claiming a block is free when it is used, as opposed to merely not marking a block free when it is no longer used. May merit some testing to see if we can actually merc data. Betcha that'll light a fire to fix the root cause.

behlendorf commented 6 years ago

@mabod @ebsebs believe me I share your concern about this issue. I'd like nothing more than to get to the root cause of the problem and resolve it. Unfortunately, reproducing this specific issue so it can be understood has proven to be challenging. Despite having good test coverage for this in the test suite (including ztest which verifies the space maps frequently) the issue has not been reproducible. In practice, we also don't see it manifesting itself on the vast majority of systems. It's exceptionally rare given the number of systems deployed. For example, I've never seen it on any of our 1000+ systems which have been running for many years now.

As for implementing workarounds like rebuilding the space maps after they've been damaged there's no doubt that would be a convenient feature. However, we'd much rather get to the underlying root cause and resolve whatever is causing this so that's never needed.

@ebsebs thank you for trying to put together a reproducer for this. You got my hopes up there. Unfortunately I wasn't able to reproduce the issue using your method with 0.7.6 or zfs-0.7.0-340-g4af6873af6 on CentOS 7.4. If someone is able to reproduce this issue in a VM I'd love to dig in to it and get it resolved. But until it can somehow be reproduced in a controlled test environment there's not much which can be done.

I'm unlocking this issue and closing it as a duplicate of #3094. Again, I'm happy to investigate if someone can figure out how to reproduce it reliably.

ebsebs commented 6 years ago

@behlendorf, I didn't mention it, but I had the spacemap_histogram feature enabled by default on the zpool when I was able to reproduce the problem. If I don't have this feature enabled, I cannot reproduce it anymore. What kind of setup would you prefer for a repro? I can try in a Gentoo VM (VMware Fusion).

behlendorf commented 6 years ago

@ebsebs pretty much any distribution I can easily install in a VM locally would be fine. I'm not particularly familiar with Gentoo, so I'd prefer one of the other major distributions with a stock kernel if that's possible. But the key thing would be the specific steps to reproduce it, since then we'll hopefully be able to distill it down into a test case which can be added to the full test suite. Incidentally, I did have the spacemap_histogram feature enabled and couldn't reproduce it.

ebsebs commented 6 years ago

OK, I'll try to come up with something in a VM with loop devices. BTW, I do use an unmodified upstream 4.14.21/x86-64 kernel in Gentoo, if it matters...

behlendorf commented 6 years ago

@ebsebs thanks. For what it's worth, there's nothing too special about using sparse files for vdevs; that might make it a little easier to work up a reproducer.
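
For instance, a throwaway pool on sparse files might be set up roughly like this (a sketch; the file paths, sizes and pool name are placeholders):

    truncate -s 2G /var/tmp/zt1 /var/tmp/zt2 /var/tmp/zt3 /var/tmp/zt4     # sparse backing files
    zpool create testz mirror /var/tmp/zt1 /var/tmp/zt2 mirror /var/tmp/zt3 /var/tmp/zt4
    # ... run the reproduction steps here ...
    zpool destroy testz && rm /var/tmp/zt1 /var/tmp/zt2 /var/tmp/zt3 /var/tmp/zt4   # clean up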

ebsebs commented 6 years ago

So far my attempts to reproduce this using sparse files directly or through loop devices were unsuccessful. Same goes for virtual SCSI disks under VMware Fusion... I'll keep trying

mabod commented 6 years ago

@behlendorf , @ebsebs

I was not able to reproduce it with loop devices, but I was able to reproduce it 3 times with an Arch Linux installation in VirtualBox. I am using VirtualBox 5.2.6 r120293 on Manjaro. Installed in VirtualBox is the Arch Linux Xfce desktop. The steps to reproduce:

0.) Install Arch Linux + Xfce + ZFS in VirtualBox and bring it up to date. Arch Linux is set up to import pools during boot.
1.) In VirtualBox create 4 additional disks for the Arch Linux image: zt1 zt2 zt3 zt4, each 2 GB in size.
2.) Boot Arch Linux and create the pool: zpool create testz mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
3.) Check with zdb -m testz that everything is ok.
4.) Copy some files to the pool: cp /sbin/* /testz
5.) poweroff
6.) In the VirtualBox GUI remove disk zt4 and boot.
7.) zpool status shows a degraded pool. zdb -m testz is ok.
8.) Copy more files: cp /bin/* /testz
9.) poweroff
10.) Add disk zt4 again and boot.
11.) zpool status shows that the pool resilvered automatically.
12.) Delete some files: rm /testz/l*
13.) reboot
14.) zdb -m testz shows space map refcount mismatch: expected 105 != actual 102

I created and destroyed the pool 3 times and could reproduce the behaviour each time. It looks like the issue happens when I delete files from a freshly resilvered pool. The refcount mismatch only shows up after a reboot.

loli10K commented 6 years ago

arch linux is setup to import pools during boot.

@mabod "scan" import or "cachefile" import? Can you please post the full output of zpool history $poolname?

mabod commented 6 years ago

It's cachefile:

History for 'testz':
2018-02-27.22:15:07 zpool create testz mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
2018-02-27.22:16:35 zpool import -c /etc/zfs/zpool.cache -aN
2018-02-27.22:18:26 zpool import -c /etc/zfs/zpool.cache -aN
2018-02-27.22:21:59 zpool import -c /etc/zfs/zpool.cache -aN
2018-02-27.22:25:08 zpool import -c /etc/zfs/zpool.cache -aN

mabod commented 6 years ago

Arch is set up with the following ZFS-related services:

zfs-import-cache.service                                                                 loaded active exited    Import ZFS pools by cache file                                               
zfs-mount.service                                                                        loaded active exited    Mount ZFS filesystems                                                        
zfs-zed.service                                                                          loaded active running   ZFS Event Daemon (zed)                                                       
zfs-import.target                                                                        loaded active active    ZFS pool import target                                                       
zfs.target                                                                               loaded active active    ZFS startup target 
mabod commented 6 years ago

It is even easier to reproduce. Here is the command history of my last tests:

  223  zpool create testz mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
  224  cp /bin/* /testz/
  225  zpool status # all ok
  226  zdb -m testz # all ok
  227  poweroff     # remove drive zt4 and boot
  228  zpool status # see degraded pool
  229  poweroff     # add missing zt4 and boot
  230  zpool status # see resilvered pool
  231  zdb -m testz # all ok
  232  reboot
  233  zpool status # see resilvered pool
  234  zdb -m testz # refcount mismatch showing up

So it has to do with resilver and reboot. There are no file operations involved while the pool is degraded or after it has been resilvered.

richardelling commented 6 years ago

By default, virtualbox virtual disks ignore cache flush commands. Since many filesystems, including ZFS, rely on cache flushes to work, you'll want to enable them and repeat your test. https://www.virtualbox.org/manual/ch12.html#idp47569020608528

rincebrain commented 6 years ago

@richardelling I mean, whether the disks ignore cache flush or not, they should flush completely before VM teardown, so that shouldn't really matter here?

mabod commented 6 years ago

@richardelling I followed your advice on the VirtualBox cache flush, setting VBoxManage setextradata "VM name" "VBoxInternal/Devices/ahci/0/LUN#[0-5]/Config/IgnoreFlush" 0 for all disks. But that somehow breaks ZFS functionality: with this setting ZFS does not resilver anymore. It says that the drive is FAULTED and I have to do a zpool replace instead, which does not produce refcount mismatches.

With the original settings I was able to reproduce the issue with raidz1 as well.

And I also followed @kpande's advice: I exported the pools and removed the cache file before doing zdb -me testz.

richardelling commented 6 years ago

@rincebrain I think if cache flushes are ignored, then all bets are off because there is no way to verify the cache gets flushed.

@mabod this too is troubling. Since others have trouble recreating, perhaps testing with a solution other than VBox can eliminate one variable?

FWIW, I cannot reproduce with VMware Fusion.

mabod commented 6 years ago

I sacrificed one of my JBODs to do the test on my real PC: a JBOD with 4 disks as RAID10, pool name testz as before.

I executed the same steps as before and was able to reproduce the refcount mismatch, although with different ZFS behaviour.

  1. When booting with the degraded pool, the pool is not imported automatically. Instead zfs-import-cache.service gives the following error:
Mär 01 07:44:29 rakete zpool[313]: cannot import 'testz': one or more devices is currently unavailable
Mär 01 07:44:31 rakete systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Mär 01 07:44:31 rakete systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Mär 01 07:44:31 rakete systemd[1]: Failed to start Import ZFS pools by cache file.

I imported the pool manually but did not do any operations on it.

  2. poweroff; attach the missing device again and boot.

The pool is not automatically resilvered. I have to import it manually, and it says that a device is FAULTED:

 pool: testz
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    testz                    DEGRADED     0     0     0
      mirror-0               ONLINE       0     0     0
        sdf                  ONLINE       0     0     0
        sdg                  ONLINE       0     0     0
      mirror-1               DEGRADED     0     0     7
        sdh                  ONLINE       0     0     7
        3899446523646515953  FAULTED      0     0     0  was /dev/sdh1

errors: No known data errors

At this stage the pool shows the refcount mismatch.

zpool export testz
zdb -me testz
...
    metaslab    114   offset   7200000000   spacemap      0   free       4G
    metaslab    115   offset   7300000000   spacemap      0   free       4G

space map refcount mismatch: expected 12 != actual 11
mabod commented 6 years ago

More info:

The pool does not resilver when the device is reattached. So I detached the device from the pool and attached it again. This works: the pool resilvers and is ready. A zpool clear testz was needed.
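
For reference, the detach/attach dance was roughly (a sketch; the device names are placeholders for the actual mirror members):

    zpool detach testz <stale-device>                     # drop the FAULTED copy from the mirror
    zpool attach testz <healthy-device> <stale-device>    # reattach it to its mirror partner, triggering a resilver
    zpool clear testz                                     # clear the old error counters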

But the refcount mismatch persists with higher numbers:

space map refcount mismatch: expected 19 != actual 18

By the way, I did this test multiple times, also with by-id device names instead of sdh, sdi, etc. It makes no difference.

ahrens commented 6 years ago

@mabod The problem reported by "space map refcount mismatch" is with the count of space maps that are using the spacemap_histogram feature flag. An incorrect count here has essentially zero impact on your pool. You can safely continue using the pool without consequence.

We'll investigate the problem. I suspect it has to do with DTLs, which are implemented with space maps and exist on unhealthy pools. The first step will be to determine whether the problem is with the verification code in zdb or with the actual kernel code. Could you send us the output of zdb -dddd <pool>?
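
For example, capturing that for an exported pool and compressing it for upload could look like this (a sketch mirroring what is done below; testz is the test pool):

    zpool export testz
    zdb -edddd testz | gzip > testz-zdb-dddd.txt.gz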

ahrens commented 6 years ago

@kpande I've only skimmed this thread -- can you clarify how the (perhaps spurious) warning message from ZDB is related to a kernel panic (which is obviously much more concerning)?

ahrens commented 6 years ago

@kpande no, the space map refcount mismatch is not at all what freeing free segment or spacemap walk did not equal the allocated space means. They are totally orthogonal. The former is an issue with a feature flag refcount (it happens to be the feature flag for spacemap_histogram), while the latter indicates issues with the contents of the space map (which is much more concerning).

mabod commented 6 years ago

@ahrens : I have exported the pool and uploaded the output of zdb -edddd testz to dropbox:

https://www.dropbox.com/s/pqyc07cqy1d472c/testz-zdb-dddd.txt.gz?dl=0

ahrens commented 6 years ago

@mabod Thanks for that. It looks like what's happened is that a space map object has been leaked. So the spacemap_histogram feature flag refcount is actually correct (there are 22 spacemaps using this feature), but ZDB only located 21 of them. The last one (object 6) doesn't seem to be referenced anywhere. I suspect that it's a DTL object that has been leaked. The leaked object is empty. There is no impact on the pool and you can safely continue using it (or other pools with similar errors). I'll continue investigating to see how this object could be leaked.

I'm assuming this pool was generated with the steps described in your previous comment: https://github.com/zfsonlinux/zfs/issues/7204#issuecomment-369134056

mabod commented 6 years ago

@ahrens : Thanks for the feedback. The output was generated from the JBOD pool I mentioned a few comments later. https://github.com/zfsonlinux/zfs/issues/7204#issuecomment-369494935

Please share with me how you determined the issue with Object 6, since I want to debug my real production pool the same way.

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     6    1   128K     4K      0     512     4K  100.00  SPA space map
                                               320   bonus  SPA space map header
dnode flags: USED_BYTES 
dnode maxblkid: 0

Is this the object you are referring to? And why is Object 6 the "last one"? There are a lot more. And what do you mean when you say "leaked"? What does that term mean in this context?

ahrens commented 6 years ago

@mabod Yes, that's the object. I found it by matching up all the other SPA space map objects with the ones referenced by metaslabs or DTL's; see examples below.

"Leaked" means that this (space map) object in the MOS is not referenced by any other data structures in the pool. So there isn't any way for it to be freed. Assuming that it should have been freed, the only impact is a tiny bit of space (hundreds of bytes) being unnecessarily used.

Below, the spacemap XXX is the object ID of the spacemap for that metaslab:

Metaslabs:
        vdev          0
        metaslabs   116   offset                spacemap          free
        ---------------   -------------------   ---------------   -------------
        metaslab      0   offset            0   spacemap    139   free    16.0G
        metaslab      1   offset    400000000   spacemap      2   free    15.1G
        metaslab      2   offset    800000000   spacemap      3   free    16.0G
        metaslab      3   offset    c00000000   spacemap      7   free    16.0G
        metaslab      4   offset   1000000000   spacemap     10   free    16.0G

And below, the DTL: XXX is the object ID of the DTL for that vdev:

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        58    1   128K    16K  4.50K     512    16K  100.00  packed nvlist
                                                 8   bonus  packed nvlist size
...
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 15069111281487308802
                    path: '/dev/disk/by-id/usb-WDC_WD20_EARX-00PASB0_152D00539000-0:0-part1'
                    devid: 'usb-WDC_WD20_EARX-00PASB0_152D00539000-0:0-part1'
                    phys_path: 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0'
                    whole_disk: 1
                    DTL: 514

If you find the leaked objects on your production pool, could you send me the ZDB output for those objects too? They might provide a clue as to when/how they were leaked.

FYI, I've augmented ZDB to find these kinds of MOS object leaks automatically, and we'll be upstreaming that soon.
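
For anyone who wants to attempt the same matching by hand before that lands, here is a rough sketch against a saved dump file (the file name is an assumption, and it keys off the textual format shown above, so treat it as a heuristic rather than a proper tool):

    ZDB_DUMP=testz-zdb-dddd.txt        # assumed: saved output of "zdb -edddd <pool>"

    # object IDs of every "SPA space map" object in the dump
    grep ' SPA space map$' "$ZDB_DUMP" | awk '{print $1}' | sort -u > all_sm.txt

    # object IDs referenced by metaslabs ("spacemap NNN") or by DTLs ("DTL: NNN")
    grep -o 'spacemap  *[0-9][0-9]*' "$ZDB_DUMP" | awk '{print $2}'  > referenced_sm.txt
    grep -o 'DTL: [0-9][0-9]*'       "$ZDB_DUMP" | awk '{print $2}' >> referenced_sm.txt

    # space map objects that nothing references are candidates for a leaked object
    grep -F -x -v -f referenced_sm.txt all_sm.txt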

mabod commented 6 years ago

I am afraid I cannot handle my own debugging. The zdb -edddd dump for my main pool is 3 GB. Is there any way to reduce the amount of debug info but still find what I am looking for?

ahrens commented 6 years ago

@mabod you can definitely remove everything after the MOS part of that zdb output (unfortunately I don't think there's a good way to tell it to dump the MOS only - that would be a nice, straightforward enhancement). Specifically, if you search for Dataset, you'll see the MOS and then the dataset with the same name as the pool. You can cut everything after that. For example:

...
Dataset mos [META], ID 0, cr_txg 4, 105K, 64 objects, rootbp DVA[0]=<1:400010400:200> DVA[1]=<0:1000010400:200> DVA[2]=<1:1a00010000:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique triple size=800L/200P birth=153L/153P fill=64 cksum=845a87cf3:3615bb07b1d:b4283f1d7d3f:197c12dafded98
...
Dataset testz [ZPL], ID 51, cr_txg 1, 1.86G, 5809 objects, rootbp DVA[0]=<1:40000a400:200> DVA[1]=<0:100000a400:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=148L/148P fill=5809 cksum=1059db9b91:573cfbd47e7:f65ef305870c:1e7a7968d3a443
 ====== remove everything after the above line ===========
...
mabod commented 6 years ago

@ahrens: I created a Python script to strip the zdb output. It takes everything between Dataset mos [META] and Dataset poolname [ZPL]. That still results in a 3 GB file; not much is left out.

Here is the Python script I use. The log file comes in via STDIN; the pool name is a command-line argument:

#!/usr/bin/python

import sys
import re

# Strip a "zdb -dddd" dump (read from STDIN) down to the MOS section:
# keep everything from "Dataset mos [META]" up to and including the
# "Dataset <poolname> [ZPL]" line, and write it to STDOUT.
if len(sys.argv) <= 1:
    sys.stderr.write("pool name missing\n")
    sys.exit(1)

start = r"Dataset mos \[META\]"
end   = r"Dataset " + re.escape(sys.argv[1]) + r" \[ZPL\]"

for line in sys.stdin:
    if re.search(start, line):
        sys.stderr.write("Found start: " + start + "\n")   # diagnostics go to stderr, not into the dump
        sys.stdout.write(line)
        for line in sys.stdin:
            sys.stdout.write(line)
            if re.search(end, line):
                sys.stderr.write("Found end: " + end + "\n")
                sys.exit()
shodanshok commented 6 years ago

@ahrens @mabod any news on the matter? I have two pools which show the very same problem.

The first shows: space map refcount mismatch: expected 370 != actual 328

and the second one: space map refcount mismatch: expected 138 != actual 114

What can/should I report to aid in debugging? From what I read above, the full debug output is going to be massive (on the order of a few GB). Is it possible to estimate how much free space (in bytes or blocks) leaked?

ahrens commented 6 years ago

@shodanshok This message doesn't indicate that space is leaked. There's no practical impact of this bug.

shodanshok commented 6 years ago

@ahrens Thanks for the reply. I was assuming some free space was leaked/lost based on your previous reply:

Assuming that it should have been freed, the only impact is a tiny bit of space (hundreds of bytes) being unnecessarily used

and based on that Reddit reply by txgsync.

So, did I misunderstand something? While the message causes no harm to the pool, I would really like to understand what it means and why it shows up (side note: I also tried with an exported pool, with identical results).

starmood commented 5 years ago

I got the same "space map refcount mismatch" error but can't find any solution yet.
So after a year of discussion, is there any way to get rid of this error now? Creating a new pool is not a viable solution for a production server.

starmood commented 5 years ago

Aha, that is really a good solution... I had some scripts that gather dedup information via zdb, but with this error the return code is always non-zero, which confuses the scripts.
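
For what it's worth, a crude way to keep such a script going despite the non-zero exit status (a sketch; tank and the output path are placeholders, and it assumes the statistics themselves are still printed correctly):

    zdb -DD tank > /tmp/tank-dedup.txt || true    # ignore zdb's exit code so "set -e" scripts don't abort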

jerzydziewierz commented 4 years ago

Bump. I am still hitting this bug as of 2020-06-30.

bonze82 commented 3 years ago

Bump. I have this problem as well on all of my production devices, including disk images used as single-device pools that otherwise have no issues.