openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Another deadlock on unmounting filesystem #2301

Closed lukemarsden closed 10 years ago

lukemarsden commented 10 years ago

Complete zpool lockup again. The system was doing some concurrent ZFS operations, possibly on the same filesystem. This time using the latest trusty daily:

hybrid@node002:~/hangs/2014-05-05-hang-after-live-migration$ sudo modinfo zfs |head
filename:       /lib/modules/3.8.0-38-generic/updates/dkms/zfs.ko
version:        0.6.2-2.1~trusty~13.gbp7d6f55
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAC8D11923CD437A7B0573

http://hybridcluster.net/2014-05-05-hang-after-live-migration/syslog.txt (see after timestamp May 5 13:08:18).

behlendorf commented 10 years ago

The zfs utility deadlocked with the txg_sync thread while dropping a hold in the onexit handler. Here's what happened:

task txg_sync:1549

txg_sync_thread
spa_sync
dsl_pool_sync
dsl_sync_task_sync
dsl_dataset_user_hold_sync
dsl_onexit_hold_cleanup
zfs_onexit_add_cb
zfs_onexit_minor_to_state
zfsdev_get_state (blocks on zfsdev_state_lock)
mutex_lock

task zfs:4738

zfsdev_release (holds zfsdev_state_lock)
zfs_onexit_destroy
dsl_dataset_user_release_onexit
dsl_dataset_user_release_tmp
dsl_dataset_user_release_impl
dsl_sync_task
txg_wait_synced (blocks on syncing txg)

The zfs process is waiting for the txg to finish syncing while holding the zfsdev_state_lock. Meanwhile the sync thread can't make progress and finish the sync until the zfs process drops the zfsdev_state_lock. Deadlock.

We probably haven't seen this before because most users haven't made extensive use of zfs holds. That said, these two call paths are basically the same under Illumos so it's not immediately clear to me why no one has reported this issue there. Regardless, we'll need to get it fixed!

dweeezil commented 10 years ago

I have a feeling this is more likely fallout from the restructured sync task code in zfsonlinux/zfs@13fe019 than from more frequent use (directly by users) of holds.

@lukemarsden You mentioned this deadlock occurred during a bunch of concurrent ZFS operations: do I gather correctly that a zfs hold was not one of those operations? Have you got an idea of what operations were being performed? Was a zfs diff or zfs send one of them?

I have a feeling the hold operations involved with this are the ones performed internally.

dweeezil commented 10 years ago

@lukemarsden One other question. After rebooting to recover from the deadlock, are there any temporary snapshots left around (they have names ending with "-" followed by 16 digits)? Also, it might be interesting to see a slice of zpool history -i output from the time period leading up to the deadlock.

EDIT: Github markup initially ate part of my description of temporary snapshot names.

lukemarsden commented 10 years ago

@dweeezil Thanks! Yes, we don't do any zfs holds directly, but we do plenty of zfs send (no zfs diff at this time). You're correct that they were internal holds.

I don't see any temporary snapshots on the affected system, neither with zfs list nor zdb -d <poolname>.

Unfortunately the zpool history -i results have since spooled out of existence; next time I hit a deadlock I'll capture the zpool history -i output just beforehand and include it.

lukemarsden commented 10 years ago

One other fact that is slightly unusual about this pool: it's backed by a file:

hybrid@node002:/var/crash/201405070111$ sudo zpool status
  pool: hpool
 state: ONLINE
  scan: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        hpool               ONLINE       0     0     0
          //hpool-datafile  ONLINE       0     0     0

errors: No known data errors

We're currently stress testing a real partition-backed pool to see whether we can reproduce the same problem there.

lukemarsden commented 10 years ago

Just adding a note from an email thread with @ahrens:

The stack traces in the bug report indicate that the problem is with some Linux-specific driver code (zfsdev_get_state):

May  5 13:08:18 precise64 kernel: [ 2994.761348]  [<ffffffff816f3fda>] mutex_lock+0x2a/0x50
May  5 13:08:18 precise64 kernel: [ 2994.761371]  [<ffffffffa01ba856>] zfsdev_get_state+0x26/0x50 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761391]  [<ffffffffa01bd018>] zfs_onexit_minor_to_state+0x18/0x30 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761410]  [<ffffffffa01bd253>] zfs_onexit_add_cb+0x33/0xf0 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761429]  [<ffffffffa01e7e96>] dsl_onexit_hold_cleanup+0xb6/0x100 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761447]  [<ffffffffa01e7fc6>] dsl_dataset_user_hold_sync+0xe6/0x1b0 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761464]  [<ffffffffa0171e32>] dsl_sync_task_sync+0xf2/0x100 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761480]  [<ffffffffa016a393>] dsl_pool_sync+0x2f3/0x420 [zfs]
May  5 13:08:18 precise64 kernel: [ 2994.761499]  [<ffffffffa0180484>] spa_sync+0x414/0xb20 [zfs]
behlendorf commented 10 years ago

@lukemarsden Yes, that looks like the critical bit. Under Illumos the zfs_onexit_minor_to_state function uses zfsdev_get_soft_state to do this lookup locklessly. That's why Illumos doesn't see this and how we accidentally inherited it.

This code needed to be rewritten for Linux because Illumos relies on support provided by ddi_get_soft_state. That interface isn't provided by the Linux kernel, so we wrote our own implementation, which takes this lock. As an aside, none of the ddi_* functions are available in the Linux kernel. Longer term, it would be good to hide this behind a portable interface.
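
For reference, here is a minimal sketch of the kind of lockless, array-indexed lookup that ddi_get_soft_state provides (illustrative only; the names, bound, and layout are assumptions, not the Illumos or ZoL implementation):

/* Hypothetical sketch of an array-indexed soft-state lookup. */
#define SOFT_STATE_MAX  256                     /* assumed bound */

static void *soft_state[SOFT_STATE_MAX];        /* one slot per minor */

static void *
soft_state_lookup(int minor)
{
        if (minor < 0 || minor >= SOFT_STATE_MAX)
                return (NULL);
        /*
         * Lockless read: safe only because a slot is fully initialized
         * before its minor number is handed out, and is never freed
         * while that minor is in use.
         */
        return (soft_state[minor]);
}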

lukemarsden commented 10 years ago

Thanks @behlendorf! @ryao mentioned he may be able to help us on Friday, so perhaps he could have a crack at that unless you or someone else get there first :-)

behlendorf commented 10 years ago

The issue looks straightforward enough. Likely the best fix is to implement zfsdev_get_soft_state and use that. We just need a developer with some time to get it implemented.

dweeezil commented 10 years ago

I wonder whether it would be worth implementing a few more of the DDI soft state functions in the SPL. That would allow the code to match upstream more closely. It looks like the DDI suite stores the minor states in an array to allow for lock-free reading (which seems like a good idea).
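
For context, the DDI soft state interface being referred to looks roughly like this (approximate prototypes recalled from Illumos sys/sunddi.h; treat the exact signatures as an assumption):

/* Solaris/Illumos DDI soft state interface (approximate prototypes). */
int   ddi_soft_state_init(void **state_p, size_t size, size_t n_items);
int   ddi_soft_state_zalloc(void *state, int item);
void *ddi_get_soft_state(void *state, int item);
void  ddi_soft_state_free(void *state, int item);
void  ddi_soft_state_fini(void **state_p);

Backing these with an array in the SPL is what would let ddi_get_soft_state stay lock-free on the read side, which is the property that matters for this deadlock.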

lukemarsden commented 10 years ago

FWIW, I have reproduced the same issue on dedicated hardware with a zpool mirror on partitions on two physical drives: http://hybridcluster.net/zol-hangs/201405081202/

hybrid@ns410338:/var/crash$ sudo zpool status
  pool: hpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        hpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

errors: No known data errors

This rules out the issue being specific to file-backed pools. Built from the same (latest) version of the source (2c33b91).

@dweeezil - it seems sensible to reimplement these functions for lock-free reading, if that is in fact what's causing these deadlocks. How much work do you think that would be?

dweeezil commented 10 years ago

Another idea that came to mind would be to simply not delete items from the (zfsdev_state) list, but instead to mark them as available for re-use (maybe by setting the minor to -1). That should also allow for lock-free reading. The list isn't likely to ever grow very long, so it won't waste much memory. Although implementing more of the Solaris DDI interface still might be a good idea, this might be a quicker fix.
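
A rough sketch of that idea (illustrative only, not the actual patch; the type and function names here are made up):

/*
 * Entries are appended once and never unlinked, so readers can walk
 * the list without taking zfsdev_state_lock.  A slot whose zs_minor
 * is -1 is unused and may be reclaimed by the next open().
 */
typedef struct zs_entry {
        struct zs_entry *zs_next;       /* forward links only */
        long            zs_minor;       /* -1 when free */
        void            *zs_onexit;
        void            *zs_zevent;
} zs_entry_t;

static zs_entry_t *zs_list_head;

static zs_entry_t *
zs_lookup(long minor)                   /* lock-free reader */
{
        zs_entry_t *zs;

        for (zs = zs_list_head; zs != NULL; zs = zs->zs_next)
                if (zs->zs_minor == minor)
                        return (zs);
        return (NULL);
}

/*
 * Writers (open/close) would still serialize on zfsdev_state_lock;
 * close only marks the slot free instead of unlinking and freeing it.
 */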

dweeezil commented 10 years ago

I hacked up a patch at dweeezil/zfs@cb83c05 which implements the "don't delete" idea. It passes light testing, but I won't have time to review it in greater detail until later this evening.

lukemarsden commented 10 years ago

@dweeezil Thanks! So far this seems to have solved the issue for us. Awesome!

ryao commented 10 years ago

To summarize the problem above, this deadlock is caused by lock inversion. In the abstract, the deadlock occurs because one thread locks A and then B while another locks B and then A. Here, thread 1 is the kernel txg_sync thread, thread 2 is the userland zfs utility, lock A is the transaction sync and lock B is zfsdev_state_lock. @dweeezil's patch fixes this by letting thread 2 avoid holding B across the wait: the data structure that lock protected is instead maintained in a way that is safe for lock-less readers.
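
For readers unfamiliar with the pattern, here is a minimal ABBA illustration (userspace pthreads with placeholder lock names; not ZFS code):

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* "txg sync" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* "zfsdev_state_lock" */

static void *thread1(void *arg)         /* analogue of txg_sync */
{
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);    /* blocks forever if thread2 holds B */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return (arg);
}

static void *thread2(void *arg)         /* analogue of the zfs utility */
{
        pthread_mutex_lock(&lock_b);
        pthread_mutex_lock(&lock_a);    /* blocks forever if thread1 holds A */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return (arg);
}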

This should be a problem on both FreeBSD and Illumos, but it seems to have gone unnoticed until now. Linux's memory allocation routines attempt to reclaim memory under low memory conditions. That has been a source of deadlocks involving swap on ZFS, but it has the secondary effect of causing operations involving memory allocations to randomly take more time than usual. That can make races such as this more likely to trigger a deadlock, which is what I suspect happened here.

@dweeezil's patch changes writes to occur in a way that lets readers simply read without any fear of repercussions. There are plenty of subtle things that happen to be just right to make that work. We currently only traverse forwards, and modification happens under a lock, which functions as a memory barrier on x86. The way that Linux modifies the list means that any readers will either see a change to the list or won't, but we are implicitly relying on the lock to function as a barrier to do that.

I have a few criticisms of this approach:

  1. It would be rather easy to introduce regressions with the code structured this way, because it is not immediately clear that the list must only be traversed forward. I need to think about how this could be done in a way that minimizes that risk, but the patch looks safe to me for production use right now. If we settle on this approach, we should make the list singly linked as a safety precaution and include a comment saying that it must never be traversed in reverse because of the lockless nature of this code.
  2. Atomic instructions are not a memory barrier on Alpha, ARMv7, POWER and IA-64 CPUs. I spot checked a few architectures (POWER and Alpha) and it appears that the mutex code explicitly issues memory barriers on those architectures. However, we will see some very hard to diagnose bugs should that ever not be the case. We should design a splat test that fails only when locking does not function as a full memory barrier (a sketch of the publish/read pairing in question follows below).
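
As a concrete example of the barrier concern in item 2, here is a hedged sketch of publishing a list node without relying on the mutex for ordering (kernel-style pseudocode; smp_store_release()/READ_ONCE() are the usual Linux primitives, provided by the kernel's compiler/barrier headers, and this is not the actual patch):

struct node {
        struct node *next;
        long minor;
};

/*
 * Writer: initialize the node completely, then publish it so the
 * store to *headp is ordered after the initialization.
 */
static void publish(struct node **headp, struct node *n)
{
        n->next = *headp;
        smp_store_release(headp, n);
}

/*
 * Reader: lock-free; the marked load pairs with the release above,
 * so a reader never follows a pointer to a half-built node.
 */
static struct node *head(struct node **headp)
{
        return READ_ONCE(*headp);
}
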
dweeezil commented 10 years ago

I'll add a couple of notes about my patch: I hadn't dealt with the Solaris-compatible list library provided by the SPL and had wrongly assumed it was only singly-linked. I suppose I should have realized it wasn't after using the "delete from tail" operation in the destructor. I was also concerned about the atomicity of the read operations.

Since this list is essentially a substitute for a variable-sized array, I'll rework the patch to do a (singly) linked list on its own without using the list library from SPL.

behlendorf commented 10 years ago

That sounds like a reasonable approach. It would be great if you could open a pull request when you're ready with the reworked patch.

dweeezil commented 10 years ago

@lukemarsden Could you please try the new version in #2323? It implements the same idea but dispenses with the doubly-linked list library and adds some comments to the data structure. I've not developed a reproducer for this problem and have only run basic tests (as was the case with the original patch).