
Allow some zpool commands to timeout #13427

Open h1z1 opened 2 years ago

h1z1 commented 2 years ago

Describe the feature you would like to see added to OpenZFS

It can be extremely helpful to obtain troubleshooting information from a system that is crashing or on the verge of it. The problem is that some tasks will hang indefinitely when they could instead time out.

How will this feature improve OpenZFS?

By allowing a command like zpool events to time out (with an appropriate error), it can be the difference between a complete system failure and recovery. I'd imagine other commands such as zpool status, zpool iostat, etc. are similar.

In zpool history's case there's even a comment in the code (in spa_history_get()) noting that history is logged asynchronously, hence the need for a txg sync:

    /*
     * The history is logged asynchronously, so when they request
     * the first chunk of history, make sure everything has been
     * synced to disk so that we get it.
     */
    if (*offp == 0 && spa_writeable(spa))
        txg_wait_synced(spa_get_dsl(spa), 0);

That logic seems backward. The log should be atomic on creation/submission and async on read, no? Otherwise how is consistency kept if the system is not able to sync? spa_history_lock is held on read.

Additional context

I expect there is no one fix for all, but in the case of zpool history, when the pool is unable to sync due to other issues, the stack becomes:

    [<0>] cv_wait_common+0xb2/0x140 [spl]
    [<0>] __cv_wait_io+0x18/0x20 [spl]
    [<0>] txg_wait_synced_impl+0xdb/0x130 [zfs]
    [<0>] txg_wait_synced+0x10/0x40 [zfs]
    [<0>] spa_history_get+0x29a/0x2e0 [zfs]
    [<0>] zfs_ioc_pool_get_history+0xfe/0x150 [zfs]
    [<0>] zfsdev_ioctl_common+0x7db/0x840 [zfs]
    [<0>] zfsdev_ioctl+0x56/0xe0 [zfs]
    [<0>] do_vfs_ioctl+0xaa/0x620
    [<0>] ksys_ioctl+0x67/0x90
    [<0>] __x64_sys_ioctl+0x1a/0x20
    [<0>] do_syscall_64+0x60/0x1c0
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

It will never return, because the sync it is waiting on can never complete.
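
For illustration, a deadline-bounded variant of txg_wait_synced() would let this path fail with an error instead of blocking forever. This is purely a sketch, not the actual OpenZFS API: the function name and the choice of ETIMEDOUT are made up, and it skips details of the real wait loop (such as kicking the sync thread), but it shows the shape of the change:

    /*
     * Hypothetical sketch: a bounded wait built on the SPL's
     * cv_timedwait(), which returns -1 once the deadline passes.
     * The real txg_wait_synced() waits with an uninterruptible
     * cv_wait_io(), which is why the stack above never unwinds.
     */
    int
    txg_wait_synced_timeout(dsl_pool_t *dp, uint64_t txg, clock_t ticks)
    {
        tx_state_t *tx = &dp->dp_tx;
        clock_t deadline = ddi_get_lbolt() + ticks;

        mutex_enter(&tx->tx_sync_lock);
        if (txg == 0)
            txg = tx->tx_open_txg + TXG_DEFER_SIZE;
        while (tx->tx_synced_txg < txg) {
            if (cv_timedwait(&tx->tx_sync_done_cv,
                &tx->tx_sync_lock, deadline) == -1) {
                mutex_exit(&tx->tx_sync_lock);
                return (SET_ERROR(ETIMEDOUT));
            }
        }
        mutex_exit(&tx->tx_sync_lock);
        return (0);
    }

Callers like spa_history_get() could then turn the timeout into an errno for the ioctl rather than leaving the caller in D state.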

rincebrain commented 2 years ago

I believe this is trickier than one might imagine - see #11082 for how invasive the changes needed may be to discard things in flight.

h1z1 commented 2 years ago

Indeed it won't be a simple switch from everything being sync to async, but there is hope; Lua, for example: https://github.com/openzfs/zfs/pull/8904. Events and history seem like rather simple cases; other things like zpool add/replace/etc. are of course going to need more thought. Maybe the low-hanging fruit is to make the zfs/zpool commands themselves timeout aware? (A rough sketch of that idea follows at the end of this comment.)

The major issue is that you can't kill them from the shell, so they hang a lot of other things. In the case of multiple pools in one system, it can completely kill unrelated ones.

Related is the use of spinlocks, period. A side effect of the failure described above was that the CPU spun to the point the kernel thought it was stuck (the NMI watchdog isn't enabled).
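
As a minimal sketch of that low-hanging fruit, the zpool binary could enforce its own deadline by issuing the blocking ioctl from a child process. Everything here is hypothetical (do_ioctl is a placeholder, not a libzfs function), and it only frees the caller: a child stuck in D state stays stuck, along with whatever kernel state it pins:

    /* Hypothetical userland timeout wrapper -- not part of zfs/zpool. */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    static int
    run_with_timeout(int (*do_ioctl)(void), int timeout_s)
    {
        sigset_t set;
        siginfo_t info;
        struct timespec ts = { .tv_sec = timeout_s, .tv_nsec = 0 };
        pid_t pid;
        int status;

        /* Block SIGCHLD so sigtimedwait() can collect it. */
        sigemptyset(&set);
        sigaddset(&set, SIGCHLD);
        sigprocmask(SIG_BLOCK, &set, NULL);

        if ((pid = fork()) == -1)
            return (errno);
        if (pid == 0)
            _exit(do_ioctl());  /* child issues the blocking ioctl */

        if (sigtimedwait(&set, &info, &ts) == -1 && errno == EAGAIN) {
            /* Child is likely stuck in the kernel and can't be reaped. */
            fprintf(stderr, "timed out after %ds (child pid %d)\n",
                timeout_s, (int)pid);
            return (ETIMEDOUT);
        }
        (void) waitpid(pid, &status, 0);
        return (WIFEXITED(status) ? WEXITSTATUS(status) : EINTR);
    }

This papers over rather than fixes the problem, though: the hung child still holds whatever it was blocked on, so later commands against the same pool would still hang.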

rincebrain commented 2 years ago

The source for mutex.h has a comment about how Linux mutexes don't promise serialization in some edge cases, but the semantics elsewhere in the OpenZFS codebase assume the mutex type does, so they need to implement certain codepaths with a spinlock specifically to provide the additional guarantees.
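
Roughly, the pattern in question (a simplified sketch from memory, not the literal SPL source) looks like this:

    /*
     * Simplified sketch, not verbatim SPL source.  Linux's
     * mutex_unlock() may still touch the mutex after another thread
     * has already acquired it, so a raw spinlock serializes the exit
     * path and the owner bookkeeping that MUTEX_HELD()-style asserts
     * depend on.
     */
    typedef struct {
        struct mutex m_mutex;   /* the underlying Linux mutex */
        spinlock_t m_lock;      /* serializes mutex_exit() */
        kthread_t *m_owner;     /* current holder, for asserts */
    } kmutex_t;

    static inline void
    kmutex_exit(kmutex_t *mp)
    {
        mp->m_owner = NULL;
        spin_lock(&mp->m_lock);     /* serialize the unlock itself */
        mutex_unlock(&mp->m_mutex);
        spin_unlock(&mp->m_lock);
    }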

If you can find a better solution, I'm sure it'd be welcome.


h1z1 commented 2 years ago

A bit of a misuse of words on my part, sorry; I meant they of course have a place but could be tweaked. Would it not make more sense, in the case above for example, to either return an error or avoid the txg sync entirely? There's still an underlying issue with how the pool got into that state, but it would allow some further investigation. S'pose another option is to expose them in procfs?
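
For the history case specifically, the error-return option is easy to sketch (untested; whether EAGAIN is the right errno, and whether a suspended pool is the only state worth catching, are open questions):

    /*
     * Sketch only: in spa_history_get(), fail fast instead of queuing
     * behind a sync that may never complete.  spa_suspended() is the
     * existing check for a pool whose I/O has been suspended.
     */
    if (*offp == 0 && spa_writeable(spa)) {
        if (spa_suspended(spa))
            return (SET_ERROR(EAGAIN));
        txg_wait_synced(spa_get_dsl(spa), 0);
    }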