oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

ZFS destroy snapshot causes hung_task panic #618

Closed Harvie closed 2 months ago

Harvie commented 6 months ago

Hello, when I have too many (thousands of) old snapshots accumulated for some reason and znapzend tries to destroy them all at the same time, ZFS might take 45+ minutes to finish the operation, with the pool unresponsive the whole time. In the worst case this can trigger a Linux kernel hung_task panic, or leave the system otherwise unresponsive if the hung task timeout is disabled.

I've kinda managed to solve this by using --features=oracleMode, which destroys the snapshots one by one. That gives the OS a few moments to release internal ZFS locks and finish other operations on that pool in the meantime. This seems to prevent the system from becoming unresponsive or panicking.

BUT it makes removing the snapshots ~10 times slower overall. So I was thinking it would be really cool to have a configurable batch size, where I could specify how many snapshots znapzend destroys at the same time (currently we can destroy 1 with oracle mode, or all without oracle mode). I would love to be able to set some arbitrary batch size, like 42, to fine-tune between reliability and performance. I think it would make sense to apply such a limit to all other bulk ZFS operations too (I am not sure if znapzend currently batch-processes anything other than destroy, but you get the point...).
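To make the request concrete, here is a rough sketch of the batching idea in Python. It is not znapzend code: the function names, `batch_size`, `pause_seconds` and `dry_run` are all hypothetical, and it assumes every snapshot belongs to the same dataset and that `zfs destroy` accepts a comma-separated snapshot list after the `@`. Whether a pause between batches actually helps the pool stay responsive is an assumption, based only on the behaviour seen with oracleMode.

```python
import subprocess
import time


def chunk(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_destroy_commands(dataset, snapshot_names, batch_size):
    """Build one `zfs destroy` argv per batch.

    Assumes the dataset@snap1,snap2,... form, where all snapshots
    share the same dataset.
    """
    return [
        ["zfs", "destroy", f"{dataset}@{','.join(batch)}"]
        for batch in chunk(snapshot_names, batch_size)
    ]


def destroy_in_batches(dataset, snapshot_names, batch_size,
                       pause_seconds=1.0, dry_run=True):
    """Run (or, with dry_run, just print) the batched destroy commands.

    pause_seconds is a hypothetical knob meant to give the pool a moment
    between batches; its usefulness is an assumption.
    """
    for argv in build_destroy_commands(dataset, snapshot_names, batch_size):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
            time.sleep(pause_seconds)
```

With `batch_size=1` this matches the one-by-one behaviour described for oracleMode, and with a batch size at least as large as the snapshot count it collapses into a single destroy call, so the setting would cover both existing modes.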

Also, it was not immediately obvious that I might need oracleMode for znapzend to run smoothly on Linux. So having an option called "MaxBatchSize" or something like that might be a bit more self-explanatory.

Harvie commented 6 months ago

Also, zfs destroy has the following option:

       -d  Destroy immediately.  If a snapshot cannot be destroyed now, mark it for deferred destruction.

But I am unable to find any docs on what this actually does. It would be cool if znapzend could just mark the snapshots for later removal and they would actually be removed in the background, during a scrub or something, to alleviate system load. But I have a gut feeling that this does something completely different (like deferring the destroy until the last process stops using that snapshot, without blocking the zfs destroy call).

Harvie commented 6 months ago

Also, I've found there is a com.delphix:async_destroy feature flag that can be set on ZFS. Maybe that might fix the problem as well, but I need to try it.
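For anyone wanting to check the flag first, a small sketch in Python. It assumes the flag is exposed as the pool property `feature@async_destroy` and that `zpool get -H -o value` prints just the state; the function names and the pool name `tank` are placeholders, not anything from znapzend.

```python
import subprocess

# States a pool feature flag is expected to report (assumption based on
# the usual disabled / enabled / active feature states).
KNOWN_STATES = ("disabled", "enabled", "active")


def parse_feature_state(output):
    """Validate the single value printed by `zpool get -H -o value ...`."""
    state = output.strip()
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected feature state: {state!r}")
    return state


def async_destroy_state(pool):
    """Ask zpool for the async_destroy feature state of the given pool."""
    result = subprocess.run(
        ["zpool", "get", "-H", "-o", "value", "feature@async_destroy", pool],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_feature_state(result.stdout)
```

Calling `async_destroy_state("tank")` on a machine with that pool would return one of the three states. Whether having the feature enabled changes the hang described above is still untested here.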

stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Harvie commented 4 months ago

Dear @stale go f*ck yourself.

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.