Open sdettmer opened 7 months ago
@jimsalterjrs i think we can remove the whole send/recv check. zfs destroy will error out if the snapshot is in use and if we need to know the reason why it failed we can capture the output.
What do you think? If you are okay with it a will create a PR for it.
@jimsalterjrs i think we can remove the whole send/recv check. zfs destroy will error out if the snapshot is in use and if we need to know the reason why it failed we can capture the output.
What do you think? If you are okay with it a will create a PR for it.
Sounds fine... I knew this was belt-and-suspenders, but did not expect iszfsbusy() to cause so much of a slowdown. Is the slowdown our own fault, or is something upstream performing less-well than it ought to? If the latter, we should probably file a bug upstream despite "fixing" it on our end by not bothering to check.
Thanks for offering to create the PR!
I think the performance impact is just to big when having "empty" snapshots (like sanoid frequently during the nights). Destroying them happens in an instant, but ps -Ac
on a node with lets say 5000 processes takes ~500ms (I just tested on two randomly picked systems).
(Normally with a couple of snaps to be deleted, it should have no big effect, but when new people come to sanioid, they may do so because of an issue, like facing slowness and finding 100.000 snaps, so they fix the rules and then the performance impacts. I think I found exactly one such issue in the internet, where someone proposed a script being "ten times faster" (IIRC) possibly for the same reason.)
But for @phreaker0 I have to state that I have no patch yet! I think changing the default value shouldn't be too difficult and I think could prepare a patch proposal. Or did you already took a look? What do you all think?
Indeed the change seems very straightforward (I hope I did not miss anything).
I see three major approaches and I made a proposal patch for each:
--force-prune
default and add --no-force-prune
as option to restore the old behavior--no-force-prune
is irritating, so it might be renamed to --double-check-prune
--no-force-prune
or --double-check-prune
(doing this later instead of right now possibly could be too much safety, but I work in an industry with 20 years warranty periods so I'm used to think this way :))That are alternatives - of course one and only one could be merged
I'm happy to receive feedback for the choosen option and to improve my proposal according to your review!
I noticed that on my machine sanoid appears a little slow. I reduced snaps to free a bit of space and first run sanoid got a bit to work on. I noticed that deletion takes ~100ms per zfs destroy, but sanoid needs ~500ms for each. I think because iszfsbusy() takes much time, and it is not executed once per data set, but for each snapshot (I think it is a race anyway).
Do I understand correctly
--force-prune
skips that check (and does nothing else)? Under which circumstances does it help?With
--force-prune
, I get more than5 x the speedEDIT: 10 x the speed and I think it is still safe unless the expiry rules would be so short that they would expire recent snapshots which could be used by replication (and then, if send would start a fraction of a second later, would be lost anyway). Do I understand correctly that this check normally should not be needed, and even if zfs send would be running, there would only be an error message but no disturbance? Do we still need this or could we make--force-prune
the "default"?