openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.7k stars 1.76k forks source link

Criteria for re-enabling block cloning (toggle `zfs_bclone_enabled`) by default #16189

Closed johnpyp closed 1 month ago

johnpyp commented 6 months ago

(Apologies if this is covered somewhere else, I didn't see any issues or discussions related to it)

Describe the feature would like to see added to OpenZFS

Turn on or remove zfs_bclone_enabled, given that most of the large issues have seemingly been resolved.

How will this feature improve OpenZFS?

It will enable block cloning for everyone, of course! :)

Additional context

So, my understanding is that zfs_bclone_enabled was added as a stop-gap measure to make sure that the corruption related to block cloning, what's the criteria by which ZFS can either enable this by default or remove it?

Following the many recent PRs related to block cloning, it seems that all(?) of the outstanding corruption issues have been resolved (please correct me if this is wrong).

Given the severity of the issue at the time, I can understand the hesitance to re-enable it by default, so I'm mostly wondering what the threshold is. Does it need "bake time"? Does it need higher profile customers using it? Or, is it actually ready to get re-enabled now?

rincebrain commented 6 months ago

You're wrong.

See #16025 and #15615 for some more context.

rrevans commented 6 months ago

@rincebrain Hmm - do you have insights on how those PRs are related to BRT? To my knowledge, both of those are trying to improve llseek vs. fixing real bclone bugs?

rincebrain commented 6 months ago

Unless I've misplaced the thread entirely, they still fell into the header of "yes, it's not a problem with BRT, but triggering the problem in question with BRT is substantially easier", no?

Yes, the problem is not technically BRT's code, but I still wouldn't turn it on until that's moot.

(I could have also pointed to #15139, those were just what came to mind first.)

rrevans commented 6 months ago

For #15526, as I recall, coreutils-9.2 was the main factor give or take? That changed copying behavior so that llseek(SEEK_HOLE/SEEK_DATA) vs. sync ctx races triggered more often. bclone/BRT was a red herring, I believe. To that end, both #16025 and #15615 should be unrelated to BRT and out of scope for enabling bclone again. (But as aside: llseek still has possible data races that need fixing, even on x86.)

As for collecting bclone blockers proper, I'll nominate both #16012 and #16014. There should probably be a github Project created to collect and track the blocking open FRs/PRs/Issues.

To the stability criteria, that's a question to pose to the leadership group. I'd suggest the project needs to set quantifiable success metrics in conversation with the authors and contributors such as:

  1. Some time period without a major bug reported (defined as data loss or other functionality broken), e.g. 90 days.
  2. Specific stress testing activities to perform e.g. a. run ZTS 500 times with bclone enabled by default b. run gentoo reproducer workload for 200 hours
  3. Gather some number of reports from users deploying to important workloads for some length of time e.g. 90 days in use for a VM hosting workload, 180 days in use as a file server workload, etc.
rincebrain commented 6 months ago

15526 technically isn't coreutils-related at all, because Gentoo wrote their own bespoke copy function in portage, IIRC, so it would reproduce without new coreutils as well.

IIRC, that one was primarily the problem with zn_has_cached_data being racey, which bites you in both the llseek and not cases, since zpl_clone_range depends on the same check as zfs_holey_common (the backing function for SEEK_HOLE/SEEK_DATA checks).

rincebrain commented 6 months ago

Hey @robn @thesamesam I imagine you both have opinions on this.

johnpyp commented 6 months ago

Thanks both for the info, agree on the suggestion for a github project to track the ongoing blockers... I did not realize there were so many (potentially) blocking issues, and will stop turning it back on on my pools 😓

That kind of stabilization criteria sounds good too. I'd hope for point 3 that this can be timeboxed appropriately and have explicit buy-in from companies/whoever, such that it doesn't end up stretching out over long periods of time needlessly due to flaking on the testing / deprioritized the work / etc.

MichaelDietzel commented 6 months ago

I maybe have an idea how I could do some stress testing: A few months ago I implemented a simple rust tool for block based deduplication using FIDEDUPERANGE. It is still more or less experimental and needs a lot of polishing but it works for me and it is pretty fast. I mostly like to use it to deduplicate vm-images on a zvol formatted with xfs, so there are many blocks being deduplicated. My hope was to continue working on it once ZFS block cloning is stable and get rid of the extra xfs-layer. I'd also like to add cross-dataset-deduplication and (if supported by ZFS) deduplication between datasets and zvols. Or maybe even deduplicating between already existing snapshots? So I cannot use FIDEDUPERANGE for this. But what else do I use? Maybe someone can give me a hint where to start. Would I need to modify and compile ZFS by myself, or is there some kind of interface that ZFS provides that I could use with my "external" tool? I'd prefer an interface similar to FIDEDUPERANGE that doublechecks that the data are really identical so that I cannot do anything stupid or maybe just get it wrong because of a race condition. Should I maybe open a separate issue for this? Thanks!

robn commented 6 months ago

@MichaelDietzel #11065 is the existing FR for FIDEDUPERANGE, and #15393 has an prototype implementation. I suspect that once finished, it would be a small lift to wire up an alternate non-kernel entry point to support different kinds of targets. We need the functionality first though, so those tickets are probably where to focus your efforts initially.

robn commented 6 months ago

Since I was asked directly: I personally consider block cloning good/safe enough to be enabled since 2.2.4. A lot of work has been done to stabilise it since 2.2.2 especially; I am not aware of any substantial outstanding issues directly involving block cloning; I have it enabled on the few machines I maintain; I have suggested to certain sysadmins that know and trust me that it's safe-enough to enable; and, were I still in charge of a fleet of 100+ machines (previous life), I would be enabling it there.

However, I recognise that I'm in a position that many others are not, in that I run my systems with a ton of redundancy and I have the skills to understand and fix problems when they come up, so I can afford to be somewhat blasé about. That said, I've not had to roll up my sleeves in this way on anything related to block cloning.

So I stop short of a full-throated recommendation, but if asked I do say "it's probably fine". If there was a rough show-of-hands consensus vote on whether or not to enable it for 2.2.5 and I was at the table, I would say yes.

(I have no idea if such a vote is useful or wanted, or if I would be invited; I'm just trying to get across where my gut feel is).

I agree that more specific process for deciding on stability generally would be great, and I'll support anyone who wants to push on getting buy-in and implementing it, but I won't be taking point on that kind of project this year at least - too many other plates in the air at the moment. In the absence of such a thing, I suspect something close to "rough show-of-hands consensus vote" is about the best we can hope for at the moment.

tonyhutter commented 6 months ago

While not a bugfix, we may want to wait for https://github.com/openzfs/zfs/pull/16208 before re-enabling block cloning

behlendorf commented 6 months ago

There should probably be a github Project created to collect and track the blocking open FRs/PRs/Issues.

Apparently GitHub doesn't want to create new projects today so I've created a BRT label instead and applied it.

elahn commented 5 months ago

For reference, to turn on block cloning on Linux for all pools with feature@block_cloning enabled, the "tunable" is a zfs kernel module parameter: /sys/module/zfs/parameters/zfs_bclone_enabled

See: Performance and Tuning / Module Parameters

cburroughs commented 4 weeks ago

To clarify, are all of the fixes that make 91bd12dfeb9abe432530615eaf2295d76da24b54 true already present in an existing release?