koverstreet / bcachefs

Various bugs [YellowOnion's Megathread] NOCOW LEAK FOUND #616

Closed YellowOnion closed 7 months ago

YellowOnion commented 8 months ago

nocow lock leak

I think I've found the nocow leak. There's a fix in my bcachefs branch; I'm not sure the solution is correct, but the general idea should work.

data jobs not moving all data

This needs triage. It affects all data jobs and creates annoying interactions (see later). Maybe some interaction between mismatched bucket sizes?

background jobs don't fire BCH_ERR_insufficient_devices

As you know, I only have 5 background devices, and that pesky 5-replica entry is spinning in rebalance now that I've turned off nocow locks. This should be part of rereplicate etc. so we can signal to the user, before it's finished, that some extents couldn't be moved.

rebalance needs to signal fail.

This probably needs at least a tracepoint, but better handling of failure modes would be nice. Maybe mark extents that we can't move so we don't spin on the same 20 extents endlessly.

        /* skip it and continue, XXX signal failure */
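The "mark extents we can't move" idea could look roughly like the sketch below. It is a standalone illustration, not bcachefs code: `extent_sketch`, `pass_stats`, and `rebalance_pass_sketch` are hypothetical names, and the `movable` flag just models "the move would fail". The point is that a failed move sets a skip flag and bumps a failure counter, so the next pass does not retry the same extents and the caller has something to report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are bcachefs types or functions. */
struct extent_sketch {
	unsigned id;
	bool     movable; /* false models a move that cannot succeed */
	bool     skip;    /* set after a failed move; later passes ignore it */
	bool     moved;
};

struct pass_stats {
	size_t moved;
	size_t failed;  /* newly failed in this pass: the "signal" */
	size_t skipped; /* previously failed, not retried */
};

/* One pass over the extents: move what we can, mark what we can't. */
static struct pass_stats rebalance_pass_sketch(struct extent_sketch *e, size_t nr)
{
	struct pass_stats s = { 0, 0, 0 };

	assert(e != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (e[i].moved)
			continue;
		if (e[i].skip) {
			s.skipped++;
			continue;
		}
		if (e[i].movable) {
			e[i].moved = true;
			s.moved++;
		} else {
			/* instead of silently continuing, record the failure */
			e[i].skip = true;
			s.failed++;
		}
	}
	return s;
}
```

A caller would treat a nonzero `failed` count as the place to fire a tracepoint or return an error, and a pass with `moved == 0` and `failed == 0` as "nothing left to do" rather than spinning.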

drop_extra_replicas doesn't drop extra btree replicas (possibly affects data path too)

I suspect we're only downgrading to replicas = replicas, and not doing replicas = 1 when durability on the ptr is 2.
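To make the suspicion concrete, here is a minimal sketch of what counting by durability rather than by pointer count would mean. It is not the bcachefs implementation: `ptr_sketch` and `drop_extra_sketch` are hypothetical, and the first-fit drop order is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an extent pointer; not a bcachefs type. */
struct ptr_sketch {
	unsigned durability; /* e.g. 2 for a device that counts as two replicas */
	bool     dropped;
};

/*
 * Drop pointers while the remaining total durability stays >= want.
 * Returns the number of pointers dropped. First-fit order (assumption).
 */
static size_t drop_extra_sketch(struct ptr_sketch *p, size_t nr, unsigned want)
{
	unsigned total = 0;
	size_t dropped = 0;

	assert(p != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++)
		if (!p[i].dropped)
			total += p[i].durability;

	for (size_t i = 0; i < nr; i++) {
		if (p[i].dropped)
			continue;
		/* written as an addition to avoid unsigned underflow */
		if (total >= want + p[i].durability) {
			p[i].dropped = true;
			total -= p[i].durability;
			dropped++;
		}
	}
	return dropped;
}
```

With two pointers of durability 2 and `want = 2`, this drops one pointer and leaves a single pointer, which is the "replicas = 1 when durability on the ptr is 2" case described above. Counting pointers instead of durability would keep both.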

move deallocate_extra_replicas() to before BCH_ERR_insufficient_devices in foreground path.

YellowOnion commented 7 months ago

Bugs have been found, all my theories were wrong!