openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.64k stars 1.75k forks source link

Formal Methods and more Tests for ZFS Code #15565

Open mtippmann opened 11 months ago

mtippmann commented 11 months ago

Describe the feature would like to see added to OpenZFS

Despite all the hard work and review possible data loss bugs seem to creep into the codebase.

I'd love to see the introduction of formal methods in the ZFS codebase. Some possible ways could be using Alloy Analyzer for checking things on a conceptual level - see here for an example: https://dominik-tornow.medium.com/kubernetes-api-server-part-i-3fbaf2138a31 another idea would be using tooling to attempt to prove some properties of the code paths (like locking or dbufs, everything is probably impossible or too complicated) - https://frama-c.com/ looks also promising - there also seems to be work at the linux kernel: https://www.ispras.ru/en/technologies/astraver_toolset/. edit: it's far too complicated / cumbersome.

Additionally it looks like neither the btrfs-tests nor the xfstests are running in the CI at the moment? Maybe adding these could trigger some edge-cases current tests don't hit.

Another thing is coverage of all code paths - there are so many different switches - compression, arc-compression, dmu-sync behavoir, features - testing these systematically and at best online (what happens if parameters get changed online) might also be a worthwhile goal in itself.

How will this feature improve OpenZFS?

It's possible to catch bugs early or in the best case avoid them at all.

Additional context

Of course this is a "if one could dream" request as I lack the knowledge to implement this - and it's additional complexity and costs that need to be financed in some way and maybe in the case of using formal methods like frama-c and proves about properties the payoff might not be worth the work required to implement this - additionally it can only check the assumptions you put it and can't avoid checking the wrong thing or missing something. Maybe there is some middle-ground where the rewards are high enough to warrant the work.

I'd just thought I'd open this ticket here for brain storming - a first step could be making btrfs and xfs tests runnable?

Thanks to everyone working on ZFS - I'm sure the devs are aware that these things exists but considering the uncertain situation regarding data loss bugs at the moment I'd thought I'd just post this here as is.

robszy commented 11 months ago

For me xfstest should be added to CI long way ago especially if we have dataloss bugs like found in gentoo building go: https://github.com/openzfs/zfs/issues/15526 and from 2022.

I didn't see any of dataloss bugs in ext4 or other linux kernel so something is really wrong...

FL140 commented 11 months ago

I didn't see any of dataloss bugs in ext4 or other linux kernel so something is really wrong...

While I agree that there is something really wrong here regarding testing as the last week shows, I also had dataloss bugs on other FS. E.g. XFS where I ended up the FS saying everything is fine while corrupting files leaving them with all zero content. This was the reason for me switching to ZFS in the first place many years ago.

Ironically now we have also 00h chunks corrupted files but much harder to identify corrupted files as those are not for sure all 00h content (AFAI read in the threads).

I can only hope that the events of last week lead to take testing MUCH more serious in the future regarding of the potential impact a FS module can/does have on important data.

At least I really woukd like to see more heavy parallel load testing with standard commands like cp and mv, etc. This are basic functions and one should be able to rely on that this kind of stuff was at least tested with current and upcomming major Linux Releases like Debian, Ubuntu, RedHat, etc.

As the last week shows a simple script like the one that came up last week would have avoided this massive problems we see now.

robszy commented 11 months ago

@FL140 When that was with xfs ? I understand many years ago ?

stuartthebruce commented 11 months ago

Another random idea (if not already the case) to help with catching device performance sensitive issues like race conditions would be to run the full set of CI tests on multiple pools with different performance devices, e.g., HDD, SSD, NVMe and ramdisk.

robn commented 11 months ago

If anyone is interested in a project, the single most useful thing you could probably do is get xfstests running against ZFS again. The work is mostly re-adding awareness of ZFS to the latest version, and updating and bringing our old patches back (see #5481), and ideally getting the whole lot upstream. Its probably a couple of weeks of evenings and likely doesn't require much deep ZFS knowledge.

(Alternatively, let me know if you're interested in sponsoring this work. I desperately want this to happen, and I do poke at it from time to time, but not enough hours in the day).

mtippmann commented 11 months ago

Talked to my teacher on the course where we had formal methods using Ada/Spark and he basically dismissed the idea to use frama-c to proof anything - it's far too cumbersome and complicated.