ocurrent / obuilder

Experimental "docker build" alternative using btrfs/zfs snapshots
Apache License 2.0

Detecting ZFS snapshots #60

Closed patricoferris closed 3 years ago

patricoferris commented 3 years ago

This is an issue to track a potential bug in the ZFS code. I say potential because it could just be something I'm doing wrong, something wrong with ZFS for macOS, or an actual problem. Whichever it is, I thought it best to record it, and I can test it later on other platforms to see which it is. Currently (as part of enabling macOS support with a ZFS backend, #57) I'm taking the implementation for a spin with opam-health-check.

One problem I was finding (and it took a while to diagnose) was that when the builder was building new jobs, it tried to detect existing snapshots so it could restore from them instead of rebuilding, but it could not find them. It would then try to create a new snapshot and fail because, in actual fact, the snapshot was still there.

The code in question is ZFS_store's implementation of getting the result. It would seem that, in a very sporadic and hard-to-reproduce way, the "directory" tank/result/<hash>/.zfs/snapshot could just disappear, or be filled with other things besides the snapshot (snap), at which point the Sys.file_exists check would fail.

I changed the code to use a combination of zfs list and Os.pread to look for the snapshot instead. Something like:

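module Zfs = struct
  (* ... *)

  (* List every snapshot name known to ZFS. The first line of the output is
     the column header, so it is dropped. [_t] and [_ds] are unused for now. *)
  let list_snapshots _t _ds =
    Os.pread ["zfs"; "list"; "-t"; "snapshot"; "-o"; "name"]
    >|= String.split_on_char '\n'
    >|= function
    | _header :: rest -> rest
    | [] -> []
end

(* Return the snapshot's full name for [id] if ZFS reports that it exists. *)
let result t id =
  let ds = Dataset.result id in
  let path = Dataset.full_name t ds ~snapshot:default_snapshot in
  Zfs.list_snapshots t ds >|= fun snaps ->
  if List.exists (String.equal path) snaps then Some path else None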
It's not pretty, and it should definitely be using t and ds to narrow down the search, but it seems to sort everything out for me for now. Again, this is mainly for tracking purposes; I may discover I was doing something wrong and that's the reason.
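As a rough illustration of what "narrowing down the search" could look like, here is a minimal sketch that asks zfs for the snapshots of a single dataset only. The helper names (list_snapshots_cmd, parse_snapshots, has_snapshot) are hypothetical and not part of obuilder; the sketch assumes the standard `zfs list` flags `-H` (no header line) and `-d 1` (limit depth to the given dataset), and it has not been tested against ZFS on macOS.

```ocaml
(* Hypothetical helper: argv for listing only the snapshots of [dataset].
   Assumes "-H" suppresses the header and "-d 1" limits the recursion depth. *)
let list_snapshots_cmd dataset =
  ["zfs"; "list"; "-H"; "-d"; "1"; "-t"; "snapshot"; "-o"; "name"; dataset]

(* Split the command output into snapshot names, ignoring blank lines
   (such as the empty string left after the trailing newline). *)
let parse_snapshots output =
  String.split_on_char '\n' output
  |> List.map String.trim
  |> List.filter (fun line -> line <> "")

(* True if [path] (e.g. "tank/result/abc@snap") appears in [output]. *)
let has_snapshot ~output path =
  List.exists (String.equal path) (parse_snapshots output)
```

In the snippet above, the argv would presumably be passed to Os.pread with the dataset's full name, and the resulting string handed to has_snapshot. With `-H` there is no header line to skip, so the pattern match on the list is no longer needed.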

avsm commented 3 years ago

Is the ZFS backend currently being used on any other operating systems than macOS?

patricoferris commented 3 years ago

To the best of my knowledge, it isn't. I think everything is using btrfs at the moment.

avsm commented 3 years ago

That's helpful -- at least we can unblock the macOS deployment with your workaround, and then try to isolate the issue on FreeBSD or Linux at a later stage to see if it's specific to the ZFS-on-macOS implementation.

patricoferris commented 3 years ago

Yep -- I'm increasingly thinking it is either me or ZFS-on-macOS, as it looks like the stress test https://github.com/ocurrent/obuilder/blob/master/.run-travis-tests.sh#L27 does test zfs and is happy enough (I haven't looked into exactly what the build run does, so I'm assuming it does some "restoring" from snapshots).

patricoferris commented 3 years ago

... turns out it may well just be that on O3X (OpenZFS on OS X) the snap directory is not automatically mounted. I haven't tested this in a build just yet, but trying it on datasets I have lying about suggests that mounting it will fix the problem. If it does, I'll close the issue. Sorry for the noise :))

patricoferris commented 3 years ago

Indeed, mounting the snapshot any time you wish to use it and unmounting it afterward fixes the issues with using Sys.file_exists or reading the log from the snapshot. Seeing as it is a macOS-specific thing, I'm closing this for now.
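For anyone landing here later, the mount-then-unmount pattern can be sketched roughly as below. This is not the code that went into obuilder: with_mounted_snapshot is a hypothetical name, the mount and unmount functions are supplied by the caller (the thread does not show the exact commands used), and the sketch is synchronous, using Fun.protect from the OCaml standard library (4.08+), whereas Lwt-based code would need an equivalent such as Lwt.finalize.

```ocaml
(* Hypothetical helper: mount [snapshot], run [f] on it, and always unmount
   afterwards, even if [f] raises. [mount] and [unmount] are provided by the
   caller; the real commands are an assumption left outside this sketch. *)
let with_mounted_snapshot ~mount ~unmount snapshot f =
  mount snapshot;
  Fun.protect
    ~finally:(fun () -> unmount snapshot)
    (fun () -> f snapshot)

(* Usage with stand-in mount/unmount functions that only print. *)
let () =
  let mount s = Printf.printf "mount %s\n" s in
  let unmount s = Printf.printf "unmount %s\n" s in
  let found =
    with_mounted_snapshot ~mount ~unmount "tank/result/abc@snap"
      (fun s -> Printf.printf "checking %s\n" s; true)
  in
  Printf.printf "found: %b\n" found
```

Wrapping the check this way keeps the unmount next to the mount, so a failing Sys.file_exists check or log read cannot leave the snapshot mounted.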