ocurrent / obuilder

Experimental "docker build" alternative using btrfs/zfs snapshots
Apache License 2.0

Rsync hard-links to save space #102

Closed: art-w closed this 2 years ago

art-w commented 2 years ago

Please note that I have NO idea what I'm doing: I'm working under the assumption that files are never modified in place in the store but always copied elsewhere first. If that ain't the case, well, please ignore and close this PR harshly!

I was hoping to save a bit of disk space in the obuilder store when using rsync:

/obuilder/store/result/ $ du -sh *
2.1G    3438610c272ea59ea6c9b0cb93557cc430009e63f3d89d85416af6889fb94150
2.1G    ce0813553200cf888b0efa0cb2b2421dba8aaf4ba2c898d1a7dffc4483700ebd

Here ce0813 was created from 343861 by running sudo ln -f /usr/bin/opam-2.0 /usr/bin/opam. As a full build involves a dozen steps, the copy-everything is eating my disk alive... But by asking nicely, rsync could observe that files from ce0813 are identical to those in 343861 and create hard links to the originals rather than a real copy. This is obviously wrong if either can be updated in place later!
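The caveat about in-place updates can be illustrated with a small sketch. It is not obuilder code, and the file names are hypothetical: it only shows that two hard-linked names share one inode, that writing through either name changes both, and that replacing a name via rename leaves the other name untouched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as store:
    # Hypothetical file names, standing in for the same file in two results.
    original = os.path.join(store, "original")
    linked = os.path.join(store, "linked")

    with open(original, "w") as f:
        f.write("opam-2.0")

    # A hard link is a second name for the same inode: no data is copied.
    os.link(original, linked)
    same_inode = os.stat(original).st_ino == os.stat(linked).st_ino
    link_count = os.stat(original).st_nlink

    # Hazard: opening for writing truncates the shared inode in place,
    # so the change is visible through both names.
    with open(linked, "w") as f:
        f.write("modified")
    with open(original) as f:
        after_in_place = f.read()

    # Safe pattern: write a new file, then rename it over the name.
    # Only "linked" is repointed; "original" keeps the old inode and data.
    replacement = os.path.join(store, "linked.tmp")
    with open(replacement, "w") as f:
        f.write("replaced")
    os.replace(replacement, linked)
    with open(original) as f:
        after_replace = f.read()

print(same_inode, link_count, after_in_place, after_replace)
```

This is why the approach relies on nothing ever writing in place to a file that is already in the store.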

Regarding the rsync arguments:

Anyway, the result makes me sad. Hard links are correctly created for files, but not for directories (because "stuff tends to break when your fs is not a tree"):

/obuilder/store/result/ $ du -sh *
2.1G    3438610c272ea59ea6c9b0cb93557cc430009e63f3d89d85416af6889fb94150
394M    ce0813553200cf888b0efa0cb2b2421dba8aaf4ba2c898d1a7dffc4483700ebd

"Everything is a file, but some files are more files than others."
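The second du figure is smaller because du counts each inode only once per invocation, so files already charged to the first directory are not charged again. A rough sketch of that accounting, under simplifying assumptions: it sums apparent file sizes rather than disk blocks, ignores the directories themselves, and uses hypothetical names and sizes.

```python
import os
import tempfile


def tree_size(root, seen):
    """Sum apparent file sizes under root, skipping inodes already in `seen`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # already counted under an earlier directory
            seen.add(key)
            total += st.st_size
    return total


with tempfile.TemporaryDirectory() as store:
    first = os.path.join(store, "first")
    second = os.path.join(store, "second")
    os.mkdir(first)
    os.mkdir(second)

    # One large file in the first result, hard-linked into the second.
    with open(os.path.join(first, "big"), "wb") as f:
        f.write(b"x" * 1000)
    os.link(os.path.join(first, "big"), os.path.join(second, "big"))

    # One small file that exists only in the second result.
    with open(os.path.join(second, "new"), "wb") as f:
        f.write(b"y" * 10)

    seen = set()
    first_size = tree_size(first, seen)
    second_size = tree_size(second, seen)

print(first_size, second_size)
```

Measured on its own, the second directory would still report its full size. Only the combined total shrinks.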

talex5 commented 2 years ago

Sounds like a good idea to me, but I'm not familiar with this backend.

patricoferris commented 2 years ago

Thanks for this @art-w! With @talex5's help I wrote the rsync backend (mainly for its convenience and also for the macOS port of obuilder).

> I'm working under the assumption that files are never modified in place in the store but always copied elsewhere first.

Yep, that is a correct assumption (or at least it should be!). Anything in result/<hash> should be immutable. The rsync backend is based on the btrfs backend and typically will:

So, from a Linux perspective, I think this change is good. It does come at the added cost of doing a copy rather than a rename, but I think the potential disk space saving is worth it. The rsync backend isn't supposed to be fast. I'll follow up again soon after I rebase and try this with the experimental macOS port (see #87).

art-w commented 2 years ago

Thanks, your explanation does match my intuition! Yes, the original mv result-tmp/xyz result/xyz becomes a two-step "cp" result-tmp/xyz result/xyz ; rm result-tmp/xyz and yes, the copy is much slower than a rename :/
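The two-step "copy, then remove" can be sketched as follows. This is not the backend's code (the PR delegates the work to rsync): `publish` and the directory layout are hypothetical, and the sketch compares file contents only, ignoring the permissions, ownership, timestamps and symlinks that a real tool would also have to respect.

```python
import filecmp
import os
import shutil
import tempfile


def publish(tmp_dir, result_dir, base_dir):
    """Copy tmp_dir to result_dir, hard-linking files whose contents match
    the file at the same path in base_dir, then remove tmp_dir."""
    linked = copied = 0
    for dirpath, _dirnames, filenames in os.walk(tmp_dir):
        rel = os.path.relpath(dirpath, tmp_dir)
        # Directories cannot be hard-linked, so they are always recreated.
        os.makedirs(os.path.normpath(os.path.join(result_dir, rel)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            ref = os.path.normpath(os.path.join(base_dir, rel, name))
            dst = os.path.normpath(os.path.join(result_dir, rel, name))
            # shallow=False forces a byte-by-byte content comparison.
            if os.path.isfile(ref) and filecmp.cmp(src, ref, shallow=False):
                os.link(ref, dst)  # share the inode of the earlier result
                linked += 1
            else:
                shutil.copy2(src, dst)  # new or changed file: real copy
                copied += 1
    shutil.rmtree(tmp_dir)  # the second step: rm result-tmp/xyz
    return linked, copied


with tempfile.TemporaryDirectory() as store:
    base = os.path.join(store, "result", "base")
    tmp = os.path.join(store, "result-tmp", "xyz")
    result = os.path.join(store, "result", "xyz")
    os.makedirs(os.path.join(base, "bin"))
    os.makedirs(os.path.join(tmp, "bin"))

    # Same file in the earlier result and in the new build directory.
    for root in (base, tmp):
        with open(os.path.join(root, "bin", "unchanged"), "w") as f:
            f.write("same bytes")
    # A file that only the new build step produced.
    with open(os.path.join(tmp, "bin", "added"), "w") as f:
        f.write("new in this step")

    counts = publish(tmp, result, base)
    shares_inode = os.path.samefile(
        os.path.join(base, "bin", "unchanged"),
        os.path.join(result, "bin", "unchanged"),
    )
    tmp_removed = not os.path.exists(tmp)

print(counts, shares_inode, tmp_removed)
```

Every file has to be read and compared, which is where the extra time relative to a plain rename goes.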

(Out of curiosity, I'm going to run some tests without --checksum to see how expensive it is, but I don't think it's 100% safe to skip it for obuilder's use case.)

art-w commented 2 years ago

Thanks for testing it out on another platform! I like the idea of letting the user choose the tradeoff so I added the corresponding CLI flag: by default it keeps the original copy behavior, but it's safe to switch to hardlink (back and forth with copy, the store shouldn't care).

On your example, I get 7.9G in 3m25s for copy, 3.1G in 7m35s for hardlink ... and 3.1G in 4m20s for hardlink_unsafe (no checksum, which is the mode I would like to use when developing). Let me know if you would rather not have this third dangerous option!
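Why skipping the checksum is faster but riskier can be shown with a sketch. Without --checksum, rsync's default "quick check" treats two files as unchanged when their size and modification time match, and never reads their contents. The helper names and the test files below are hypothetical, and the fixed timestamp is set by hand to force the collision.

```python
import filecmp
import os
import tempfile


def quick_check_equal(a, b):
    """Size and modification time only; file contents are never read."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def content_equal(a, b):
    """Full content comparison, standing in for a checksum comparison."""
    return filecmp.cmp(a, b, shallow=False)


with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old")
    new = os.path.join(d, "new")

    # Same length, different bytes.
    with open(old, "w") as f:
        f.write("version A")
    with open(new, "w") as f:
        f.write("version B")

    # Force identical timestamps, as a build step that resets mtimes might.
    os.utime(old, (1_000_000_000, 1_000_000_000))
    os.utime(new, (1_000_000_000, 1_000_000_000))

    quick = quick_check_equal(old, new)
    full = content_equal(old, new)

print(quick, full)
```

When the quick check passes but the contents differ, a hard link to the old file would silently replace the new one, which is the failure the "unsafe" label warns about.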