njsmith / posy

286 stars 17 forks source link

Figuring out our relationship with the filesystem #12

Open njsmith opened 1 year ago

njsmith commented 1 year ago

Currently we handle all our on-disk storage through the KVFileStore and KVDirStore abstractions. They're both basically key->value mappings, where the value is either a file or a directory respectively. They have per-key locking, and try to implement atomic updates when possible.

But these aren't necessarily the best abstractions for what we need, because I wrote them before knowing how we were going to use them. And also, while I thought they should work on Windows, it turns out there were a few details I was missing (see #4) that makes them pretty fragile, esp. in the presence of operations like filesystem indexing or AV scanning that can randomly open files. And they don't currently have any support for garbage-collecting old data.

So here's a brain dump about what actual KV stores we've ended up with and what properties each one needs.

https://stackoverflow.com/a/57358387/ has some very smart sounding comments about what actually works on Windows, and one claim is that deleting a large file on NTFS can itself be non-atomic, and a badly timed crash could leave the file truncated instead. There's also the newer and undocumented (but public) FILE_RENAME_FLAG_POSIX_SEMANTICS which.... might do something useful? Might need to experiment to figure out what this thing actually does. [Edit: Turns out you need to read the kernel-level documentation. The answer is that it lets you overwrite a destination file that still has open handles. It doesn't, AFAIK, do anything to help with the case where the source file has open handles.]

I'm worried about data integrity; specifically, we currently trust that hash_cache/metadata_cache/wheel_cache/http_cache/EnvForest are normative, so if a truncated entry ended up there then everything could become wedged permanently until someone manually clears out the caches. That would suck. (On the other hand, lack of durability is fine -- if a value gets lost, or gets truncated but we can detect that it's truncated and discard it, then that's OK; these things can all be reconstructed if needed.)

Blob storage

For the ones that store blobs, we might just want to use a full-fledged transactional store, like sqlite or bdb. Trade-offs:

The main alternative is to do something like we're doing, with one on-disk file per value, which has a few challenges.

For integrity, files require either fsync and then atomic rename, or else some sort of checksum verification so we can detect and discard corrupted values. Neither is super attractive... fsync can make writes pretty expensive, and on Windows atomic rename can be thwarted by open handles. (I guess you can sleep and retry?) Though Windows does have CreateHardLink so you could at least link the file into place and then worry about deleting the original tmp file later opportunistically. (And this can even overwrite an existing file if you use NtSetInformationFile + FILE_LINK_INFORMATION + FILE_LINK_REPLACE_IF_EXISTS + FILE_LINK_POSIX_SEMANTICS.) Checksums make writes easy and fast, but then when you open the file again you have to read through the whole thing to validate the checksum, before you know whether you have a file at all. Fast checksums can be very fast (on my laptop even sha256 goes at ~1.6 GB/s according to openssl speed, and I assume crc32c would be even faster), but that's still an extra human-perceptible lag for multi-hundred-megabyte GPU wheels, and extra I/O. (Which might get hidden by caching, if the whole file fits in cache and the OS doesn't activate dropbehind logic for sequential scans and if we're going to read the whole file anyway, like we usually will for artifacts.)

I guess another option to avoid rename on Windows would be: write the file directly to its final name, fsync, and then put a marker file next to it to record that the main file is complete and trustworthy.

Or, we can combine them: write the file to disk, fsync, and then commit a record in sqlite saying that it's there and valid.

Finally, there's the question of garbage collection: for files, I think this is actually pretty easy? Unix and Windows do both support deleting a file while letting current readers continue (for Windows you need a magic POSIX_SEMANTICS flag but it's there).

Tentatively, it seems like we might want to use sqlite for metadata_cache and http_cache, and files for hash_cache and wheel_cache.

Directories

build_store isn't a big deal, because usage is restricted to a single process. We can just create directories and write to them whenever we like. We might specifically want to avoid renaming the directory into place, to avoid Windows issues. If we add multi-threading in the future then we'll probably want some kind of locking. But that's about it.

EnvForest OTOH is... idk, maybe intractable, in two different ways:

njsmith commented 1 year ago

Experimentally, it looks like on Windows:

rbtcollins commented 1 year ago

Windows behaviour varies by version, so you should test on the oldest version you want to support, for drawing inferences about locking etc behaviour.

njsmith commented 1 year ago

I've pretty much been assuming that we don't need to target anything older than evergreen Win 10, since everything older is already eol, and will be even more so by the time we need to worry about long-tail users. If there's some depressing reason that makes that wrong, please let me know :-)

Of course within Win 10 there's still a diversity of versions, but hopefully as long as we avoid the most-cutting-edge stuff we should be fine? And I'm not sure what else we can realistically do, since afaik there's no way to get older win 10 releases for testing.

On Sun, Feb 12, 2023, 02:48 Robert Collins @.***> wrote:

Windows behaviour varies by version, so you should test on the oldest version you want to support, for drawing inferences about locking etc behaviour.

— Reply to this email directly, view it on GitHub https://github.com/njsmith/posy/issues/12#issuecomment-1426999490, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEU42F4EWGO3KMPLYDF6CTWXC5XXANCNFSM6AAAAAAUKEIYEY . You are receiving this because you authored the thread.Message ID: @.***>

rbtcollins commented 1 year ago

Windows 10: yeah, sadly many enterprises won't be on evergreen Windows 10. There is a commonly documented oldest version to support; I suggest you don't support anything older than the Rust minimum version, (which is Windows 7 still)[https://doc.rust-lang.org/nightly/rustc/platform-support.html] :- but something newer than that, though perhaps not 'latest Windows 10'. I've had surprises where folk have a bug and the cause has been 'they work at a company'.

Some more thoughts:

File system atomicity is hard. We spent a lot of time on this in bzr.

I think a useful thing to think about is your threat model. Cache poisoning is a classic way to make everyones life hard. How will you deal with a hostile attacker who inserts malicious content at a known key in your cache?

OS caching. Hah! Windows don't cache like you might think. Assume that every read will be actual IO, except for directory metadata. (This is down to the very different structure for IO: rather than the page cache owning IO, processes own IO. Then when the process is killed, the IO can all get unwound. This is why virus scanners that are scanning files written by a process can cause the process to suffer IO latency even when there is tonnes of memory etc to buffer the files). I (did a talk on this)[https://www.youtube.com/watch?v=qbKGw8MQ0i8].

Fsync: for rustup we don't fsync files most we write: a machine crash can be recovered by removing the toolchain and reinstalling it, and machine crashes are very rare. We do rename-into-place [not per file, but per tree], to avoid partial writes being visible to readers. Files like the config and root metadata files we fsync. This seems to work ok. Rustup is missing some locking, specifically because we don't detect-and-merge concurrent component changes (e.g. add rustfmt and rust-analyzer from separate concurrent invocations of rustup; one will get lost).

rbtcollins commented 1 year ago

Now, less doom and gloom, some suggestions:

You probably wouldn't want such a daemon to be a pseudo-init, though thats a possible version of the design.

rbtcollins commented 1 year ago

(images) The free dev images are certainly just evergreen. Possibly an MSDN membership would get you older versions to test with. Guido Or Steve Dower might be able to arrange something?