kevin-hanselman / dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.
https://kevin-hanselman.github.io/dud/
BSD 3-Clause "New" or "Revised" License
183 stars 8 forks source link

Improve data integrity for misbehaving users #133

Closed ofekd closed 1 year ago

ofekd commented 1 year ago

Please acknowledge the following

Describe the bug A user that writes to a file mid commit and uses the link strategy may end up with a mismatch between the saved checksum and actual checksum of the stored file. This can be prevented by os.Renameing the file to a temporary location before computing the checksum.

A similar issue can happen with the copy strategy where the saved checksum will be correct but the file will be corrupt. This can be fixed by moving first, then copying on checkout.

I am not sure if the user should be warned that data integrity is impossible to guarantee in cases where atomic rename is not available.

System information

Output of dud version:

dud version
v0.4.1-1-g5db6164

Output of uname -srmo:

uname -srmo
Linux 6.1.2-arch1-1 x86_64 GNU/Linux

If you're building from source, please provide your Go version.

go version
go version go1.19.5 linux/amd64

Steps to Reproduce Steps to reproduce the behavior. Ideally this a copy-paste-able shell script (or set of small scripts) that reproduces the problem.

mkdir reproduction
# beware this will output a large file (so we'll have time to mess with it while it's being committed)
dd if=/dev/random of=random.psd bs=4M count=5000
dud init
dud stage gen -o random.psd | tee random.yaml
dud stage add random.yaml
dud commit 
# meanwhile, in another terminal....
dd if=/dev/random of=random.psd bs=1 seek=190 count=1 conv=notrunc
rm random.psd
dud checkout -c

Expected behavior All checksums should match and dud checkout -c should not error out

ofekd commented 1 year ago

Forgot to mention this, but removing write permissions (by locking) for the duration of the commit is another possible solution, that can provide a middle ground between moving first and moving last

kevin-hanselman commented 1 year ago

Thanks for the feedback! This is definitely something I want to address. I think the write protection (read-only) approach is where I'd start. I think it's simpler to implement, and it would provide a better user experience ("Why is my file gone?" vs. "Why is my file read-only?").

ofekd commented 1 year ago

Yes, I agree. It is unfortunate that locks on Linux are advisory only but that's what we have, I guess.

kevin-hanselman commented 1 year ago

As I've dug into this edge case, it appears prohibitively difficult to adequately prevent the issue.

First, as you rightly mention, file locking in UNIX (with flock) is not strictly enforced; it expects processes to respect one another's "advisory" locks. In reality, this won't fix most cases, such as the example in the original post. A form of "mandatory" locking exists, but it requires a special filesystem mount option and is deemed "unreliable" by the Linux kernel developers. (See "Mandatory locking" on this manpage.)

Second, if a process already has a file open for writing, neither changing said file's permissions to read-only nor renaming the file will cause the writing process to fail and/or stop. I tried both of these things, and in retrospect, I should've known these approaches were doomed to fail. This is Linux/UNIX working as designed.

A third option would be to use a heuristic in Dud to see if a file is being written to by something else (e.g., read the same N bytes of a file and see if they change). This is a poor solution because such a heuristic can fail in endless ways -- How big is N? The first N, last N? How long between reads? -- or it will be too slow to be worth it (e.g. checksum the whole file again, immediately after commit).

A fourth option -- and in my opinion, the only suitable one -- would be to use something like lsof to check if to-be-committed files are opened by other processes and act accordingly (e.g., fail and inform the user, otherwise make the file read-only). This option is challenging for a number of reasons. First, this only applies to UNIX-like operating systems, and while that's all Dud supports right now, people are showing interest in Windows support (see #132). More critically though, I would either need to add an external dependency on the lsof binary, or use an equivalent Go package. In both of these cases, lsof is slow (e.g., macOS' native lsof runs in about 100ms when querying a single file), and the codebase will become more complex (especially to optimize lsof runtime e.g., by batching calls). Third, this is not an atomic operation and therefore suffers from race conditions between Dud and other processes.

Unfortunately, in light of all this, I would like to close this issue as wont-fix for the time being. Thank you again for bringing this issue to light and providing a thorough writeup! If you have other ideas for how to fix this in a lightweight way, please feel free to continue to use this thread; I am certainly not opposed to reopening it I've missed a better solution.