nh2 opened 2 years ago
A relevant part of the 1.0.0 spec is this wording:

> The `Upload-Offset` header’s value MUST be equal to the current offset of the resource ... If the offsets do not match, the Server MUST respond with the `409 Conflict` status without modifying the upload resource.

~This wording is, in my opinion, a bit unfortunate, because it to some extent implies that idempotent concurrent updates to different offsets cannot exist.~

~It would be nice if the v2 version of the spec could have a wording that more easily permits idempotent updates.~

EDIT: I guess it's OK, because the `409 Conflict` is only mandated when the offsets do not match, so idempotent writes at the current offset are not ruled out.
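A minimal sketch of the check the spec describes (a hypothetical handler, not tusd's actual code; `currentOffset` would come from the upload's stored state):

```go
package main

import (
	"net/http"
	"strconv"
)

// handlePatch is a hypothetical sketch of the spec's offset check.
// currentOffset is the server-side offset of the upload resource.
func handlePatch(w http.ResponseWriter, r *http.Request, currentOffset int64) {
	clientOffset, err := strconv.ParseInt(r.Header.Get("Upload-Offset"), 10, 64)
	if err != nil || clientOffset != currentOffset {
		// Offsets do not match: respond with 409 Conflict without
		// modifying the upload resource.
		w.WriteHeader(http.StatusConflict)
		return
	}
	// Offsets match: write the request body starting at currentOffset.
	_ = r.Body // ... stream r.Body to the storage backend here ...
	w.WriteHeader(http.StatusNoContent)
}
```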
> This seems to be the result of using `O_APPEND` here
That's not the case. The requirement for proper locking is not just a result of our implementation of the disk data store in tusd. Instead, the need for locks is a result of a tradeoff we made when designing this protocol: who is responsible for ensuring that data is not corrupted or overwritten, the client or the server? You seem to assume that it is perfectly fine to let the client handle this responsibility, but I disagree.

Your argument is based on the idea that the storage backend (i.e. the disk here) supports concurrent writes at different offsets. This is not always the case, e.g. for the AWS S3 or GCS backend. Also keep in mind that the locking mechanism is designed to avoid the possibility of data loss or corruption in case of network interruptions. When the network is working fine, the locks are theoretically not needed. However, when network connections are interrupted, the server might not be notified about the interruption immediately and thus might think a request is still running while the client is already sending another request. To avoid data corruption in such cases for various data stores (not just disks), locking is needed.
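As a minimal illustration of the kind of exclusive access meant here (a plain in-process sketch, not tusd's actual locker implementation; a multi-machine deployment would need a shared locking service instead):

```go
package main

import "sync"

// uploadLocks is a simple in-process registry of per-upload mutexes.
type uploadLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newUploadLocks() *uploadLocks {
	return &uploadLocks{locks: make(map[string]*sync.Mutex)}
}

// acquire returns the mutex for the given upload ID, creating it if needed,
// and locks it so that only one request can touch the upload at a time.
func (u *uploadLocks) acquire(id string) *sync.Mutex {
	u.mu.Lock()
	l, ok := u.locks[id]
	if !ok {
		l = &sync.Mutex{}
		u.locks[id] = l
	}
	u.mu.Unlock()

	l.Lock()
	return l
}
```

A request handler would `acquire(uploadID)` before writing and `Unlock()` afterwards; a stale request that still holds the lock after a network interruption then blocks the new request instead of writing concurrently with it.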
Hope that helps!
> This is not always the case, e.g. for the AWS S3 or GCS backend
My issue is only about the `filestore.go` implementation, not other backends, and not about the spec (I thought for a moment in https://github.com/tus/tusd/issues/697#issuecomment-1088007440 that the spec might have a wording that makes my suggestion difficult, but then found that it most likely doesn't).

What I am saying is that I think tusd could implement `filestore` specifically without the need for locking. For example, you could run tusd on multiple machines against e.g. a CephFS mount, without the need for locks.
> a tradeoff we made when designing this protocol: who is responsible for ensuring that data is not corrupted or overwritten
The protocol does not seem to have an opinion on this topic. It appears to me that it would be a conformant implementation to accept whatever the client uploads. What type of corruption or overwriting do you have in mind? If a client intentionally or accidentally tried to rely on a `seek()` + `write()` implementation to make the server write a certain series of bytes, it could make the server write the same bytes using the current implementation as well.

In other words: I do not follow what scenario the current filestore implementation protects against that a non-`O_APPEND` implementation would not.

Could you give an example?
Assume the following situation without any locking: the server gets a request A to write at offset 0. Once A has written 100 bytes (and is still continuing to write), a new request B comes in to write at offset 100. Since the file's current offset is at 100, the server will accept request B. Now you have two requests A and B writing at the same file offset, which is a potential cause of data loss or corruption. You could argue (and I think you use that argument here) that it is not dangerous to write the same data at the same offset. In an ideal world, both write calls execute the same behavior and should yield the same result. However, I don't know enough about how the kernel and file systems work to judge whether this is safe or not. If I am unsure about the safety of concurrent executions, I prefer to use exclusive access to control it.

You can argue that it is the client's responsibility to avoid such potentially dangerous, concurrent requests. That is a fair assumption, but not one we have chosen for tus and tusd. Instead, tusd is designed to catch such situations and prevent them. In my experience, cases like the one described before, where two requests write at the same offset, can occur easily without the client intending to do so. This is sometimes caused by proxies or network switches that delay the flow of data.
Hope this makes my point of view clearer.
Also another comment: I do not see a problem with replacing `O_APPEND` with a call to `write` & `seek` on its own, as long as tusd uses a locking mechanism by default. If you want to and feel safe doing so, you can configure tusd not to use locking. That is totally doable programmatically at the moment.
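For reference, a rough sketch of what that programmatic configuration could look like, based on my reading of tusd v1's Go packages (the exact package paths and method names should be checked against the version you use): if you never register a locker on the store composer, no locking is performed.

```go
package main

import (
	"net/http"

	"github.com/tus/tusd/pkg/filestore"
	tusd "github.com/tus/tusd/pkg/handler"
)

func main() {
	store := filestore.New("./uploads")

	// Compose the store without registering a locker, so no locking is used.
	// (Normally you would add e.g. a memory- or file-based locker here.)
	composer := tusd.NewStoreComposer()
	store.UseIn(composer)

	handler, err := tusd.NewHandler(tusd.Config{
		BasePath:      "/files/",
		StoreComposer: composer,
	})
	if err != nil {
		panic(err)
	}

	http.Handle("/files/", http.StripPrefix("/files/", handler))
	http.ListenAndServe(":8080", nil)
}
```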
Thanks for your detailed replies!
> how the kernel and file systems work to judge whether this is safe or not
I can give some background on this.
POSIX 1003.1-2001 specifies in *2.9.7 Thread Interactions with Regular File Operations*:

> All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2017 when they operate on regular files or symbolic links: ... `write()` ...
>
> If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
There are a few considerations regarding this:

- `write()`s by different threads or processes to the same file descriptor or underlying file are thread-safe in the conventional sense: they won't crash or lock up the computer or result in garbage/corrupt data that was not provided to any of the `write()`s.
- The more interesting question is what happens with concurrent `seek()`+`write()` to an overlapping region. E.g. do concurrent `write("abcd")` / `write("1234")` result in `abcd` or `1234`, or possibly in partial overlaps like `a234` or `12cd`, but never in arbitrary interleavings like `a2c4`?
- Further, there is the distinction of how the write is done, e.g. via:
  - a single `open()`, to which 2 identical or different (e.g. via `dup()`) file descriptors may point
  - separate `open()`s of the same file path (and underlying inode)

  (See this for a detailed explanation of file descriptors and descriptions.)

In the Linux kernel, the following happens:

- Writes through the same file description (that is, the same `open()`): Linux serializes `write()`s on the file description using a lock (`f_pos_lock` in `struct file`). (A `ssize_t bytes_written = write(..)` may of course be a short write, meaning it wrote less than was asked for.)
- Writes through different file descriptions (separate `open()`s): then Linux may interleave output. Linux is intentionally not fully POSIX compliant regarding this (see the last paragraph of Linus's post linked).

Thus, concurrent writes are safe (in the sense that you care about) on Linux and other POSIX systems, so doing concurrent `write(bytes)` of the same bytestring at the same offset should always produce the same result as if only one of them were executed, or if they were executed sequentially. (This is of course assuming that multiple `open()`s are used that each track their own file seek offset, otherwise "at the same offset" would not be given.)

This is why I think that my proposal does not shift the avoidance of dangerous concurrent requests to the client; POSIX (as the backend of `filestore.go`) already fundamentally provides the necessary thread-safety.
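To make the claim concrete, here is a small Go sketch of my own (not code from tusd) that opens the same file twice, so each handle has its own file description and seek offset, and writes the same bytes at the same offset concurrently via `seek()`+`write()`:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	const path = "upload.bin"
	const offset = 100
	data := []byte("same chunk of data")

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine uses its own open(), i.e. its own file
			// description with its own seek offset.
			f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
			if err != nil {
				panic(err)
			}
			defer f.Close()

			if _, err := f.Seek(offset, io.SeekStart); err != nil {
				panic(err)
			}
			// Both goroutines write the same bytes at the same offset,
			// so the end result is the same regardless of ordering.
			if _, err := f.Write(data); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()

	fmt.Println("both writers finished; the bytes at offset 100 are the same either way")
}
```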
> I do not see a problem with replacing `O_APPEND` with a call to `write` & `seek` on its own, as long as tusd uses a locking mechanism by default.
Yes, that makes sense. The optimisation to not use locking could only be used when `filestore.go` is writing to a sane POSIX file system (or a POSIX-like one, in the way that Linux is). It may not be safe on non-POSIX systems like an `rclone mount` backed by blob storage like Dropbox or S3.
Even better than `seek()` + `write()` would be to use `pwrite()` (Go's `os.File.WriteAt()`), which combines the two for performance. But it does not really affect the argument whether concurrent writes are safe in general.
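For illustration (my own snippet, not tusd code), the `seek()`+`write()` pair from the sketch above collapses into a single positioned write:

```go
package main

import "os"

func main() {
	f, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// WriteAt issues a pwrite(2): it writes at the given offset without
	// touching the file description's seek position, so concurrent writers
	// do not interfere with each other's offsets at all.
	if _, err := f.WriteAt([]byte("same chunk of data"), 100); err != nil {
		panic(err)
	}
}
```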
I am currently implementing my own TUS server backed by CephFS, where I'm using my proposal to avoid the need for distributed locking (and thus more simplicity / easier operation). I will report if I find anything that makes the approach infeasible; otherwise maybe `tusd` also wants to use this optimisation.
Thank you very much for this detailed write-up about the properties of `write`! It provided a lot of insight that was not clear to me before.
> Yes, that makes sense. The optimisation to not use locking could only be used when `filestore.go` is writing to a sane POSIX file system (or a POSIX-like one, in the way that Linux is). It may not be safe on non-POSIX systems like an `rclone mount` backed by blob storage like Dropbox or S3.
Probably, yes. I know that some people use tusd with NFS and that might also be problematic then.
> Even better than `seek()` + `write()` would be to use `pwrite()` (Go's `os.File.WriteAt()`), which combines the two for performance.
Might be a bit harder to implement, as we write an `io.Reader` to the file and not byte slices. Otherwise, it would be a good alternative.
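For what it's worth, an `io.Reader` can still be streamed to a fixed offset without `O_APPEND`; a minimal sketch of my own (assuming Go 1.20+ for `io.NewOffsetWriter`) could look like this:

```go
package main

import (
	"io"
	"os"
	"strings"
)

// writeChunkAt copies src (e.g. a request body) into the file starting at
// offset, without relying on O_APPEND or the file's seek position.
func writeChunkAt(f *os.File, offset int64, src io.Reader) (int64, error) {
	// io.NewOffsetWriter (Go 1.20+) wraps the io.WriterAt as an io.Writer
	// that starts at offset, so io.Copy can stream the reader into it.
	return io.Copy(io.NewOffsetWriter(f, offset), src)
}

func main() {
	f, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := writeChunkAt(f, 100, strings.NewReader("chunk from the client")); err != nil {
		panic(err)
	}
}
```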
> I am currently implementing my own TUS server backed by CephFS, where I'm using my proposal to avoid the need for distributed locking (and thus more simplicity / easier operation). I will report if I find anything that makes the approach infeasible
Sure, let us know if you have any other questions or find something interesting :)
In https://tus.io/faq.html#how-do-i-scale-tus there is the annoying requirement that uploads must be locked.
This seems to be the result of using `O_APPEND` here: https://github.com/tus/tusd/blob/27957bd22b9bf6b86a86aec3cc054558a8cb6518/pkg/filestore/filestore.go#L158

But why is that? Given that the TUS protocol communicates file offsets exactly in writes, couldn't `tusd` just `seek()` to the correct offset and perform the write there? Then potentially concurrent writes would write the same data at the same offsets, thus being idempotent, and thus safe. No locking would be needed.
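In Go terms, the difference I have in mind is roughly the following (my own sketch, not the actual filestore.go code; the `offset` value is hypothetical and would come from the request's `Upload-Offset` header):

```go
package main

import (
	"io"
	"os"
)

func main() {
	// Current approach (as I understand it): open with O_APPEND and rely on
	// locking so that only one request appends to the upload at a time.
	appendFile, err := os.OpenFile("upload.bin", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	if _, err := appendFile.Write([]byte("chunk")); err != nil {
		panic(err)
	}
	appendFile.Close()

	// Proposed approach: open without O_APPEND and seek to the offset the
	// client announced, so identical concurrent writes land on the same bytes.
	offset := int64(5) // hypothetical value taken from the Upload-Offset header
	seekFile, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	if _, err := seekFile.Seek(offset, io.SeekStart); err != nil {
		panic(err)
	}
	if _, err := seekFile.Write([]byte("chunk")); err != nil {
		panic(err)
	}
	seekFile.Close()
}
```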
What do you think?