nh2 opened 2 years ago
A relevant part of the 1.0.0 spec is this wording:

> The `Upload-Offset` header’s value MUST be equal to the current offset of the resource ... If the offsets do not match, the Server MUST respond with the `409 Conflict` status without modifying the upload resource.

~This wording is, in my opinion, a bit unfortunate, because it to some extent implies that idempotent concurrent updates to different offsets cannot exist.~

~It would be nice if the v2 version of the spec could have a wording that more easily permits idempotent updates.~

EDIT: I guess it's OK, because the `409 Conflict` is only mandated when the offsets do not match, so idempotent writes at the current offset are not ruled out.
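A minimal sketch of the check the spec describes (a hypothetical handler, not tusd's actual code; `currentOffset` would come from the upload's stored state):

```go
package main

import (
	"net/http"
	"strconv"
)

// handlePatch is a hypothetical sketch of the spec's offset check.
// currentOffset is the server-side offset of the upload resource.
func handlePatch(w http.ResponseWriter, r *http.Request, currentOffset int64) {
	clientOffset, err := strconv.ParseInt(r.Header.Get("Upload-Offset"), 10, 64)
	if err != nil || clientOffset != currentOffset {
		// Offsets do not match: respond with 409 Conflict without
		// modifying the upload resource.
		w.WriteHeader(http.StatusConflict)
		return
	}
	// Offsets match: write the request body starting at currentOffset.
	_ = r.Body // ... stream r.Body to the storage backend here ...
	w.WriteHeader(http.StatusNoContent)
}
```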
> This seems to be the result of using `O_APPEND` here
That's not the case. The requirement for proper locking is not just a result of our implementation of the disk data store in tusd. Instead, the need for locks is a result of a tradeoff we made when designing this protocol: who is responsible for ensuring that data is not corrupted or overwritten, the client or the server? You seem to assume that it is perfectly fine to let the client handle this responsibility, but I disagree.

Your argument is based on the idea that the storage backend (i.e. the disk here) supports concurrent writes at different offsets. This is not always the case, e.g. for the AWS S3 or GCS backend. Also keep in mind that the locking mechanism is designed to avoid the possibility of data loss or corruption in case of network interruptions. When the network is working fine, the locks are theoretically not needed. However, when network connections are interrupted, the server might not be notified about the interruption immediately and thus might think a request is still running while the client is already sending another request. To avoid data corruption in such cases for various data stores (not just disks), locking is needed.
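As a minimal illustration of the kind of exclusive access meant here (a plain in-process sketch, not tusd's actual locker implementation; a multi-machine deployment would need a shared locking service instead):

```go
package main

import "sync"

// uploadLocks is a simple in-process registry of per-upload mutexes.
type uploadLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newUploadLocks() *uploadLocks {
	return &uploadLocks{locks: make(map[string]*sync.Mutex)}
}

// acquire returns the mutex for the given upload ID, creating it if needed,
// and locks it so that only one request can touch the upload at a time.
func (u *uploadLocks) acquire(id string) *sync.Mutex {
	u.mu.Lock()
	l, ok := u.locks[id]
	if !ok {
		l = &sync.Mutex{}
		u.locks[id] = l
	}
	u.mu.Unlock()

	l.Lock()
	return l
}
```

A request handler would `acquire(uploadID)` before writing and `Unlock()` afterwards; a stale request that still holds the lock after a network interruption then blocks the new request instead of writing concurrently with it.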
Hope that helps!
> This is not always the case, e.g. for the AWS S3 or GCS backend
My issue is only about the `filestore.go` implementation, not other backends, and not about the spec (I thought for a moment in https://github.com/tus/tusd/issues/697#issuecomment-1088007440 that the spec might have a wording that makes my suggestion difficult, but then found that it most likely doesn't).

What I am saying is that I think tusd could implement `filestore` specifically without the need for locking. For example, you could run tusd on multiple machines against e.g. a CephFS mount, without the need for locks.
> a tradeoff we made when designing this protocol: who is responsible for ensuring that data is not corrupted or overwritten
The protocol does not seem to have an opinion on this topic. It appears to me that it would be a conformant implementation to accept whatever the client uploads. What type of corruption or overwriting do you have in mind? If a client intentionally or accidentally tried to rely on a `seek()` + `write()` implementation to make the server write a certain series of bytes, it could make the server write the same bytes using the current implementation as well.

In other words: I do not follow what scenario the current filestore implementation protects against that a non-`O_APPEND` implementation would not.

Could you give an example?
Assume the following situation without any locking: the server gets a request A to write at offset 0. Once A has written 100 bytes (and is still continuing to write), a new request B comes in to write at offset 100. Since the file's current offset is at 100, the server will accept request B. Now you have two requests A and B writing at the same file offset, which is a potential cause of data loss or corruption. You could argue (and I think you use that argument here) that it is not dangerous to write the same data at the same offset. In an ideal world, both write calls execute the same behavior and should yield the same result. However, I don't know enough about how the kernel and file systems work to judge whether this is safe or not. If I am unsure about the safety of concurrent executions, I prefer to use exclusive access to control it.

You can argue that it is the client's responsibility to avoid such potentially dangerous, concurrent requests. That is a fair assumption, but not one we have chosen for tus and tusd. Instead, tusd is designed to catch such situations and prevent them. In my experience, cases like the one described before, where two requests write at the same offset, can occur easily without the client intending to do so. This is sometimes caused by proxies or network switches that delay the flow of data.
Hope this makes my point of view clearer.
Also another comment: I do not see a problem with replacing `O_APPEND` with a call to `write` & `seek` on its own, as long as tusd uses a locking mechanism by default. If you want to and feel safe doing so, you can configure tusd not to use locking. That is totally doable programmatically at the moment.
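For reference, a rough sketch of what that programmatic configuration could look like, based on my reading of tusd v1's Go packages (the exact package paths and method names should be checked against the version you use): if you never register a locker on the store composer, no locking is performed.

```go
package main

import (
	"net/http"

	"github.com/tus/tusd/pkg/filestore"
	tusd "github.com/tus/tusd/pkg/handler"
)

func main() {
	store := filestore.New("./uploads")

	// Compose the store without registering a locker, so no locking is used.
	// (Normally you would add e.g. a memory- or file-based locker here.)
	composer := tusd.NewStoreComposer()
	store.UseIn(composer)

	handler, err := tusd.NewHandler(tusd.Config{
		BasePath:      "/files/",
		StoreComposer: composer,
	})
	if err != nil {
		panic(err)
	}

	http.Handle("/files/", http.StripPrefix("/files/", handler))
	http.ListenAndServe(":8080", nil)
}
```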
Thanks for your detailed replies!
> how the kernel and file systems work to judge whether this is safe or not
I can give some background on this.
POSIX 1003.1-2001 specifies in *2.9.7 Thread Interactions with Regular File Operations*:

> All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2017 when they operate on regular files or symbolic links: ... `write()` ...
>
> If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them.
There are a few considerations regarding this:

- `write()`s by different threads or processes to the same file descriptor or underlying file are thread-safe in the conventional sense: they won't crash or lock up the computer or result in garbage/corrupt data that was not provided to any of the `write()`s.
- The more interesting question is what happens with concurrent `seek()`+`write()` to an overlapping region. E.g. do concurrent `write("abcd")` / `write("1234")` result in `abcd` or `1234`, or possibly in partial overlaps like `a234` or `12cd`, but never in arbitrary interleavings like `a2c4`?
- Further, there is the distinction of how the write is done, e.g. via:
  - a single `open()`, to which 2 identical or different (e.g. via `dup()`) file descriptors may point
  - separate `open()`s of the same file path (and underlying inode)

  (See this for a detailed explanation of file descriptors and descriptions.)

In the Linux kernel, the following happens:

- Writes through the same file description (that is, the same `open()`): Linux serializes `write()`s on the file description using a lock (`f_pos_lock` in `struct file`). (A `ssize_t bytes_written = write(..)` may of course be a short write, meaning it wrote less than was asked for.)
- Writes through different file descriptions (separate `open()`s): then Linux may interleave output. Linux is intentionally not fully POSIX compliant regarding this (see the last paragraph of Linus's post linked).

Thus, concurrent writes are safe (in the sense that you care about) on Linux and other POSIX systems, so doing concurrent `write(bytes)` of the same bytestring at the same offset should always produce the same result as if only one of them were executed, or if they were executed sequentially. (This is of course assuming that multiple `open()`s are used that each track their own file seek offset, otherwise "at the same offset" would not be given.)

This is why I think that my proposal does not shift the avoidance of dangerous concurrent requests to the client; POSIX (as the backend of `filestore.go`) already fundamentally provides the necessary thread-safety.
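To make the claim concrete, here is a small Go sketch of my own (not code from tusd) that opens the same file twice, so each handle has its own file description and seek offset, and writes the same bytes at the same offset concurrently via `seek()`+`write()`:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	const path = "upload.bin"
	const offset = 100
	data := []byte("same chunk of data")

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine uses its own open(), i.e. its own file
			// description with its own seek offset.
			f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
			if err != nil {
				panic(err)
			}
			defer f.Close()

			if _, err := f.Seek(offset, io.SeekStart); err != nil {
				panic(err)
			}
			// Both goroutines write the same bytes at the same offset,
			// so the end result is the same regardless of ordering.
			if _, err := f.Write(data); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()

	fmt.Println("both writers finished; the bytes at offset 100 are the same either way")
}
```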
> I do not see a problem with replacing `O_APPEND` with a call to `write` & `seek` on its own, as long as tusd uses a locking mechanism by default.
Yes, that makes sense. The optimisation to not use locking could only be used when `filestore.go` is writing to a sane POSIX file system (or a POSIX-like one, in the way that Linux is). It may not be safe on non-POSIX systems like an `rclone mount` backed by blob storage like Dropbox or S3.
Even better than `seek()` + `write()` would be to use `pwrite()` (Go's `os.File.WriteAt()`), which combines the two for performance. But it does not really affect the argument whether concurrent writes are safe in general.
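For illustration (my own snippet, not tusd code), the `seek()`+`write()` pair from the sketch above collapses into a single positioned write:

```go
package main

import "os"

func main() {
	f, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// WriteAt issues a pwrite(2): it writes at the given offset without
	// touching the file description's seek position, so concurrent writers
	// do not interfere with each other's offsets at all.
	if _, err := f.WriteAt([]byte("same chunk of data"), 100); err != nil {
		panic(err)
	}
}
```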
I am currently implementing my own TUS server backed by CephFS, where I'm using my proposal to avoid the need for distributed locking (and thus more simplicity / easier operation). I will report if I find anything that makes the approach infeasible; otherwise maybe `tusd` also wants to use this optimisation.
Thank you very much for this detailed write-up about the properties of `write`! It provided a lot of insight that was not clear to me before.
> Yes, that makes sense. The optimisation to not use locking could only be used when `filestore.go` is writing to a sane POSIX file system (or a POSIX-like one, in the way that Linux is). It may not be safe on non-POSIX systems like an `rclone mount` backed by blob storage like Dropbox or S3.
Probably, yes. I know that some people use tusd with NFS and that might also be problematic then.
> Even better than `seek()` + `write()` would be to use `pwrite()` (Go's `os.File.WriteAt()`), which combines the two for performance.
Might be a bit harder to implement, as we write an `io.Reader` to the file and not byte slices. Otherwise, it would be a good alternative.
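For what it's worth, an `io.Reader` can still be streamed to a fixed offset without `O_APPEND`; a minimal sketch of my own (assuming Go 1.20+ for `io.NewOffsetWriter`) could look like this:

```go
package main

import (
	"io"
	"os"
	"strings"
)

// writeChunkAt copies src (e.g. a request body) into the file starting at
// offset, without relying on O_APPEND or the file's seek position.
func writeChunkAt(f *os.File, offset int64, src io.Reader) (int64, error) {
	// io.NewOffsetWriter (Go 1.20+) wraps the io.WriterAt as an io.Writer
	// that starts at offset, so io.Copy can stream the reader into it.
	return io.Copy(io.NewOffsetWriter(f, offset), src)
}

func main() {
	f, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := writeChunkAt(f, 100, strings.NewReader("chunk from the client")); err != nil {
		panic(err)
	}
}
```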
> I am currently implementing my own TUS server backed by CephFS, where I'm using my proposal to avoid the need for distributed locking (and thus more simplicity / easier operation). I will report if I find anything that makes the approach infeasible
Sure, let us know if you have any other questions or find something interesting :)
In https://tus.io/faq.html#how-do-i-scale-tus there is the annoying requirement that uploads must be locked.
This seems to be the result of using `O_APPEND` here: https://github.com/tus/tusd/blob/27957bd22b9bf6b86a86aec3cc054558a8cb6518/pkg/filestore/filestore.go#L158

But why is that? Given that the TUS protocol communicates file offsets exactly in writes, couldn't `tusd` just `seek()` to the correct offset and perform the write there? Then potentially concurrent writes would write the same data at the same offsets, thus being idempotent, and thus safe. No locking would be needed.
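In Go terms, the difference I have in mind is roughly the following (my own sketch, not the actual filestore.go code; the `offset` value is hypothetical and would come from the request's `Upload-Offset` header):

```go
package main

import (
	"io"
	"os"
)

func main() {
	// Current approach (as I understand it): open with O_APPEND and rely on
	// locking so that only one request appends to the upload at a time.
	appendFile, err := os.OpenFile("upload.bin", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	if _, err := appendFile.Write([]byte("chunk")); err != nil {
		panic(err)
	}
	appendFile.Close()

	// Proposed approach: open without O_APPEND and seek to the offset the
	// client announced, so identical concurrent writes land on the same bytes.
	offset := int64(5) // hypothetical value taken from the Upload-Offset header
	seekFile, err := os.OpenFile("upload.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	if _, err := seekFile.Seek(offset, io.SeekStart); err != nil {
		panic(err)
	}
	if _, err := seekFile.Write([]byte("chunk")); err != nil {
		panic(err)
	}
	seekFile.Close()
}
```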
What do you think?