Closed Finix1979 closed 3 years ago
Good question, the change your referring to was made in PR #10163 and it contains a nice explanation of why the change was made. You're right that the order in which the operations are handled is undefined at the block layer. However, ZFS needs those operations to be atomic so there is still a need for the range lock and normally it should be uncontended.
Thank you @behlendorf
I noticed this PR when I compare zvol code between 0.8.5 and 2.0, and PR already mentioned the write order is not defined in that case. You mentioned above that zfs needs those operations to be atomic, could you please tell me what's purpose of atomic for? In other words, if no range lock acquirement, what's the wrong consequence?
I think the rationale is that modifying a single block on a block device is "defined" (where?) to be an atomic operation.
Without the range lock, I think the following race condition could happen:
dmu_write()
on the same offset within the same objectdmu_buf_hold_array()
dbuf_hold_impl()
dbuf_find()
doesn't find dbufs => use dbuf_create()
to allocate a dbufdmu_buf_hold_array()
dbuf_find()
re-uses the dbufs allocated for thread T1dmu_write_impl()
memcpy()
calls have a data racedmu_buf_will_fill()
, dmu_buf_will_dirty()
, or dmu_buf_fill_done()
have a data race as well.@behlendorf could you confirm or correct that
Looking at what real hardware supports:
For NVMe devices, an update of one logical-block address (LBA) is guaranteed to be atomic but the interface is awkward so he is not sure if anyone is really using it. SCSI has a nice interface, with good error reporting, for writing atomically, but he has not seen a single device that implements it.
( https://lwn.net/Articles/789600/ )
==> NVMe Spec seems to be quite clear about this:
If a write command is submitted with size less than or equal to the AWUN/NAWUN value and the write command does not cross an atomic boundary (refer to section 6.4.3), then the host is guaranteed that the write command is atomic to the NVM with respect to other read or write commands. If a write command is submitted with size greater than the AWUN/NAWUN value or crosses an atomic boundary, then there is no guarantee of command atomicity. AWUN/NAWUN does not have any applicability to write errors caused by power failure or other error conditions (refer to Atomic Write Unit Power Fail).
( https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf )
==> SCSI spec
WRITE (10)
WRITE ATOMIC (16)
( https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf )
So at least for SCSI there are commands that do not require atomicity.
Thus, for the 'export ZVOL over iSCSI' use case, it could happen that an initiator sends the WRITE (10)
non-atomic command.
However, I'm not sure whether this happens in practice, and whether the Linux struct bio
that we ultimately handle in zvol_request()
can even represent the difference between WRITE
and WRITE ATOMIC
.
@behlendorf could you confirm or correct that
We want the per-block atomicity guarantee and therefore need to use range locks. Without the range lock, the race condition I outlined above could happen as described.
That's correct.
Ask your question!
I noticed that zfs2.0 uses taskq to handle zvol read/write operations. Before creating dum tx, it acquires range lock. But zfs_rangelock_enter function has the possibility return NULL, while this NULL point will be used by zfs_rangelock_exit and that will cause problem. My concern is do we really need RL_WRITER/RL_READER lock? If Filesystem issue multiple write operations on the same positions(offset, len), this is undefined. Block layer does not care about that. if so, we could remove the range lock logic in zvol read/write completely.
Which portion of the codebase does your question involve?
master
Additional context