Is range lock necessary in zvol read/write function?

Finix1979 commented 3 years ago

Ask your question!

I noticed that zfs2.0 uses taskq to handle zvol read/write operations. Before creating dum tx, it acquires range lock. But zfs_rangelock_enter function has the possibility return NULL, while this NULL point will be used by zfs_rangelock_exit and that will cause problem. My concern is do we really need RL_WRITER/RL_READER lock? If Filesystem issue multiple write operations on the same positions(offset, len), this is undefined. Block layer does not care about that. if so, we could remove the range lock logic in zvol read/write completely.

Which portion of the codebase does your question involve?

master

Additional context

behlendorf commented 3 years ago

Good question, the change your referring to was made in PR #10163 and it contains a nice explanation of why the change was made. You're right that the order in which the operations are handled is undefined at the block layer. However, ZFS needs those operations to be atomic so there is still a need for the range lock and normally it should be uncontended.

Finix1979 commented 3 years ago

Thank you @behlendorf

I noticed this PR when I compare zvol code between 0.8.5 and 2.0, and PR already mentioned the write order is not defined in that case. You mentioned above that zfs needs those operations to be atomic, could you please tell me what's purpose of atomic for? In other words, if no range lock acquirement, what's the wrong consequence?

problame commented 3 years ago

I think the rationale is that modifying a single block on a block device is "defined" (where?) to be an atomic operation.

Without the range lock, I think the following race condition could happen:

two threads T1, T2, both execute dmu_write() on the same offset within the same object
T1 completes dmu_buf_hold_array()
- calls dbuf_hold_impl()
- dbuf_find() doesn't find dbufs => use dbuf_create() to allocate a dbuf
T2 completes dmu_buf_hold_array()
- dbuf_find() re-uses the dbufs allocated for thread T1
At this point T1 and T2 both hold pointers to the same dbufs
T1 and T2 simulatenously execute dmu_write_impl()
- The memcpy() calls have a data race
- Maybe dmu_buf_will_fill(), dmu_buf_will_dirty(), or dmu_buf_fill_done() have a data race as well.

@behlendorf could you confirm or correct that

We want the per-block atomicity guarantee and therefore need to use range locks.
Without the range lock, the race condition I outlined above could happen as described.

problame commented 3 years ago

Looking at what real hardware supports:

For NVMe devices, an update of one logical-block address (LBA) is guaranteed to be atomic but the interface is awkward so he is not sure if anyone is really using it. SCSI has a nice interface, with good error reporting, for writing atomically, but he has not seen a single device that implements it.

( https://lwn.net/Articles/789600/ )

==> NVMe Spec seems to be quite clear about this:

If a write command is submitted with size less than or equal to the AWUN/NAWUN value and the write command does not cross an atomic boundary (refer to section 6.4.3), then the host is guaranteed that the write command is atomic to the NVM with respect to other read or write commands. If a write command is submitted with size greater than the AWUN/NAWUN value or crosses an atomic boundary, then there is no guarantee of command atomicity. AWUN/NAWUN does not have any applicability to write errors caused by power failure or other error conditions (refer to Atomic Write Unit Power Fail).

( https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf )

==> SCSI spec

SCSI command WRITE (10)
SCSI command WRITE ATOMIC (16)

( https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf )

So at least for SCSI there are commands that do not require atomicity. Thus, for the 'export ZVOL over iSCSI' use case, it could happen that an initiator sends the WRITE (10) non-atomic command.

However, I'm not sure whether this happens in practice, and whether the Linux struct bio that we ultimately handle in zvol_request() can even represent the difference between WRITE and WRITE ATOMIC.

behlendorf commented 3 years ago

@behlendorf could you confirm or correct that

We want the per-block atomicity guarantee and therefore need to use range locks. Without the range lock, the race condition I outlined above could happen as described.

That's correct.

openzfs / zfs