openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.66k stars 1.76k forks source link

Is range lock necessary in zvol read/write function? #11259

Closed Finix1979 closed 3 years ago

Finix1979 commented 3 years ago

Ask your question!

I noticed that zfs2.0 uses taskq to handle zvol read/write operations. Before creating dum tx, it acquires range lock. But zfs_rangelock_enter function has the possibility return NULL, while this NULL point will be used by zfs_rangelock_exit and that will cause problem. My concern is do we really need RL_WRITER/RL_READER lock? If Filesystem issue multiple write operations on the same positions(offset, len), this is undefined. Block layer does not care about that. if so, we could remove the range lock logic in zvol read/write completely.

Which portion of the codebase does your question involve?

master

Additional context

behlendorf commented 3 years ago

Good question, the change your referring to was made in PR #10163 and it contains a nice explanation of why the change was made. You're right that the order in which the operations are handled is undefined at the block layer. However, ZFS needs those operations to be atomic so there is still a need for the range lock and normally it should be uncontended.

Finix1979 commented 3 years ago

Thank you @behlendorf

I noticed this PR when I compare zvol code between 0.8.5 and 2.0, and PR already mentioned the write order is not defined in that case. You mentioned above that zfs needs those operations to be atomic, could you please tell me what's purpose of atomic for? In other words, if no range lock acquirement, what's the wrong consequence?

problame commented 3 years ago

I think the rationale is that modifying a single block on a block device is "defined" (where?) to be an atomic operation.

Without the range lock, I think the following race condition could happen:

@behlendorf could you confirm or correct that

  1. We want the per-block atomicity guarantee and therefore need to use range locks.
  2. Without the range lock, the race condition I outlined above could happen as described.
problame commented 3 years ago

Looking at what real hardware supports:

For NVMe devices, an update of one logical-block address (LBA) is guaranteed to be atomic but the interface is awkward so he is not sure if anyone is really using it. SCSI has a nice interface, with good error reporting, for writing atomically, but he has not seen a single device that implements it.

( https://lwn.net/Articles/789600/ )

==> NVMe Spec seems to be quite clear about this:

If a write command is submitted with size less than or equal to the AWUN/NAWUN value and the write command does not cross an atomic boundary (refer to section 6.4.3), then the host is guaranteed that the write command is atomic to the NVM with respect to other read or write commands. If a write command is submitted with size greater than the AWUN/NAWUN value or crosses an atomic boundary, then there is no guarantee of command atomicity. AWUN/NAWUN does not have any applicability to write errors caused by power failure or other error conditions (refer to Atomic Write Unit Power Fail).

( https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf )

==> SCSI spec

( https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf )


So at least for SCSI there are commands that do not require atomicity. Thus, for the 'export ZVOL over iSCSI' use case, it could happen that an initiator sends the WRITE (10) non-atomic command.

However, I'm not sure whether this happens in practice, and whether the Linux struct bio that we ultimately handle in zvol_request() can even represent the difference between WRITE and WRITE ATOMIC.

behlendorf commented 3 years ago

@behlendorf could you confirm or correct that

We want the per-block atomicity guarantee and therefore need to use range locks. Without the range lock, the race condition I outlined above could happen as described.

That's correct.