open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0

Potential race condition in aio_command_finish? #687

Open geordieintheshellcode opened 1 year ago

geordieintheshellcode commented 1 year ago

I'd like to get a second opinion on whether there is a potential race condition in aio_command_finish.

What does aio_command_finish do?

On entry into the function we call tcmur_tcmulib_cmd_complete which locks the device's cmds_list_lock and then calls tcmulib_command_complete to update the command ring tail. We then unlock the cmds_list_lock and call tcmulib_processing_complete to notify the kernel that there are commands to reap.

The Problem

We can have multiple threads calling into aio_command_finish concurrently. This means we can have one thread calling tcmulib_command_complete (updating the ring buffer tail) while another thread is calling tcmulib_processing_complete. tcmulib_processing_complete subsequently calls into tcmu_handle_completions on the kernel side (see target_core_user.c).

As far as I can tell there is insufficient synchronisation between the user-space thread which is updating the command ring tail pointer and the kernel thread which is reading the command ring tail pointer. Our tcmulib_command_complete thread is essentially doing:

tcmulib_command_complete()
    entry = list->tail
    entry->cmd_id = 42
    list->tail++

while our kernel thread is doing:

tcmu_handle_completion()
    while (last != READ_ONCE(list->tail))
        entry = last
        if (!find_cmd(entry->cmd_id))
            print_error("cmd_id %u not found, ring is broken\n")
        last++

Now, unless I'm misunderstanding something, I believe the CPU is entitled to re-order the stores in tcmulib_command_complete at runtime such that the increment of list->tail becomes visible BEFORE the assignment to entry->cmd_id. In pseudo-code, something like:

 entry = list->tail
 list->tail++
 entry->cmd_id = 42

Whether this happens, and how likely it is, probably depends on the CPU architecture. If it were to occur, it would be possible for tcmu_handle_completions to access an entry via list->tail before the store to cmd_id has been made visible. It would then read a bogus value for cmd_id: the call to find_cmd could fail, or worse, the wrong command could be processed. The READ_ONCE on the kernel side ensures that list->tail is always re-read from memory (rather than from a stale value cached in a register), but it DOES NOT ensure that the user-space stores to the entry become visible before the increment of the tail pointer does. As far as I understand, the store/load of list->tail should use the relevant atomics or a memory barrier to ensure correct synchronisation.
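For what it's worth, the textbook fix for this single-producer publish pattern is a release store on the tail paired with an acquire load on the reader. Here's a minimal, self-contained C11 sketch of that pattern (the `ring`/`tail`/`publish`/`consume` names are mine, not the actual tcmu-runner structures, and the kernel side obviously can't use stdatomic; this just illustrates the ordering that is needed):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024

struct entry {
    uint16_t cmd_id;
};

static struct entry ring[RING_SIZE];
static _Atomic uint32_t tail; /* written by the producer, read by the consumer */

/* Producer: fill in the entry FIRST, then publish it with a release
 * store on tail. The release store forbids the cmd_id write from being
 * reordered after the tail update, on any architecture. */
static void publish(uint16_t cmd_id)
{
    uint32_t t = atomic_load_explicit(&tail, memory_order_relaxed);

    ring[t % RING_SIZE].cmd_id = cmd_id;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
}

/* Consumer: an acquire load of tail guarantees that every entry up to
 * that index is fully initialised before we dereference it. Returns the
 * updated "last" index after handling any new entries. */
static uint32_t consume(uint32_t last, void (*handle)(uint16_t))
{
    uint32_t t = atomic_load_explicit(&tail, memory_order_acquire);

    while (last != t) {
        handle(ring[last % RING_SIZE].cmd_id);
        last++;
    }
    return last;
}
```

On x86 the plain stores happen to come out in program order anyway (which would explain why the repro below only fires on aarch64), but on weakly ordered CPUs the release/acquire pair is what actually provides the guarantee.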

I'd appreciate it if someone were to review my logic! I definitely feel like I could have missed something. I'll see if I can repro it with tcmu-runner.

geordieintheshellcode commented 1 year ago

I've managed to repro the problem on a couple of aarch64 machines (Fedora 36 (aarch64), 5.19.12-200.fc36.aarch64 and Rocky Linux release 8.4 (Green Obsidian), 4.18.0-305.3.1.el8_4.aarch64). I can't repro the problem on x86.

Repro steps:

  1. Check out this branch https://github.com/geordieintheshellcode/tcmu-runner/tree/tcmu-race-condition. It's master with some extra scripts and a modification to the CMakeLists.txt to install the file_example handler.
  2. Build and install. Run the tcmu-runner binary.
  3. Run ./create.sh to create a 1 GiB file-backed TCMU backstore and hook it up to a SCSI HBA/LUN.
  4. Find the TCMU SCSI device and run fio against it: sudo FILE=/dev/<dev_name> QD=128 BS=4k fio randwrite.fio

After a while (almost instantly on one of my test machines, on the other it takes a few minutes), you'll see a bunch of write commands failing in dmesg and the target_core_user module reports that the ring is corrupted:

[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 CDB: Write(10) 2a 00 00 1c c9 f8 00 00 08 00
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 CDB: Write(10) 2a 00 00 1d cd b8 00 00 08 00
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 CDB: Write(10) 2a 00 00 15 b1 78 00 00 08 00
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 CDB: Write(10) 2a 00 00 16 64 78 00 00 08 00
[Mon Oct  3 17:23:53 2022] cmd_id 6 not found, ring is broken
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#901 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:53 2022] ring broken, not handling completions

What's a bit confusing is that even after applying the following patch, which ensures only a single thread calls tcmur_tcmulib_cmd_complete and tcmulib_processing_complete at a time, I still see failed write commands, though no ring corruption this time:

diff --git a/tcmur_cmd_handler.c b/tcmur_cmd_handler.c
index 01d6a10..20266e0 100644
--- a/tcmur_cmd_handler.c
+++ b/tcmur_cmd_handler.c
@@ -64,18 +64,24 @@ void tcmur_tcmulib_cmd_complete(struct tcmu_device *dev,
        pthread_cleanup_pop(0);
 }

+static pthread_mutex_t mega_lock = PTHREAD_MUTEX_INITIALIZER;
+
 static void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
                               int rc)
 {
        struct tcmur_device *rdev = tcmu_dev_get_private(dev);
        int wake_up;

+       pthread_mutex_lock(&mega_lock);
+
        tcmur_tcmulib_cmd_complete(dev, cmd, rc);
        track_aio_request_finish(rdev, &wake_up);
        while (wake_up) {
                tcmulib_processing_complete(dev);
                track_aio_wakeup_finish(rdev, &wake_up);
        }
+
+       pthread_mutex_unlock(&mega_lock);
 }

dmesg:

[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Sense Key : Hardware Error [current]
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Add. Sense: Internal target failure
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 CDB: Write(10) 2a 00 00 1c 1a 78 00 00 08 00
[Tue Oct  4 11:47:03 2022] blk_update_request: critical target error, dev sda, sector 1841784 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0

and tcmu-runner.log:

2022-10-04 11:47:03.880 66678 [ERROR] check_iovec_length:143 file/file: iov len mismatch: iov len 4096, xfer len 40704, block size 512

This is quite odd, as fio is only writing 4096 B blocks! I wonder if we have a similar problem with the tcmur_cmdproc_thread thread calling tcmulib_get_next_command and clashing with the kernel thread which is updating the ring buffer. I've not had time to investigate this.

geordieintheshellcode commented 1 year ago

I think tcmulib_get_next_command is also broken on aarch64. The problem seems to be that we load the cmd_head index without the necessary memory barriers. Atomically loading the cmd_head index in device_cmd_head() seemingly fixes the issue. More investigation is required though. I will break this out into a separate issue (https://github.com/open-iscsi/tcmu-runner/issues/688).
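To make the shape of that fix concrete, here's a hedged sketch of what an acquire load in device_cmd_head() could look like, using the `__atomic` builtins that GCC/Clang provide. The `struct tcmu_mailbox` below is a minimal stand-in for illustration, not the real UAPI definition, and `device_cmd_head_sketch` is my name, not the actual libtcmu code:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the shared-memory mailbox layout. The real
 * struct tcmu_mailbox (target_core_user UAPI) has more fields; only
 * the two needed for the sketch are shown here. */
struct tcmu_mailbox {
    uint32_t cmdr_off;  /* offset of the command ring from the mailbox */
    uint32_t cmd_head;  /* written by the kernel, read by userspace */
};

/* Acquire-load cmd_head so that the kernel's stores to the new entry
 * (paired with a release-style store of cmd_head on the kernel side)
 * are guaranteed visible before we dereference the returned pointer.
 * A plain read gives the compiler and a weakly ordered CPU licence to
 * reorder, which would reproduce the symptom described above. */
static void *device_cmd_head_sketch(struct tcmu_mailbox *mb)
{
    uint32_t head = __atomic_load_n(&mb->cmd_head, __ATOMIC_ACQUIRE);

    return (char *)mb + mb->cmdr_off + head;
}
```

The same release/acquire pairing applies in this direction as in the completion path, just with the roles of producer (kernel) and consumer (userspace) swapped.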

lxbsz commented 1 year ago

@geordieintheshellcode This sounds reasonable and could happen IMO.

Could you raise one PR to fix this ? Thanks!

geordieintheshellcode commented 1 year ago

Sorry I've been quite busy recently. I will open a PR as soon as I can. Cheers