geordieintheshellcode opened 1 year ago
I've managed to repro the problem on a couple of aarch64 machines (Fedora 36 (aarch64), 5.19.12-200.fc36.aarch64, and Rocky Linux release 8.4 (Green Obsidian), 4.18.0-305.3.1.el8_4.aarch64). I can't repro the problem on x86.
Repro steps:
1. Build `master` with some extra scripts and a modification to the CMakeLists.txt to install the `file_example` handler.
2. Run the `tcmu-runner` binary.
3. `./create.sh` to create a 1 GiB file-backed TCMU backstore and hook it up to a SCSI HBA/LUN.
4. `sudo FILE=/dev/<dev_name> QD=128 BS=4k fio randwrite.fio`
After a while (almost instantly on one of my test machines; on the other it takes a few minutes), you'll see a bunch of write commands failing in dmesg, and the `target_core_user` module reports that the ring is corrupted:
```
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 CDB: Write(10) 2a 00 00 1c c9 f8 00 00 08 00
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 CDB: Write(10) 2a 00 00 1d cd b8 00 00 08 00
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct 3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 CDB: Write(10) 2a 00 00 15 b1 78 00 00 08 00
[Mon Oct 3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct 3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 CDB: Write(10) 2a 00 00 16 64 78 00 00 08 00
[Mon Oct 3 17:23:53 2022] cmd_id 6 not found, ring is broken
[Mon Oct 3 17:23:53 2022] sd 0:0:1:0: [sda] tag#901 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct 3 17:23:53 2022] ring broken, not handling completions
```
What's a bit confusing is that even after applying the following patch, to ensure that only a single thread calls `tcmur_tcmulib_cmd_complete` and `tcmulib_processing_complete` at once, I still see failed write commands, but no ring corruption this time:
```diff
diff --git a/tcmur_cmd_handler.c b/tcmur_cmd_handler.c
index 01d6a10..20266e0 100644
--- a/tcmur_cmd_handler.c
+++ b/tcmur_cmd_handler.c
@@ -64,18 +64,24 @@ void tcmur_tcmulib_cmd_complete(struct tcmu_device *dev,
 	pthread_cleanup_pop(0);
 }
 
+static pthread_mutex_t mega_lock = PTHREAD_MUTEX_INITIALIZER;
+
 static void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
 			       int rc)
 {
 	struct tcmur_device *rdev = tcmu_dev_get_private(dev);
 	int wake_up;
 
+	pthread_mutex_lock(&mega_lock);
+
 	tcmur_tcmulib_cmd_complete(dev, cmd, rc);
 	track_aio_request_finish(rdev, &wake_up);
 	while (wake_up) {
 		tcmulib_processing_complete(dev);
 		track_aio_wakeup_finish(rdev, &wake_up);
 	}
+
+	pthread_mutex_unlock(&mega_lock);
 }
```
dmesg:
```
[Tue Oct 4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Tue Oct 4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Sense Key : Hardware Error [current]
[Tue Oct 4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Add. Sense: Internal target failure
[Tue Oct 4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 CDB: Write(10) 2a 00 00 1c 1a 78 00 00 08 00
[Tue Oct 4 11:47:03 2022] blk_update_request: critical target error, dev sda, sector 1841784 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
```
and `tcmu-runner.log`:

```
2022-10-04 11:47:03.880 66678 [ERROR] check_iovec_length:143 file/file: iov len mismatch: iov len 4096, xfer len 40704, block size 512
```
This is quite odd, as the `fio` job is only writing 4096 B blocks! I wonder if we have a similar problem with the `tcmur_cmdproc_thread` thread calling `tcmulib_get_next_command` and clashing with the kernel thread which is updating the ring buffer. I've not had time to investigate this.
I think `tcmu_get_next_command` is also broken on aarch64. The problem seems to be that we load the `cmd_head` index without the necessary memory barriers. Atomically loading the `cmd_head` index in `device_cmd_head()` seemingly fixes the issue. More investigation is required though. I will break this out into a separate issue (https://github.com/open-iscsi/tcmu-runner/issues/688).
@geordieintheshellcode This sounds reasonable and could happen IMO.
Could you raise a PR to fix this? Thanks!
Sorry I've been quite busy recently. I will open a PR as soon as I can. Cheers
I'd like to get a second opinion on whether there is a potential race condition in `aio_command_finish`.

**What does `aio_command_finish` do?**

On entry into the function we call `tcmur_tcmulib_cmd_complete`, which locks the device's `cmds_list_lock` and then calls `tcmulib_command_complete` to update the command ring `tail`. We then unlock the `cmds_list_lock` and call `tcmulib_processing_complete` to notify the kernel that there are commands to reap.

**The problem**
We can have multiple threads calling into `aio_command_finish` concurrently. This means we can have one thread calling `tcmulib_command_complete` (updating the ring buffer tail) while another thread is calling `tcmulib_processing_complete`, which subsequently calls into `tcmu_handle_completions` on the kernel side (see target_core_user.c). As far as I can tell there is insufficient synchronisation between the user-space thread which updates the command ring tail pointer and the kernel thread which reads it. Our `tcmulib_command_complete` thread essentially stores the completion entry (including its `cmd_id`) and then advances `list->tail`, while the kernel thread reads `list->tail` and then reads the entries it covers.
Now, unless I'm misunderstanding something, I believe the CPU is entitled to re-order the instructions in `tcmulib_command_complete` at runtime such that the increment of `list->tail` happens BEFORE the assignment to `entry->data`. Whether this happens, and how likely it is to happen, probably depends on the CPU architecture. If this were to occur, it would be possible for `tcmu_handle_completions` to access an `entry` via `list->tail` before the store to `cmd_id` has been made visible. This would result in a bogus value for `cmd_id` being read: potentially the call to `find_cmd` would fail, or worse, the wrong command would be processed. The `READ_ONCE` on the kernel side ensures that `list->tail` is always re-read from memory (rather than from a value cached in a register), but it DOES NOT ensure that the stores to `entry` in user-space are made visible before the increment of the tail pointer. As far as I understand it, the store/load of `list->tail` should use the relevant atomics or a memory barrier to ensure the correct synchronisation.

I'd appreciate it if someone were to review my logic! I definitely feel like I could have missed something. I'll see if I can repro it with `tcmu-runner`.