open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0

Looking for some info on 1.4.0 release #449

Open draghuram opened 5 years ago

draghuram commented 5 years ago

Hi, I see that 1.4.0-rc1 was released in May. Are there any details on when 1.4.0 might be released? We are currently using 1.3.0 and seeing errors such as these in /var/log/messages:

kernel: sd 3:0:1:1: [sdg] tag#24 Sense Key : Not Ready [current] [descriptor] 
kernel: sd 3:0:1:1: [sdg] tag#12 CDB: Write(16) 8a 00 00 00 00 00 01 e7 51 80 00 00 00 80 00 00
kernel: sd 3:0:1:1: [sdg] tag#8 Add. Sense: Logical unit communication failure
kernel: sd 3:0:1:1: [sdg] tag#24 Add. Sense: Logical unit communication failure
kernel: sd 3:0:1:1: [sdg] tag#8 CDB: Write(16) 8a 00 00 00 00 00 01 e7 4f 00 00 00 00 80 00 00
kernel: blk_update_request: I/O error, dev sdg, sector 31936256

I verified that the request for this sector never made it to our handler, and a quick search brings up a couple of links that point to possible race conditions in TCMU (not sure whether in the kernel or in tcmu-runner). In any case, I want to try a later version of tcmu-runner to see if these types of errors go away. We are using "CentOS Linux release 7.5.1804".
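For reference, the failing sector can be cross-checked against the CDB in the log above: in a WRITE(16) CDB, bytes 2-9 are the big-endian LBA and bytes 10-13 are the transfer length. A quick decode (a sketch in Python; the CDB string is copied from the tag#8 line):

```python
# Decode the WRITE(16) CDB from the "tag#8" kernel log line above.
cdb = bytes.fromhex("8a 00 00 00 00 00 01 e7 4f 00 00 00 00 80 00 00".replace(" ", ""))

assert cdb[0] == 0x8A                       # WRITE(16) opcode
lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
nblocks = int.from_bytes(cdb[10:14], "big") # bytes 10-13: transfer length

print(lba, nblocks)  # 31936256 128 -- matches "I/O error, dev sdg, sector 31936256"
```

So the failed write is the exact command from the tag#8 line, which makes it easy to confirm whether a given LBA ever reached the handler.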

Any help is greatly appreciated.

Thanks, Raghu

mikechristie commented 5 years ago

I am trying to finalize the release this month. Red Hat QA just started testing it.

What is the race you are referring to? I have never heard of it and there are no new fixes for something like that in the next release.

mikechristie commented 5 years ago

The error you are seeing on the initiator side in the log above is a generic error code that occurs when:

  1. The handler has not set a known error code, or the handler is passing in that sense code.
  2. You are using an old kernel that ignores the sense/status code passed in from target_core_user and/or tcmu-runner.
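For what it's worth, the message in the initiator log maps directly onto SCSI fixed-format sense data: the low nibble of byte 2 is the sense key ("Not Ready" is 0x2) and bytes 12/13 are the ASC/ASCQ ("logical unit communication failure" is 0x08/0x00). A small sketch, where the sense buffer is hypothetical and built to match the log, not captured from a real device:

```python
# Hypothetical fixed-format sense buffer matching the initiator log:
# sense key 0x2 (Not Ready), ASC/ASCQ 0x08/0x00 (logical unit communication failure).
sense = bytes([0x70, 0x00, 0x02, 0, 0, 0, 0, 0x0A,   # response code; key in byte 2
               0, 0, 0, 0, 0x08, 0x00, 0, 0, 0, 0])  # ASC in byte 12, ASCQ in byte 13

key = sense[2] & 0x0F
asc, ascq = sense[12], sense[13]
print(hex(key), hex(asc), hex(ascq))  # 0x2 0x8 0x0
```

This 0x08/0x00 combination is the generic code referred to above, which is why it doesn't by itself identify where the failure happened.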
draghuram commented 5 years ago

I was referring to this: https://bugzilla.redhat.com/show_bug.cgi?id=1482376

The reason I suspected the issue might not be in the handler was that the handler never received requests for these sectors, but I will double-check. We are using "CentOS Linux release 7.5.1804", which seems to be the latest.

Thanks.

mikechristie commented 5 years ago

That bz does not seem to have anything to do with tcmu-runner as far as I can tell. I think it is a gluster-specific bug.

There is a bunch of private gluster discussion on that bz. Are you using the glfs tcmu-runner handler? If so, I will cc them here so we can get more info on the issue and what the resolution is. I think the resolution is basically upgrading to their new release, which is not done yet.

draghuram commented 5 years ago

Thanks for the response. We are not using gluster but wrote our own handler. I just wanted to make sure the issue is not related to tcmu-runner.

mikechristie commented 5 years ago

The problem is that that error code is a generic one used in many places in some versions of LIO. To narrow it down, could you give me the following:

  1. If you can trigger the problem without doing a lot of IO, enable SCSI command debugging by setting:

log_level = 5

in /etc/tcmu/tcmu.conf. If it takes a lot of IO, can you just set

log_level = 4

and then attach /var/log/tcmu-runner.log.

  2. Enable target-side debugging:

echo "module target_core_user +p" > /sys/kernel/debug/dynamic_debug/control
echo "module target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control

and then, if you are using the iscsi target module, do:

echo "module iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control

and then also attach the kernel logs.

  3. Tell me the kernel version you are using.

  4. If you can attach your tcmu-runner handler code, it would be really helpful. If not, could you just add some log messages in your write() callback, or in your handle_cmd() if you use that to handle WRITEs?

  5. Could you also describe when this is happening? Are you doing anything special like failover? Are you using ALUA by any chance?

SidBB commented 5 years ago

We're creating a couple of 512 TB devices, creating a ZFS filesystem on each, and then writing a lot of data to both devices in parallel. These I/O errors typically show up after about 30-40 GB have been written to each, but in some cases in the past they've occurred much earlier.

Run 1: tcmu-runner/master; log_level=5; cmd_time_out=30

The syslog shows a ZFS checksum verification panic at Aug 4 20:02:46, an indication of an I/O error.

tcmu-runner.log: https://gist.github.com/SidBB/7edb57b22567461c00b7935d25b0a3f5
messages: https://gist.github.com/SidBB/53298f964672bdc1902f3db6cfd8630c
dmesg: [I forgot to collect this one before rebooting, so the buffer was cleared.]

Run 2: tcmu-runner/1.3.0; log_level=4; cmd_time_out=0

Similar errors. The ZFS panic in question is at Aug 5 01:43:59.

tcmu-runner.log: https://gist.github.com/SidBB/038b60a48ccb2e6a6bda8242edf6e45a
messages: https://gist.github.com/SidBB/44bd6d99ed020e308d893034c109a1b1
dmesg: https://gist.github.com/SidBB/918b66198735bb931181d9eeffb8993c

I also did a run with tcmu-runner/master and cmd_time_out=0, and it showed a similar error. Unfortunately, I had accidentally disabled the debug flags during this run.

(We also do a lot of testing of ZFS on regular disks and don't see these problems there, so I'm inclined to think ZFS is not at fault here.)

We're using the distro-provided kernel 3.10.0-862.2.3.el7.x86_64 on RHEL/CentOS 7.5.

bgly commented 5 years ago

Looking at your latest logs, the first set contains no tcmu-runner errors. The only error I see is the ZFS verify panic due to an I/O error. tcmu-runner still looks like it's processing I/O even after the fact, up until you removed the device, which shows as the cancel of recovery in your first tcmu-runner.log.

I also noticed from skimming your code that you handle iovecs oddly in your file_write/read. You don't seek? Upon calling device.Pwrite or Pread, you probably want to seek (device.seek) prior to the read, and then un-seek if any errors occur. That would remove the iov_temp logic you are doing and allow you to use the tcmu_seek_in_iovec(iov, ret) call that you commented out. Have you run HTX or some other high-stress data integrity test?
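For reference, all tcmu_seek_in_iovec does conceptually is consume the first ret bytes from the front of the iovec array after a short read/write. A rough Python model of that behavior (the function name and return-a-new-list style here are illustrative; the real tcmu-runner helper is C and mutates the struct iovec array in place):

```python
def seek_in_iovec(iov, count):
    """Consume `count` bytes from the front of an iovec-style list of buffers,
    returning the remaining (possibly partial) buffers. A rough model of
    tcmu-runner's tcmu_seek_in_iovec(), which adjusts iov_base/iov_len in place."""
    out = []
    for buf in iov:
        if count >= len(buf):
            count -= len(buf)        # this buffer is fully consumed; drop it
        else:
            out.append(buf[count:])  # partially consumed: keep only the tail
            count = 0
    return out

iov = [b"abcd", b"efgh", b"ij"]
print(seek_in_iovec(iov, 6))  # [b'gh', b'ij']
```

With this, after a partial device.Pwrite returning ret bytes, the handler can just advance the iovec by ret and retry the remainder, instead of maintaining a separate iov_temp copy.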

SidBB commented 5 years ago

Thanks, will discuss with @draghuram about the iovec handling.

We found a bug in our driver that caused the ZFS panic due to checksum errors, so we've fixed that. Additionally, @mikechristie suggested that cmd_time_out may be a factor in the "not ready" errors, so we're retesting with a change to that param to see if it makes a difference.