open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0

Looking for some info on 1.4.0 release #449

Open draghuram opened 5 years ago

draghuram commented 5 years ago

Hi, I see that 1.4.0-rc1 was released in May. Are there any details on when 1.4.0 might be released? We are currently using 1.3.0 and seeing errors such as these in /var/log/messages:

kernel: sd 3:0:1:1: [sdg] tag#24 Sense Key : Not Ready [current] [descriptor] 
kernel: sd 3:0:1:1: [sdg] tag#12 CDB: Write(16) 8a 00 00 00 00 00 01 e7 51 80 00 00 00 80 00 00
kernel: sd 3:0:1:1: [sdg] tag#8 Add. Sense: Logical unit communication failure
kernel: sd 3:0:1:1: [sdg] tag#24 Add. Sense: Logical unit communication failure
kernel: sd 3:0:1:1: [sdg] tag#8 CDB: Write(16) 8a 00 00 00 00 00 01 e7 4f 00 00 00 00 80 00 00
kernel: blk_update_request: I/O error, dev sdg, sector 31936256

I verified that the request for this sector never made it to our handler, and a quick search brings up a couple of links that point to possible race conditions in TCMU (not sure whether in the kernel or in tcmu-runner). In any case, I want to try a later version of tcmu-runner to see if these types of errors go away. We are using "CentOS Linux release 7.5.1804".
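For reference, the failing sector can be cross-checked against the CDB in the log above: in a WRITE(16) CDB, bytes 2-9 are the big-endian LBA and bytes 10-13 are the transfer length. A quick decode (a sketch in Python; the CDB string is copied from the tag#8 line):

```python
# Decode the WRITE(16) CDB from the "tag#8" kernel log line above.
cdb = bytes.fromhex("8a 00 00 00 00 00 01 e7 4f 00 00 00 00 80 00 00".replace(" ", ""))

assert cdb[0] == 0x8A                       # WRITE(16) opcode
lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
nblocks = int.from_bytes(cdb[10:14], "big") # bytes 10-13: transfer length

print(lba, nblocks)  # 31936256 128 -- matches "I/O error, dev sdg, sector 31936256"
```

So the failed write is the exact command from the tag#8 line, which makes it easy to confirm whether a given LBA ever reached the handler.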

Any help is greatly appreciated.

Thanks, Raghu

mikechristie commented 5 years ago

I am trying to finalize the release this month. Red Hat QA just started testing it.

What is the race you are referring to? I have never heard of it and there are no new fixes for something like that in the next release.

mikechristie commented 5 years ago

The error you are seeing on the initiator side in the log above is a generic error code that occurs when:

  1. The handler has not set a known error code, or the handler is passing in that sense code.
  2. You are using an old kernel that ignores the sense/status code passed in from target_core_user and/or tcmu-runner.
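For what it's worth, the message in the initiator log maps directly onto SCSI fixed-format sense data: the low nibble of byte 2 is the sense key ("Not Ready" is 0x2) and bytes 12/13 are the ASC/ASCQ ("logical unit communication failure" is 0x08/0x00). A small sketch, where the sense buffer is hypothetical and built to match the log, not captured from a real device:

```python
# Hypothetical fixed-format sense buffer matching the initiator log:
# sense key 0x2 (Not Ready), ASC/ASCQ 0x08/0x00 (logical unit communication failure).
sense = bytes([0x70, 0x00, 0x02, 0, 0, 0, 0, 0x0A,   # response code; key in byte 2
               0, 0, 0, 0, 0x08, 0x00, 0, 0, 0, 0])  # ASC in byte 12, ASCQ in byte 13

key = sense[2] & 0x0F
asc, ascq = sense[12], sense[13]
print(hex(key), hex(asc), hex(ascq))  # 0x2 0x8 0x0
```

This 0x08/0x00 combination is the generic code referred to above, which is why it doesn't by itself identify where the failure happened.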
draghuram commented 5 years ago

I was referring to this: https://bugzilla.redhat.com/show_bug.cgi?id=1482376

The reason I suspected the issue might not be in the handler was that the handler never received requests for these sectors, but I will double-check. We are using "CentOS Linux release 7.5.1804", which seems to be the latest.

Thanks.

mikechristie commented 5 years ago

That bz does not seem to have anything to do with tcmu-runner as far as I can tell. I think it is a gluster-specific bug.

There is a bunch of private gluster discussion on that bz. Are you using the glfs tcmu-runner handler? If so, I will cc them here so we can get more info on the issue and what the resolution is. I think the resolution is basically upgrading to their new release, which is not done yet.

draghuram commented 5 years ago

Thanks for the response. We are not using gluster but wrote our own handler. I just wanted to make sure the issue is not related to tcmu-runner.

mikechristie commented 5 years ago

The problem is that that error code is a generic one used in many places in some versions of LIO. To narrow it down, could you give me the following:

  1. If you can trigger the problem without doing a lot of IO, enable SCSI command debugging by setting:

log_level = 5

in /etc/tcmu/tcmu.conf. If it takes a lot of IO, can you just set

log_level = 4

and then attach /var/log/tcmu-runner.log.

  2. Enable target-side debugging:

echo "module target_core_user +p" > /sys/kernel/debug/dynamic_debug/control
echo "module target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control

and then, if you are using the iscsi target module, do:

echo "module iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control

and then also attach the kernel logs.

  3. Tell me the kernel version you are using.

  4. If you can attach your tcmu-runner handler code, it would be really helpful. If not, could you just add some log messages in your write() callback, or in your handle_cmd() if you use that to handle WRITEs?

  5. Could you also describe when this is happening? Are you doing anything special like failover? Are you using ALUA by any chance?

SidBB commented 5 years ago

We're creating a couple of 512 TB devices, creating a ZFS filesystem on each, and then writing a lot of data to both devices in parallel. These I/O errors typically show up after about 30-40 GB have been written to each, but in some cases in the past they've occurred much earlier.

Run 1: tcmu-runner/master; log_level=5; cmd_time_out=30

The syslog shows a ZFS checksum verification panic at Aug 4 20:02:46, an indication of an I/O error.

tcmu-runner.log: https://gist.github.com/SidBB/7edb57b22567461c00b7935d25b0a3f5
messages: https://gist.github.com/SidBB/53298f964672bdc1902f3db6cfd8630c
dmesg: [I forgot to collect this one before rebooting, so the buffer was cleared.]

Run 2: tcmu-runner/1.3.0; log_level=4; cmd_time_out=0

Similar errors. The ZFS panic in question is at Aug 5 01:43:59.

tcmu-runner.log: https://gist.github.com/SidBB/038b60a48ccb2e6a6bda8242edf6e45a
messages: https://gist.github.com/SidBB/44bd6d99ed020e308d893034c109a1b1
dmesg: https://gist.github.com/SidBB/918b66198735bb931181d9eeffb8993c

I also did a run with tcmu-runner/master and cmd_time_out=0, and it showed a similar error. Unfortunately, I had accidentally disabled the debug flags during this run.

(We also do a lot of testing of ZFS on regular disks and don't see these problems there, so I'm inclined to think ZFS is not at fault here.)

We're using the distro-provided kernel 3.10.0-862.2.3.el7.x86_64 on RHEL/CentOS 7.5.

bgly commented 5 years ago

Looking at your latest logs, the first set contains no tcmu-runner errors. The only error I see is the ZFS verify panic due to an I/O error. tcmu-runner still looks like it's processing I/O even after the fact, up until you removed the device, which shows as the cancel of recovery in your first tcmu-runner.log.

I also noticed from skimming your code that you handle iovecs oddly in your file_write/read. You don't seek? Upon calling device.Pwrite or Pread, you probably want to seek (device.seek) prior to the read, and then un-seek if any errors occur. That would remove the iov_temp logic you are doing and allow you to use the tcmu_seek_in_iovec(iov, ret) call that you commented out. Have you run HTX or some other high-stress data integrity test?
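For reference, all tcmu_seek_in_iovec does conceptually is consume the first ret bytes from the front of the iovec array after a short read/write. A rough Python model of that behavior (the function name and return-a-new-list style here are illustrative; the real tcmu-runner helper is C and mutates the struct iovec array in place):

```python
def seek_in_iovec(iov, count):
    """Consume `count` bytes from the front of an iovec-style list of buffers,
    returning the remaining (possibly partial) buffers. A rough model of
    tcmu-runner's tcmu_seek_in_iovec(), which adjusts iov_base/iov_len in place."""
    out = []
    for buf in iov:
        if count >= len(buf):
            count -= len(buf)        # this buffer is fully consumed; drop it
        else:
            out.append(buf[count:])  # partially consumed: keep only the tail
            count = 0
    return out

iov = [b"abcd", b"efgh", b"ij"]
print(seek_in_iovec(iov, 6))  # [b'gh', b'ij']
```

With this, after a partial device.Pwrite returning ret bytes, the handler can just advance the iovec by ret and retry the remainder, instead of maintaining a separate iov_temp copy.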

SidBB commented 5 years ago

Thanks, will discuss with @draghuram about the iovec handling.

We found a bug in our driver that caused the ZFS panic due to checksum errors, so we've fixed that. Additionally, @mikechristie suggested that cmd_time_out may be a factor in the "not ready" errors, so we're retesting with a change to that param to see if it makes a difference.