libstorage / libstoragemgmt

A library for storage management
https://libstorage.github.io/libstoragemgmt-doc/
GNU Lesser General Public License v2.1

SCSI INQUIRY length of 0xFFFF is fatal to Areca RAID controllers; 0x4000 is the max #442

Closed dmick closed 3 years ago

dmick commented 3 years ago

We have several lab systems using Areca 1680 controllers:

Areca Technology Corp. ARC-1680 series PCIe to SAS/SATA 3Gb RAID Controller [17d3:1680]

These are quite old, and the firmware is old as well (although updated to the current level), but we're using them daily in a high-I/O-load environment (for Ceph OSDs). A recent update to Ceph caused the machines to stop performing I/O to any drive on the card (which includes the system drive); see https://tracker.ceph.com/issues/48270 for gory details. The end result is that if one submits a SCSI INQUIRY command with an ALLOCATION LENGTH (using the spec's capitalization and names) of > 0x4000, the controller apparently misbehaves. It seems to overwrite its buffer somehow (as you can see from the tracker above, one manifestation was corrupted low physical memory). It almost always results in a SCSI bus timeout/reset, and the system usually does not recover.

libstoragemgmt's _sg_io_vpd() starts by requesting page 0, or "supported pages", and does so with length _SG_T10_SPC_VPD_MAX_LEN, which is also _SG_T10_SPC_INQUIRY_MAX_LEN, which is 0xFFFF. This is a legal length according to the SCSI spec, but it is this length that causes the controller to lose its mind.

I'm not certain of the right answer to this problem, but it might be that dropping the max length for inquiry commands to no more than 0x4000 would make the library more forgiving of such broken devices without causing any other loss of functionality. I believe the required length for the data being retrieved for page 0, at least, is 256 (one byte for every possible supported page code).
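To make the proposal concrete, here is a minimal sketch of building the 6-byte INQUIRY CDB (EVPD form) with the ALLOCATION LENGTH clamped. The names `build_vpd_inquiry_cdb` and `SAFE_INQUIRY_MAX_LEN` are hypothetical and are not libstoragemgmt identifiers; the 0x4000 cap is simply the value suggested above, not a verified safe limit.

```c
#include <stddef.h>
#include <stdint.h>

#define INQUIRY_OPCODE       0x12u
#define INQUIRY_CDB_LEN      6
/* Hypothetical cap: the limit proposed in this report, not a verified value. */
#define SAFE_INQUIRY_MAX_LEN 0x4000u

/*
 * Fill a 6-byte INQUIRY CDB that requests VPD page `page_code`.
 * `alloc_len` is clamped to SAFE_INQUIRY_MAX_LEN.
 * Returns the ALLOCATION LENGTH actually placed in the CDB.
 */
static uint16_t build_vpd_inquiry_cdb(uint8_t cdb[INQUIRY_CDB_LEN],
                                      uint8_t page_code, uint16_t alloc_len)
{
    if (alloc_len > SAFE_INQUIRY_MAX_LEN)
        alloc_len = SAFE_INQUIRY_MAX_LEN;

    cdb[0] = INQUIRY_OPCODE;
    cdb[1] = 0x01;                        /* EVPD bit set */
    cdb[2] = page_code;                   /* PAGE CODE */
    cdb[3] = (uint8_t)(alloc_len >> 8);   /* ALLOCATION LENGTH, MSB */
    cdb[4] = (uint8_t)(alloc_len & 0xff); /* ALLOCATION LENGTH, LSB */
    cdb[5] = 0x00;                        /* CONTROL */
    return alloc_len;
}
```

The data buffer handed to the kernel would need to be sized to the returned length rather than to the caller's original request.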

We can probably arrange for access to the hardware for testing.

dmick commented 3 years ago

Oh, I should mention that I can also reproduce the symptom with the 'sg_vpd' command-line program. It allows setting the max length: 0x4000 is safe, while 0x4001 is fatal. (This is how I found the limit.)

tasleson commented 3 years ago

@dmick Thank you for reporting this. From reading the Ceph issue it looks like they are providing some controls to disable. We can certainly reduce our size and maybe add an environment variable to allow it to be changed if needed in certain situations too.
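A rough sketch of what such an override could look like follows. Everything named here is an assumption for illustration: the variable name `LSM_INQUIRY_MAX_LEN`, the helper names, and the 0x4000 default are hypothetical and do not come from libstoragemgmt.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_INQUIRY_MAX_LEN 0x4000u /* hypothetical reduced default */
#define SPC_INQUIRY_MAX_LEN     0xFFFFu /* largest 16-bit ALLOCATION LENGTH */

/* Parse a user-supplied length; fall back to the default on any bad input. */
static uint16_t parse_inquiry_max_len(const char *val)
{
    char *end = NULL;
    unsigned long n;

    if (val == NULL || *val == '\0')
        return DEFAULT_INQUIRY_MAX_LEN;

    errno = 0;
    n = strtoul(val, &end, 0); /* base 0 accepts "0x4000" and "16384" */
    if (errno != 0 || *end != '\0' || n == 0 || n > SPC_INQUIRY_MAX_LEN)
        return DEFAULT_INQUIRY_MAX_LEN;
    return (uint16_t)n;
}

/* Read the override from the environment (variable name is hypothetical). */
static uint16_t inquiry_max_len_from_env(void)
{
    return parse_inquiry_max_len(getenv("LSM_INQUIRY_MAX_LEN"));
}
```

Falling back to the reduced default on malformed input keeps a typo in the environment from silently restoring the 0xFFFF behaviour.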

tasleson commented 3 years ago

@dmick Please let me know what would work best for you to test this change.

dmick commented 3 years ago

Further testing with a hacked library seems to be showing that 0x4000 isn't quite right yet. Looking deeper with guard values in the buffer shows the controller apparently transferring an excess of 0 bytes. I've got a few more experiments to try and will share my method when I get a crisper conclusion to see if you agree.
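For readers unfamiliar with the guard-value technique mentioned here, a minimal sketch is below. It is not the actual test harness used in this investigation; `guard_arm`, `guard_clobbered`, and the 0xA5 pattern are hypothetical. It also assumes any overrun lands in memory directly after the buffer, which depends on how the kernel maps the transfer, so an intact guard does not prove the device behaved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_BYTE 0xA5u
#define GUARD_LEN  64u

/*
 * The caller allocates data_len + GUARD_LEN bytes. Only the first data_len
 * bytes are offered to the device; the trailing GUARD_LEN bytes are the guard.
 */
static void guard_arm(uint8_t *buf, size_t data_len)
{
    memset(buf, 0, data_len);
    memset(buf + data_len, GUARD_BYTE, GUARD_LEN);
}

/* Count guard bytes that no longer hold GUARD_BYTE after the command ran. */
static size_t guard_clobbered(const uint8_t *buf, size_t data_len)
{
    size_t i, n = 0;

    for (i = 0; i < GUARD_LEN; i++) {
        if (buf[data_len + i] != GUARD_BYTE)
            n++;
    }
    return n;
}
```

Call `guard_arm()` before issuing the INQUIRY and `guard_clobbered()` afterwards; a non-zero count means something wrote past the advertised length.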

dmick commented 3 years ago

So I am simply not sure what exactly is going wrong; I've tried lengths all the way down to 0x400 without success so far. So I started examining the sg_inq algorithm; it looks like it sends inquiry commands with a 36-byte length to start, and then looks at the return to calculate a full length and sends another inquiry. I'm going to see if I can adapt this to libstoragemgmt's algorithm and see if it makes it more reliable.
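The two-step approach described above can be sketched as follows, assuming the standard INQUIRY data layout from SPC, where byte 4 is ADDITIONAL LENGTH and counts the bytes that follow it. The helper name is hypothetical, and this is not sg_inq's or libstoragemgmt's code.

```c
#include <stddef.h>
#include <stdint.h>

#define STD_INQUIRY_FIRST_LEN 36u /* length of the initial, short request */

/*
 * Given the response to the initial 36-byte standard INQUIRY, return the
 * full response length the device reports: ADDITIONAL LENGTH (byte 4) + 5.
 * Returns 0 if the response is too short to contain byte 4.
 */
static size_t std_inquiry_full_len(const uint8_t *resp, size_t resp_len)
{
    if (resp == NULL || resp_len < 5)
        return 0;
    return (size_t)resp[4] + 5;
}
```

A caller would reissue the INQUIRY with that length only when it exceeds `STD_INQUIRY_FIRST_LEN`, so the device is never asked for more than it says it has.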

dmick commented 3 years ago

Also I should note that some of the ID information returned by the card actually calls it an "ARC-1222". I imagine 1680 is the compatible series.

dmick commented 3 years ago

(and of course really I mean the EVPD bit form of inquiry, which sg_inq sends with a length of 252, commenting "largest one-byte count divisible by 4")
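For the EVPD form, the same idea can be sketched like this, assuming the common VPD page header in which bytes 2 and 3 hold a big-endian PAGE LENGTH that counts the bytes after the 4-byte header. The helper names are hypothetical; only the 252-byte first request comes from the sg_inq behaviour described above.

```c
#include <stddef.h>
#include <stdint.h>

#define VPD_FIRST_ALLOC_LEN 252u /* first-request length attributed to sg_inq */
#define VPD_HEADER_LEN      4u

/* Full page length reported by the device: PAGE LENGTH (bytes 2..3) + 4. */
static size_t vpd_page_full_len(const uint8_t *resp, size_t resp_len)
{
    if (resp == NULL || resp_len < VPD_HEADER_LEN)
        return 0;
    return (((size_t)resp[2] << 8) | resp[3]) + VPD_HEADER_LEN;
}

/* Non-zero when the page did not fit in the first 252-byte request. */
static int vpd_needs_second_request(const uint8_t *resp, size_t resp_len)
{
    return vpd_page_full_len(resp, resp_len) > VPD_FIRST_ALLOC_LEN;
}
```

With this scheme the large allocation length is only ever sent when the device itself has reported a page that long.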

dmick commented 3 years ago

See https://github.com/dmick/libstoragemgmt/commit/5482b545aa44ceca034b63149404d45f0d9ec9ba for a fix inspired by sg3_utils' sg_inq. I want to try to run a full functional test against one of these controllers if I can figure out how to do it, but please comment (I can make it a PR if you like).

tasleson commented 3 years ago

> See dmick@5482b54 for a fix inspired by sg3_utils' sg_inq. I want to try to run a full functional test against one of these controllers if I can figure out how to do it, but please comment (I can make it a PR if you like).

Please put in a PR, thanks

tasleson commented 3 years ago

Resolved with https://github.com/libstorage/libstoragemgmt/pull/444