opensvc / multipath-tools

Other
59 stars 47 forks source link

Multipath aborts discovery of SAS devices lacking SCSI_VENDOR field #56

Closed malventano closed 1 year ago

malventano commented 1 year ago

Some SAS devices (in this case Seagate factory recertified 'white label' drives) may come with the Vendor field blank. This causes Multipath to fail to complete the discovery of those devices. Manually adding associated wwid's to multipath.conf has no effect.

multipath -v9 showing a pair of the suspect SAS devices (note that the process ends after the size check, where normally the next entry is vendor)

Nov 12 23:36:22 | Discover device /sys/devices/pci0000:d7/0000:d7:02.0/0000:da:00.0/host19/port-19:0/expander-19:0/port-19:0:1/expander-19:1/port-19:1:0/end_device-19:1:0/target19:0:1/19:0:1:0/block/sdjp
Nov 12 23:36:22 | 65:304: dev_t not found in pathvec
Nov 12 23:36:22 | sdjp: udev property ID_WWN whitelisted
Nov 12 23:36:22 | sdjp: mask = 0x3f
Nov 12 23:36:22 | sdjp: dev_t = 65:304
Nov 12 23:36:22 | open '/sys/devices/pci0000:d7/0000:d7:02.0/0000:da:00.0/host19/port-19:0/expander-19:0/port-19:0:1/expander-19:1/port-19:1:0/end_device-19:1:0/target19:0:1/19:0:1:0/block/sdjp/size'
Nov 12 23:36:22 | sdjp: size = 31251759104
Nov 12 23:36:22 | Discover device /sys/devices/pci0000:d7/0000:d7:02.0/0000:da:00.0/host19/port-19:0/expander-19:0/port-19:0:1/expander-19:1/port-19:1:1/end_device-19:1:1/target19:0:2/19:0:2:0/block/sdjq
Nov 12 23:36:22 | 65:320: dev_t not found in pathvec
Nov 12 23:36:22 | sdjq: udev property ID_WWN whitelisted
Nov 12 23:36:22 | sdjq: mask = 0x3f
Nov 12 23:36:22 | sdjq: dev_t = 65:320
Nov 12 23:36:22 | open '/sys/devices/pci0000:d7/0000:d7:02.0/0000:da:00.0/host19/port-19:0/expander-19:0/port-19:0:1/expander-19:1/port-19:1:1/end_device-19:1:1/target19:0:2/19:0:2:0/block/sdjq/size'
Nov 12 23:36:22 | sdjq: size = 31251759104

udevadm info output appears normal vs. other SAS devices except the following two fields are missing:

SCSI_VENDOR
SCSI_VENDOR_ENC

smartctl -a -d scsi output from one of the drives:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.64-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, [www.smartmontools.org](http://www.smartmontools.org/)

=== START OF INFORMATION SECTION ===
Vendor:
Product:              OOS16000G
Revision:             OOS1
Compliance:           SPC-5
User Capacity:        16,000,900,661,248 bytes [16.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d776128b
Serial number:        00000PJM0000C1507RXB
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Nov 13 14:04:50 2022 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
mwilck commented 1 year ago

Quoting from the SPC-6, §6.7.2: "The T10 VENDOR IDENTIFICATION field contains eight bytes of left-aligned ASCII data (see 4.3.1) identifying the manufacturer of the logical unit. The T10 vendor identification shall be one assigned by INCITS. A list of assigned T10 vendor identifications is in Annex G and on the T10 web site (http://www.t10.org)."

I daresay this device fails to comply to the most basic SCSI standard (assuming that a sequence of 8 spaces is not an officially assigned vendor id). That's not to say we shouldn't fix it, but it's really a weird corner case.

malventano commented 1 year ago

Totally understood that it is a weird case, but Seagate appears to be doing this (by their own internal policy) for their white-label recertified drives. I've fed back via the reseller that their choice breaks basic functionality that relies on the Vendor field, but who knows if / when that feedback will be addressed.

...but now that drives with no Vendor ID appear to exist in the wild, it has highlighted Multipath's inability to handle such devices.

mwilck commented 1 year ago

Can you please try with this patch?

0001-libmultipath-pathinfo-don-t-fail-for-devices-lacking.patch

malventano commented 1 year ago

I returned the issue drives for other models, but I know a few others who have some (they are not currently using multipath). Will spread the word and/or try and pick up a single for testing.

mwilck commented 1 year ago

@malventano, I fail to parse your response. Does this mean the patch worked for you, or that you couldn't test it because you don't use multipath?

malventano commented 1 year ago

I returned the drives that were not reporting vendor ID (swapped them for a different model), but I am working with someone else who has some and can test. Worst case I'll get my hands on one to retest. More to follow :)

malventano commented 1 year ago

Update with the patch:

[root@archlinux ~]# multipath -ll
53.526500 | /etc/multipath.conf line 24, "bindings_file" is deprecated and will be disabled in a future release
53.526543 | /etc/multipath.conf line 25, "wwids_file" is deprecated and will be disabled in a future release
53.526550 | /etc/multipath.conf line 26, "prkeys_file" is deprecated and will be disabled in a future release
53.533269 | sdb: broken device without vendor ID
53.537700 | sdd: broken device without vendor ID
53.546100 | sdb: broken device without vendor ID
53.546944 | sdd: broken device without vendor ID
mpatha (35000c500cb25a62b) dm-1 UNKNOWN,OOS18000G
size=16T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 2:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:3:0 sdd 8:48 active ready running

...so the patch allowed the addition of drives lacking vendor ID (did note the 'broken device' warnings are appearing 2x though). Unsure if we did something wrong with the build to trigger the warnings for bindings_file / wwids_file / prkeys_file, or are those actually being depreciated?

bmarzins commented 1 year ago

Those files really are deprecated. So if they are in your config file, then this is expected.

malventano commented 1 year ago

Those files really are deprecated. So if they are in your config file, then this is expected.

Are these files not the primary way that multipath tracks added devices? If they are depreciated, why are they still a part of the default config? What new method is being used to save/track previously added devices?

mwilck commented 1 year ago

Are these files not the primary way that multipath tracks added devices?

Yes. What's deprecated is configuring the path to these files in multipath.conf. The paths are set at compile time now.

If they are depreciated, why are they still a part of the default config?

Not sure what you mean. multipath has long stopped shipping with a "default config". More often than not, you're well off with an empty config file. The manual page states that these settings are deprecated.

I can see that, if multipath conf is generated with multipath -t, these options would be included, causing error messages later on. This is something we should fix eventually.

...so the patch allowed the addition of drives lacking vendor ID (did note the 'broken device' warnings are appearing 2x though).

That's how it is. We're not going to implement a per-device warning counter just for these broken devices.

mwilck commented 1 year ago

Correction:

I can see that, if multipath conf is generated with multipath -t, these options would be included

If you read the message it says that they "will be disabled in a future release". When that happens, we will also stop printing them with "multipath -t".

mwilck commented 1 year ago

I posted the patch above to dm-devel today, "libmultipath: pathinfo: don't fail for devices lacking INQUIRY properties"

malventano commented 1 year ago

Not sure what you mean.

I was reading those warning messages:

53.526500 | /etc/multipath.conf line 24, "bindings_file" is deprecated and will be disabled in a future release
53.526543 | /etc/multipath.conf line 25, "wwids_file" is deprecated and will be disabled in a future release
53.526550 | /etc/multipath.conf line 26, "prkeys_file" is deprecated and will be disabled in a future release

...to mean that the files themselves were depreciated. If I'm reading your response correctly, it is the path configurability that has been depreciated, which I have no argument against :).

More often than not, you're well off with an empty config file.

Totally agree that an empty config is ideal save a few customization entries as applicable, but the standard practice I've seen nearly everywhere is to generate a 'default' config with multipath -t and go from there. The man suggests -T for generating a new multipath.conf - does -T currently omit those depreciated variables while -t leaves them in?

When that happens, we will also stop printing them with "multipath -t".

It may be beneficial to omit those values from -t sooner rather than later so new multipath users are not tempted to adopt a depreciated feature.

We're not going to implement a per-device warning counter just for these broken devices.

My point was that the warning is appearing twice per path in rapid succession (so 4x for a single device in this case). Is it expected to process each path 2x for a single multipath -ll command issued?

mwilck commented 1 year ago

The man suggests -T for generating a new multipath.conf - does -T currently omit those depreciated variables while -t leaves them in?

No. There's no difference between '-t and -T in this respect. They differ only in the way they print device properties. Thinking about it, the correct way to "solve" this issue will be to deprecate these options for good.

My point was that the warning is appearing twice per path in rapid succession (so 4x for a single device in this case). Is it expected to process each path 2x for a single multipath -ll command issued?

Sort of. The pathinfo() function is called very frequently in the multipath code, to retrieve various aspects of a device. And the vendor is considered a very basic property that's always fetched, which causes the message every time. If you want to create a patch for this issue, go ahead. It remains a corner case from my perspective.

mwilck commented 1 year ago

This should be fixed by 88d46ea, on https://github.com/openSUSE/multipath-tools/tree/queue