openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zpool status adds a "1" to WWN when multipath enabled #8214

Open stuartthebruce opened 5 years ago

stuartthebruce commented 5 years ago

System information

Type                  Version/Name
Distribution Name     Scientific Linux
Distribution Version  7.5
Linux Kernel          3.10.0-957.1.3.el7
Architecture          x86_64
ZFS Version           0.7.12-1
SPL Version           0.7.12-1

Describe the problem you're observing

After enabling multipath with "user_friendly_names no" in multipath.conf, devices whose WWN ends in a "c" now show an extra "1" in the output of zpool status.

Before enabling multipath,

    NAME                        STATE     READ WRITE CKSUM
    jbod1-node806-data1         ONLINE       0     0     0
      raidz1-0                  ONLINE       0     0     0
        wwn-0x5000cca253077224  ONLINE       0     0     0
        wwn-0x5000cca253077640  ONLINE       0     0     0
        wwn-0x5000cca25308c90c  ONLINE       0     0     0
        wwn-0x5000cca25308e49c  ONLINE       0     0     0
        wwn-0x5000cca25308e95c  ONLINE       0     0     0
        wwn-0x5000cca2530c2410  ONLINE       0     0     0
        wwn-0x5000cca2530ca5ac  ONLINE       0     0     0
        wwn-0x5000cca2530e04d8  ONLINE       0     0     0
        wwn-0x5000cca2530e8568  ONLINE       0     0     0
        wwn-0x5000cca2530e8598  ONLINE       0     0     0

After enabling multipath,

        NAME                    STATE     READ WRITE CKSUM
        jbod1-node806-data1     ONLINE       0     0     0
          raidz1-0              ONLINE       0     0     0
            35000cca253077224   ONLINE       0     0     0
            35000cca253077640   ONLINE       0     0     0
            35000cca25308c90c1  ONLINE       0     0     0
            35000cca25308e49c1  ONLINE       0     0     0
            35000cca25308e95c1  ONLINE       0     0     0
            35000cca2530c2410   ONLINE       0     0     0
            35000cca2530ca5ac1  ONLINE       0     0     0
            35000cca2530e04d8   ONLINE       0     0     0
            35000cca2530e8568   ONLINE       0     0     0
            35000cca2530e8598   ONLINE       0     0     0

errors: No known data errors

Describe how to reproduce the problem

On a system with an existing zpool, enable multipath via:

mpathconf --user_friendly_names n --with_multipathd y
shutdown -r now
(wait for reboot)
zpool import
zpool status
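
For reference, a minimal /etc/multipath.conf matching that mpathconf invocation might look like the sketch below (an assumption on my part; mpathconf writes distribution-specific defaults, so the find_multipaths line in particular may differ on your system):

    defaults {
        user_friendly_names no     # keep WWN-based map names like 35000cca2530e297c
        find_multipaths     yes    # typical default written by mpathconf on EL7
    }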


devZer0 commented 5 years ago

What does "multipath -l -v1" show?

Maybe the device names are represented under /dev somewhere; can you have a look at how they appear there?

Just to see whether this is a multipath or a ZFS issue.

stuartthebruce commented 4 years ago

That doesn't show any trailing characters. For example, on an SL7.7 system running ZFS 0.7.13:

[root@node810 ~]# zpool status jbod2-node810-data1
  pool: jbod2-node810-data1
 state: ONLINE
  scan: scrub repaired 0B in 36h29m with 0 errors on Wed May  8 22:18:55 2019
config:

    NAME                    STATE     READ WRITE CKSUM
    jbod2-node810-data1     ONLINE       0     0     0
      raidz1-0              ONLINE       0     0     0
        35000cca2530aa110   ONLINE       0     0     0
        35000cca2530aa424   ONLINE       0     0     0
        35000cca2530aacb4   ONLINE       0     0     0
        35000cca2530e297c1  ONLINE       0     0     0
        35000cca2530f661c1  ONLINE       0     0     0
        35000cca253100b68   ONLINE       0     0     0
        35000cca253123c08   ONLINE       0     0     0
        35000cca253158878   ONLINE       0     0     0

errors: No known data errors

compared to,

[root@node810 ~]# multipath -l -v1 | grep -C5 35000cca2530e297c
35000cca2531d7500
35000cca2531e6404
35000cca2530a6f3c
35000cca2530a1140
35000cca2531e868c
35000cca2530e297c
35000cca253032d58
35000cca2530925bc
35000cca253178e50
35000cca2530a63f4
35000cca2531e5f6c
stuartthebruce commented 4 years ago

And here is what I find for one of the above WWNs under /dev,

[root@node810 ~]# find /dev -name "*35000cca2530e297c*"
/dev/disk/by-id/scsi-35000cca2530e297c
/dev/disk/by-id/dm-uuid-part9-mpath-35000cca2530e297c
/dev/disk/by-id/dm-name-35000cca2530e297c9
/dev/disk/by-id/dm-uuid-part1-mpath-35000cca2530e297c
/dev/disk/by-id/dm-name-35000cca2530e297c1
/dev/disk/by-id/dm-uuid-mpath-35000cca2530e297c
/dev/disk/by-id/dm-name-35000cca2530e297c
/dev/mapper/35000cca2530e297c9
/dev/mapper/35000cca2530e297c1
/dev/mapper/35000cca2530e297c

where the extra "1" presumably comes from the partition table,

[root@node810 ~]# fdisk -l /dev/mapper/35000cca2530e297c
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/mapper/35000cca2530e297c: 12000.1 GB, 12000138625024 bytes, 23437770752 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: 17DE1481-B0B3-DF4B-88F3-82E2413936D8

#         Start          End    Size  Type            Name
 1         2048  23437752319   10.9T  Solaris /usr &  zfs-16f1ad70fd6ed32f
 9  23437752320  23437768703      8M  Solaris reserve
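
A quick way to double-check that the "...c1" and "...c9" names are just device-mapper maps of those two GPT partitions (commands only, using the example device above; output omitted here):

    # show what the suffixed map is built on top of, and its linear mapping
    dmsetup deps 35000cca2530e297c1
    dmsetup table 35000cca2530e297c1

    # lsblk shows the partition maps nested under the parent multipath device
    lsblk /dev/mapper/35000cca2530e297c
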
stuartthebruce commented 4 years ago

FWIW, I updated to SL7.7 and ZFS 0.8.3, and it shows the same behavior of sometimes displaying the partition number for WWNs that end in a "c":

[root@node806 ~]# uname -a
Linux node806 3.10.0-1062.7.1.el7.x86_64 #1 SMP Thu Dec 5 14:45:00 CST 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@node806 ~]# rpm -q zfs
zfs-0.8.3-1.el7.x86_64

[root@node806 ~]# zpool status
  pool: jbod1-node806-data1
 state: ONLINE
  scan: none requested
config:

    NAME                    STATE     READ WRITE CKSUM
    jbod1-node806-data1     ONLINE       0     0     0
      raidz1-0              ONLINE       0     0     0
        35000cca253077224   ONLINE       0     0     0
        35000cca253077640   ONLINE       0     0     0
        35000cca25308c90c1  ONLINE       0     0     0
        35000cca25308e49c1  ONLINE       0     0     0
        35000cca25308e95c1  ONLINE       0     0     0
        35000cca2530c2410   ONLINE       0     0     0
        35000cca2530ca5ac1  ONLINE       0     0     0
        35000cca2530e04d8   ONLINE       0     0     0
        35000cca2530e8568   ONLINE       0     0     0
        35000cca2530e8598   ONLINE       0     0     0

errors: No known data errors
stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

griznog commented 6 months ago

Hi,

I'm seeing this with:

[root@storage-seq-1 ~]# rpm -qa | grep zfs | sort
kmod-zfs-2.1.13-1.el8.x86_64
libzfs5-2.1.13-1.el8.x86_64
zfs-2.1.13-1.el8.x86_64
[root@storage-seq-1 ~]# uname -a
Linux storage-seq-1 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@storage-seq-1 ~]# cat /etc/rocky-release
Rocky Linux release 8.8 (Green Obsidian)

Example:

  pool: jbodpool
 state: ONLINE
  scan: resilvered 1.82T in 04:39:13 with 0 errors on Fri Mar 22 18:21:28 2024
config:

    NAME                    STATE     READ WRITE CKSUM
    jbodpool                ONLINE       0     0     0
      draid3:8d:102c:6s-0   ONLINE       0     0     0
        35000cca2be7567b8   ONLINE       0     0     0
        35000cca2be4f696c1  ONLINE       0     0     0
        35000cca2be0f2454   ONLINE       0     0     0
        35000cca2be1eb180   ONLINE       0     0     0
        35000cca2be760f1c1  ONLINE       0     0     0
        35000cca2be760f44   ONLINE       0     0     0
        35000cca2be4f6b44   ONLINE       0     0     0
        35000cca2be1ed5a0   ONLINE       0     0     0
        35000cca2be1def80   ONLINE       0     0     0
        35000cca2be1d4760   ONLINE       0     0     0
        35000cca2be1dff68   ONLINE       0     0     0
        35000cca2be1e5708   ONLINE       0     0     0
        35000cca2be1e9644   ONLINE       0     0     0
        35000cca2be1e2980   ONLINE       0     0     0
        35000cca2be4f773c1  ONLINE       0     0     0
        35000cca2be4f74a4   ONLINE       0     0     0
        35000cca2be760dac1  ONLINE       0     0     0
        35000cca2be756724   ONLINE       0     0     0
        35000cca2be0f66bc1  ONLINE       0     0     0
        35000cca2be02ab7c1  ONLINE       0     0     0
        35000cca2be5f506c1  ONLINE       0     0     0

Example from /dev/mapper showing how WWNs that end in c get a 1 and 9 appended for the partitions, rather than a p1 and p9:

lrwxrwxrwx  1 root root       9 Mar 22 13:41 35000cca2be7b2f9c -> ../dm-849
lrwxrwxrwx  1 root root       9 Mar 22 13:42 35000cca2be7b2f9c1 -> ../dm-853
lrwxrwxrwx  1 root root       9 Mar 22 13:40 35000cca2be7b2f9c9 -> ../dm-854
lrwxrwxrwx  1 root root       8 Mar 22 13:40 35000cca2be7c21c4 -> ../dm-17
lrwxrwxrwx  1 root root       8 Mar 22 13:42 35000cca2be7c21c4p1 -> ../dm-24
lrwxrwxrwx  1 root root       8 Mar 22 13:40 35000cca2be7c21c4p9 -> ../dm-28

I don't actually understand why those 1 and 9 mappings are even present. On two similar servers they don't show up. The difference in my case is that the server with this problem had the pool set up before multipath was enabled.
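
If it helps narrow this down, kpartx (the helper that creates these partition maps for multipath devices) can list the maps it would generate without creating them, and as far as I can tell the delimiter choice depends on how the parent name ends, which matches the pattern above (example names from the listing above, output omitted):

    # parent name ends in a letter -> partition maps get a bare 1 / 9 suffix
    kpartx -l /dev/mapper/35000cca2be7b2f9c

    # parent name ends in a digit -> partition maps get a p1 / p9 suffix
    kpartx -l /dev/mapper/35000cca2be7c21c4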

griznog commented 6 months ago

More testing: if I add

    skip_kpartx  yes

to my /etc/multipath.conf, multipathd no longer creates the mapped partition devices. However, zpool import now fails with all devices UNAVAIL:

[root@storage-seq-1 ~]# zpool import -d /dev/mapper
   pool: jbodpool
     id: 11481882034934482336
  state: UNAVAIL
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

    jbodpool                UNAVAIL  insufficient replicas
      draid3:8d:102c:6s-0   UNAVAIL  insufficient replicas
        35000cca2be7567b8   UNAVAIL
        35000cca2be4f696c1  UNAVAIL
        35000cca2be0f2454   UNAVAIL
        35000cca2be1eb180   UNAVAIL
        35000cca2be760f1c1  UNAVAIL
        35000cca2be760f44   UNAVAIL
        35000cca2be4f6b44   UNAVAIL
        35000cca2be1ed5a0   UNAVAIL
        35000cca2be1def80   UNAVAIL
        35000cca2be1d4760   UNAVAIL
        35000cca2be1dff68   UNAVAIL
        35000cca2be1e5708   UNAVAIL
        35000cca2be1e9644   UNAVAIL
        35000cca2be1e2980   UNAVAIL
        35000cca2be4f773c1  UNAVAIL
        35000cca2be4f74a4   UNAVAIL
        35000cca2be760dac1  UNAVAIL
        35000cca2be756724   UNAVAIL
        35000cca2be0f66bc1  UNAVAIL
        35000cca2be02ab7c1  UNAVAIL
        35000cca2be5f506c1  UNAVAIL
        35000cca2be1e31d8   UNAVAIL
...

The pool should have 306 drives, but only one of the three draid3 vdevs shows up, and within it only a single drive appears as ONLINE.

rincebrain commented 6 months ago

Yes, because it's the partition devices that are in the pool, it just hides the "-part1".
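
If you want to see it, zpool status -P prints the full vdev paths instead of the trimmed names, so the partition device ZFS actually opened should become visible (pool name taken from your output above):

    zpool status -P jbodpool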

griznog commented 6 months ago

Yes, because it's the partition devices that are in the pool, it just hides the "-part1".

Why do the pools I created directly on multipath mapped devices not do this? Does ZFS do something different when given a multipath device?

rincebrain commented 6 months ago

I'm not aware of ZFS having any multipath-specific code.

I'm not sure I understand the question.

griznog commented 6 months ago

As I noted above, I have numerous similar servers where the pool was created after multipath was enabled; none of those exhibit this behavior, nor does multipath map devices for the partitions. This problematic server had the pool created before multipath was enabled, and it now maps the 1 and 9 partitions as devices.

I thought that by adding skip_kpartx and preventing multipath from mapping partitions to devices, I'd force the non-mapped-partition behavior, but that just results in a pool that can't be imported.

So, in light of the observed behavior, I'm asking: does ZFS do anything differently when creating a new pool on multipath-mapped devices than it does with regular single-path devices?

griznog commented 6 months ago

And I think I answered my own question: a disk in a pool created on multipath-mapped devices results in:

[root@storage-odb2-1 ~]# blkid /dev/mapper/35000cca2a601d650
/dev/mapper/35000cca2a601d650: LABEL="datapool" UUID="12922485268461784626" UUID_SUB="6113558819263140911" TYPE="zfs_member"
[root@storage-odb2-1 ~]# gdisk -l /dev/mapper/35000cca2a601d650
GPT fdisk (gdisk) version 1.0.3

Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

A pool created on single path devices with multipath enabled afterwards has this for the raw mapped device:

[root@storage-seq-1 ~]# blkid /dev/mapper/35000cca2be1ed890
/dev/mapper/35000cca2be1ed890: PTUUID="c445b2c8-c715-7943-ba9b-8fc7ee0f16b3" PTTYPE="gpt"
[root@storage-seq-1 ~]# gdisk -l /dev/mapper/35000cca2be1ed890
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/mapper/35000cca2be1ed890: 42970644480 sectors, 20.0 TiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): C445B2C8-C715-7943-BA9B-8FC7EE0F16B3
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 42970644446
Partitions will be aligned on 2048-sector boundaries
Total free space is 4029 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     42970626047   20.0 TiB    BF01  zfs-69c2fcb0ef85e748
   9     42970626048     42970642431   8.0 MiB     BF07  

This seems to indicate that when creating a pool on multipath-mapped devices, the disk isn't partitioned? I'm also now wondering: if I take the drives from a pool created on multipath devices to a different server without multipath, will the pool import successfully?

rincebrain commented 6 months ago

It'd import fine.

AIUI, on Linux, if it recognizes that the thing it's using as a "disk" is a whole "disk", it will partition it, then hide the partition in the status output. (On FreeBSD, it just uses whatever you give it, up to and including an entire unpartitioned disk.)

So I would assume it's treating the multipath device as "not a whole real disk" for this purpose, and not partitioning it when you use it for create, then for import, it's finding the ZFS metadata on the "partition" device, since it doesn't know anything about the relationship between the arbitrarily named devices.

Why the naming is different, you'd have to examine the multipath code to figure out, I'd guess.
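
If you want to check what it decided, the pool config records a whole_disk flag per vdev; when it is 1, ZFS partitioned the device itself and hides the partition suffix in status. Something like this should show it (pool name from your output above; the exact fields in the dump vary by version):

    zdb -C jbodpool | grep -E "path|whole_disk"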

griznog commented 6 months ago

Thanks @rincebrain, I guess that makes sense to me. Looks like a backup / destroy pool / re-create pool / restore is my next step to get this server to behave like my other multipath servers/pools.

rincebrain commented 6 months ago

If that's your simplest option, maybe.

You could also do something like drop one or two disks, let the dRAID distributed hot spares rebuild, then replace them with their whole multipathed versions, possibly using zpool labelclear on them so it doesn't complain they were once in the pool in the interim.

Or just remake the pool.

But either way, hitting that with 100+ disks in the pool is going to be cumbersome. :(

griznog commented 6 months ago

Another thing to note: in the many reboots done while troubleshooting this, occasionally one or a handful of devices will fail to get their partition mapped by multipathd, and the pool will show them as UNAVAIL. A zpool replace poolname OLDID /dev/mapper/WWNAME will replace the drive with itself and use the full device without partitioning, as happens when creating a pool on multipath-mapped devices. I suppose that, given enough reboots, or by just detaching and replacing the partitioned devices with themselves, I would eventually get the pool to be like any other pool created on multipath-mapped devices, and it would import correctly without the need to map partitions.
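
Spelled out with placeholder names in case it helps anyone else (the device here is just an example taken from the /dev/mapper listing above, not necessarily one that needs replacing; wait for the rebuild to finish before touching the next disk):

    # replace a partition-named vdev with its whole multipath device
    zpool replace jbodpool 35000cca2be7b2f9c1 /dev/mapper/35000cca2be7b2f9c
    zpool status jbodpool   # watch the resilver/rebuild progress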

lundman commented 6 months ago

Presumably related to this code: https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_device_path_os.c#L47-L71 that just adds a "1" in some cases.
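
Paraphrasing those lines (my reading, not the verbatim source; check the link for the current code), the rule is roughly:

    /*
     * Sketch of the partition-suffix rule from the linked span of
     * lib/libzutil/os/linux/zutil_device_path_os.c. Not the real code,
     * just the behavior as I read it, compiled standalone for illustration.
     */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int
    append_partition_sketch(char *path, size_t max_len)
    {
        size_t len = strlen(path);

        if (strstr(path, "/dev/disk/by-") != NULL) {
            /* udev by-* aliases get a "-part1" suffix */
            if (len + 7 > max_len)
                return (-1);
            strcat(path, "-part1");
        } else if (isdigit((unsigned char)path[len - 1])) {
            /* names ending in a digit get a "p" delimiter: ...c21c4 -> ...c21c4p1 */
            if (len + 3 > max_len)
                return (-1);
            strcat(path, "p1");
        } else {
            /* everything else gets a bare "1"; this is why WWN map names
             * ending in "c" show up with a trailing "1" in zpool status */
            if (len + 2 > max_len)
                return (-1);
            strcat(path, "1");
        }
        return (0);
    }

    int
    main(void)
    {
        char a[64] = "/dev/mapper/35000cca2530e297c";   /* ends in a letter */
        char b[64] = "/dev/mapper/35000cca2be7c21c4";   /* ends in a digit  */

        append_partition_sketch(a, sizeof (a));
        append_partition_sketch(b, sizeof (b));
        printf("%s\n%s\n", a, b);   /* prints ...e297c1 and ...c21c4p1 */
        return (0);
    }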

griznog commented 6 months ago

Presumably related to this code: https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_device_path_os.c#L47-L71 that just adds a "1" in some cases.

This seems to be where the suffix strangeness comes from. For the 1200 drives we have, the WWN-based multipath names all end in either 0-9 or c. I'm guessing this is a property of all WWNs? Curious why it matters to ZFS in this bit of code, though. Is this a rule meant for some other type of device?

Also, does ZFS somehow trigger the mapping of these, since it seems to also set the names?

griznog commented 6 months ago

From Wikipedia (https://en.wikipedia.org/wiki/World_Wide_Name):

Each WWN is an 8- or 16-byte number, the length and format of which is determined by the most significant four bits, which are referred to as an NAA (Network Address Authority). The remainder of the value is derived from an IEEE OUI (or from Company Id (CID)) and vendor-supplied information. Each format defines a different way to arrange and/or interpret these components.

My pattern of all WWNs ending in [0-9] or c must be a Western Digital/HGST format.
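
Taking one of the WWNs from my earlier output as a worked example (my own decoding, so treat the OUI attribution as an assumption):

    5 000cca 2530e297c
    | |      `--------- 36-bit vendor-assigned identifier (9 hex digits)
    | `---------------- 24-bit IEEE OUI 00:0C:CA (HGST, as far as I can tell)
    `------------------ NAA 5, the "IEEE Registered" 64-bit format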

devZer0 commented 6 months ago

Hello,

I have dug into git a little bit, and these are some commits from 14 years ago:

https://github.com/openzfs/zfs/commit/a2c6816c34952eb6dad51248d31172189fba9126 - Support shorthand names with zpool remove

https://github.com/openzfs/zfs/commit/83c62c939938ca5915a61022208a31c4ab3faa1c - Strip partition from device name for whole disks


griznog commented 6 months ago

After poking at this pool for several days, I think this isn't really a "bug", just a really confusing "feature". I don't understand why devices get treated differently between multipath and non-multipath; I would have assumed that always using the device without partitioning would be better.

The TL;DR here is that if you want a pool to be on multipath devices, create it on multipath devices. In my case I did not do that, because I was waiting on an order of SAS cables that would allow me to cable for multipath and thought I'd get a head start getting data onto the system while waiting. Lesson learned: don't do that, or if I ever do need to do it again, figure out first how to force everything to be mapped by multipath even when there is only a single path.

From my perspective (a user who knows nothing about ZFS internals) this can be closed, as I don't see anything to fix here, other than that it would have been nice to read about this in the docs at some point.