redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

Segfault running `rpk redpanda tune disk_irq` #17869

Closed voutilad closed 3 months ago

voutilad commented 6 months ago

Version & Environment

Redpanda version: 23.3.11

Running in Oracle Cloud:

[    0.000000] Linux version 5.15.0-202.135.2.el8uek.x86_64 (mockbuild@host-100-100-224-51) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9.2.0.1), GNU ld version 2.36.1-4.0.1.el8_6) #2 SMP Fri Jan 5 16:12:57 PST 2024
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.15.0-202.135.2.el8uek.x86_64 root=/dev/mapper/ocivolume-root ro crashkernel=auto LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 ip=dhcp,dhcp6 rd.net.timeout.dhcp=10 crash_kexec_post_notifiers

What went wrong?

root@rpk:/# rpk redpanda tune disk_irq
TUNER              APPLIED         ENABLED        SUPPORTED      ERROR
disk_irq  false  true  true  err=signal: segmentation fault (core dumped), stderr=

dmesg output after segfault:

[555736.445660] hwloc-distrib-r[1438130]: segfault at 21 ip 00007fa88ae41264 sp 00007fff08346000 error 4 in libhwloc.so.15[7fa88ae1f000+5a000]
[555736.450529] Code: 39 d1 75 f2 eb a1 31 c0 5b 41 5e c3 66 2e 0f 1f 84 00 00 00 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 18 49 89 f4 49 89 fa <44> 8b 0e 44 8b 32 45 39 f1 45 89 f5 45 0f 47 e9 44 89 f5 41 0f 42

What should have happened instead?

The command should either report a clean error or tune the disk IRQs correctly.

How to reproduce the issue?

  1. Run a privileged pod in OCI's managed k8s using the Redpanda 23.3.11 image.
  2. Run `rpk redpanda mode prod`
  3. Run `rpk redpanda tune disk_irq`

JIRA Link: CORE-2376

voutilad commented 6 months ago

Here's full debug output:

root@rpk:/# rpk redpanda tune disk_irq -v
19:17:24.584  DEBUG  Looking for interface with '[0.0.0.0 0.0.0.0]' addresses
19:17:24.585  DEBUG  Checking 'lo' address '127.0.0.1/8'
19:17:24.585  DEBUG  Checking 'eth0' address '10.0.10.45/24'
19:17:24.585  DEBUG  Creating disk IRQs tuner with mode 'def', cpu mask 'all', directories '[/var/lib/redpanda/data]' and devices '[]'
19:17:24.585  DEBUG  Checking if 'hwloc-calc-redpanda' & 'hwloc-distrib-redpanda' are present...
19:17:24.585  DEBUG  Tuner parameters &{Mode: CPUMask:all RebootAllowed:false Disks:[] Directories:[/var/lib/redpanda/data] Nics:[eth0]}
19:17:24.585  DEBUG  Collecting info about directory '/var/lib/redpanda/data'
19:17:24.585  DEBUG  Getting block device from path '/var/lib/redpanda/data'
19:17:24.585  DEBUG  Creating block device from number {8, 16}
19:17:24.585  DEBUG  Reading block device details from '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.585  DEBUG  Getting physical device from '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.585  DEBUG  Checking 'Disks IRQs affinity static'
19:17:24.585  DEBUG  Getting 'sdb' IRQs
19:17:24.585  DEBUG  Getting block device from path '/dev/sdb'
19:17:24.586  DEBUG  Creating block device from number {8, 16}
19:17:24.586  DEBUG  Reading block device details from '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.586  DEBUG  Getting controller path for '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.586  DEBUG  Reading IRQs of '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb', with deviceInfo name pattern 'blkif'
19:17:24.586  DEBUG  Reading '/proc/interrupts' file...
19:17:24.586  DEBUG  DeviceInfo '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb' IRQs '[]'
19:17:24.586  DEBUG  Checking if we are running on i3.metal amazon instance type
19:17:24.594  DEBUG  Running on 'No such metadata item' EC2 instance
19:17:24.594  DEBUG  Running command 'ps' with arguments '[--no-headers -C irqbalance]'
19:17:24.595  DEBUG  Check 'Disks IRQs affinity static' passed, skipping tuning
19:17:24.595  DEBUG  Checking 'Disks IRQs affinity set'
19:17:24.595  DEBUG  Getting [sdb] IRQs distribution with mode def and CPU mask all
19:17:24.595  DEBUG  Running command 'hwloc-calc-redpanda' with arguments '[all]'
19:17:24.606  DEBUG  Getting 'sdb' IRQs
19:17:24.606  DEBUG  Getting block device from path '/dev/sdb'
19:17:24.606  DEBUG  Creating block device from number {8, 16}
19:17:24.606  DEBUG  Reading block device details from '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.606  DEBUG  Getting controller path for '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb'
19:17:24.606  DEBUG  Reading IRQs of '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb', with deviceInfo name pattern 'blkif'
19:17:24.606  DEBUG  Reading '/proc/interrupts' file...
19:17:24.606  DEBUG  DeviceInfo '/sys/devices/platform/host3/session8/target3:0:0/3:0:0:2/block/sdb' IRQs '[]'
19:17:24.606  DEBUG  Checking if we are running on i3.metal amazon instance type
19:17:24.614  DEBUG  Running on 'No such metadata item' EC2 instance
19:17:24.614  DEBUG  Calculating default mode for Disk IRQs
19:17:24.614  DEBUG  Running command 'hwloc-calc-redpanda' with arguments '[--restrict 0x000000ff --number-of core machine:0]'
19:17:24.625  DEBUG  Running command 'hwloc-calc-redpanda' with arguments '[--restrict 0x000000ff --number-of PU machine:0]'
19:17:24.635  DEBUG  Considering '4' cores and '8' PUs
19:17:24.635  DEBUG  Computing IRQ CPU mask for 'sq' mode and input CPU mask '0x000000ff'
19:17:24.635  DEBUG  Computing CPU mask for 'sq' mode and input CPU mask '0x000000ff'
19:17:24.635  DEBUG  Running command 'hwloc-calc-redpanda' with arguments '[0x000000ff ~PU:0]'
19:17:24.644  DEBUG  Computations CPU mask '0x000000fe'
19:17:24.644  DEBUG  Running command 'hwloc-calc-redpanda' with arguments '[0x000000ff ~0x000000fe]'
19:17:24.654  DEBUG  IRQs CPU mask '0x00000001'
19:17:24.654  DEBUG  Running command 'hwloc-distrib-redpanda' with arguments '[0 --single --restrict 0x00000001]'
TUNER              APPLIED         ENABLED        SUPPORTED      ERROR
disk_irq  false  true  true  err=signal: segmentation fault (core dumped), stderr=
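
The mask values in the last few debug lines can be checked by hand. Below is a minimal Python sketch of that arithmetic. It is not rpk's or hwloc's code: the function name `split_masks` is hypothetical, and it assumes that `~PU:0` simply clears bit 0 of the input mask, which matches the values logged above.

```python
def split_masks(cpu_mask, irq_pu=0):
    """Split cpu_mask into (computation mask, IRQ mask).

    Assumption: a single PU (bit index irq_pu) is reserved for IRQ
    handling and every other PU in cpu_mask is left for computation.
    """
    # Mirrors the logged step: hwloc-calc-redpanda 0x000000ff ~PU:0
    computation = cpu_mask & ~(1 << irq_pu)
    # Mirrors the logged step: hwloc-calc-redpanda 0x000000ff ~0x000000fe
    irq = cpu_mask & ~computation
    return computation, irq


computation_mask, irq_mask = split_masks(0x000000FF)
print(f"computation=0x{computation_mask:08x} irq=0x{irq_mask:08x}")
```

For the 8-PU mask `0x000000ff` this gives `0x000000fe` and `0x00000001`, the same values as in the log, so the masks themselves look sane right up to the `hwloc-distrib-redpanda` call.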

voutilad commented 6 months ago

For additional background, I have an Oracle Block Volume backing the PV mounted to:

/dev/sdb on /var/lib/redpanda/data type xfs (rw,relatime,seclabel,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)

It looks to be attached via SCSI:

[556929.998574] sd 3:0:0:2: [sdb] 104857600 512-byte logical blocks: (53.7 GB/50.0 GiB)
[556930.001978] sd 3:0:0:2: [sdb] 4096-byte physical blocks
[556930.004672] sd 3:0:0:2: [sdb] Write Protect is off
[556930.006880] sd 3:0:0:2: [sdb] Mode Sense: 2b 00 10 08
[556930.007135] sd 3:0:0:2: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[556930.011066] sd 3:0:0:2: [sdb] Optimal transfer size 1048576 bytes
[556930.019158] sd 3:0:0:2: [sdb] Attached SCSI disk
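
The debug output earlier reports an empty IRQ list (`IRQs '[]'`) for `sdb` when matching the name pattern 'blkif'. To check by hand which lines of `/proc/interrupts` match a given name, a rough sketch like the following may help. It is not rpk's implementation: `irqs_matching` is a hypothetical helper, and the sample text is made up rather than taken from the affected host.

```python
import re


def irqs_matching(interrupts_text, pattern):
    """Return the IRQ numbers whose /proc/interrupts line contains pattern."""
    irqs = []
    for line in interrupts_text.splitlines():
        # Numbered IRQ lines start with "<number>:"; named rows such as
        # "NMI:" and the CPU header line are skipped.
        match = re.match(r"\s*(\d+):", line)
        if match and pattern in line:
            irqs.append(int(match.group(1)))
    return irqs


# Made-up sample for illustration only.
SAMPLE = """\
           CPU0       CPU1
 24:        100          0   PCI-MSI 49152-edge      virtio0-config
 25:       5000       4000   PCI-MSI 49153-edge      virtio0-request
NMI:          0          0   Non-maskable interrupts
"""

print(irqs_matching(SAMPLE, "blkif"))
print(irqs_matching(SAMPLE, "virtio0"))
```

On a real host the first argument would be the contents of `/proc/interrupts`, for example `open("/proc/interrupts").read()`.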

StephanDollberg commented 6 months ago

I think this is a dupe of https://github.com/redpanda-data/core-internal/issues/1145

StephanDollberg commented 6 months ago

If you run into this again could you please try:

StephanDollberg commented 3 months ago

Closing this in favor of the above-mentioned ticket, which already has some investigation.