platinasystems / go

[sr-iov] panic: pci 0000:04:00.0 Broadcom 0xb960: open /dev/vfio/15: no such file or directory #64

Open donnlee opened 7 years ago

donnlee commented 7 years ago

After installing the latest goes-installer, "goes status" fails and a panic appears in syslog. The previous goes version was built on 2017/07/13.


donn@invader7:~$ sudo ./goes-platina-mk1-installer.0727
[sudo] password for donn:
Device "eth-29-0" does not exist.
Cannot find device "eth-29-0"
Device "eth-27-0" does not exist.
Cannot find device "eth-27-0"
Device "eth-25-0" does not exist.
Cannot find device "eth-25-0"
start: (platina,vnet.ready) timeout
install: start: exit status 1

donn@invader7:~$ sudo goes status
GOES status
======================
  PCI             - OK
  Check daemons   - Not OK
status: vnetd daemon not running

donn@invader7:~$ sudo less /var/log/syslog

Jul 27 13:36:48 invader7 kernel: ixgbe 0000:03:00.1 eth2: VF Reset msg received from vf 15
Jul 27 13:36:48 invader7 kernel: ixgbe 0000:03:00.0 eth1: NIC Link is Down
Jul 27 13:36:48 invader7 kernel: ixgbe 0000:03:00.1 eth2: NIC Link is Down
Jul 27 13:36:48 invader7 goes.vnetd[429]: panic: pci 0000:04:00.0 Broadcom 0xb960: open /dev/vfio/15: no such file or directory
Jul 27 13:36:49 invader7 goes.vnetd[429]:
Jul 27 13:36:49 invader7 goes.vnetd[429]: goroutine 1 [running]:
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/vnet.(*Vnet).Run.func1(0xc42019c000)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/vnet/vnet.go:59 +0xfe
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/elib/loop.(*Loop).callInitHooks(0xc42019c000)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/elib/loop/loop.go:394 +0x5d
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/elib/loop.(*Loop).Run(0xc42019c000)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/elib/loop/loop.go:470 +0x103
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/vnet.(*Vnet).Run(0xc42019c000, 0xc42016c380, 0x1, 0x1)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/vnet/vnet.go:62 +0xad
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/goes/cmd/vnetd.(*Command).Main(0xc42019c000, 0xc4200101d0, 0x0, 0x0, 0x0, 0x0)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/goes/cmd/vnetd/vnetd.go:126 +0x626
Jul 27 13:36:49 invader7 goes.vnetd[429]: github.com/platinasystems/go/goes.(*Goes).Main(0xc42016c2a0, 0xc4200101d0, 0x1, 0x1, 0x0, 0x0)
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/goes/goes.go:286 +0x9a2
Jul 27 13:36:49 invader7 goes.vnetd[429]: main.main()
Jul 27 13:36:49 invader7 goes.vnetd[429]:         /home/jenkins/workspace/go/src/github.com/platinasystems/go/main/goes-platina-mk1/main.go:19 +0x57
Jul 27 13:36:49 invader7 goes.vnetd[429]: exit status 2
Jul 27 13:37:42 invader7 ntpd[471]: Deleting interface #804 eth-10-0, fe80::46:8aff:fe00:a00#123, interface stats: received=0, sent=0, dropped=0, active_time=72300 secs
Jul 27 13:37:42 invader7 ntpd[471]: Deleting interface #803 eth-28-0, fe80::46:8aff:fe00:a12#123, interface stats: received=0, sent=0, dropped=0, active_time=72300 secs

donnlee commented 7 years ago

I reran goes-platina-mk1-installer and it failed in the same way. After a cold reboot, goes started OK.


donn@invader7:~$ uptime
 13:46:06 up 0 min,  1 user,  load average: 0.30, 0.07, 0.02

donn@invader7:~$ uname -a
Linux invader7 4.11.0-platina-mk1-amd64 #2 SMP Fri Jun 9 11:21:14 PDT 2017 x86_64 GNU/Linux

donn@invader7:~$ sudo goes status
[sudo] password for donn:
GOES status
======================
  PCI             - OK
  Check daemons   - OK
  Check Redis     - OK
  Check vnet      - OK

donnlee commented 7 years ago

The same symptoms occurred when I upgraded invader2.

jasonlpang commented 7 years ago

Hi Donn, is a reboot involved in your upgrade process? If so, do you run it with "reboot" or "reboot -f"? Also, after the reboot, do the TH devices 04:00.0 and 04:00.1 show up in "lspci"?

On alpha units (invader 1-15), please use "reboot -f" to make sure TH reliably shows up in lspci after a reboot.

Thanks, Jason

stigt commented 7 years ago

Try running "goes stop; rmmod uio-pci-dma" before installing the vfio-mode build. Also check whether you have an /etc/modprobe.d/goes-platina-mk1-modprobe.conf that is loading the module.

stig


donnlee commented 7 years ago

I upgraded coreboot (per Jason's email). Then I ran 'reboot -f' and saw a scary-looking crash (below). I'm going to try another cold boot next.

Last login: Thu Jul 27 14:31:50 PDT 2017 from 172.16.2.23 on pts/0
Linux invader2 4.11.0-platina-mk1-amd64 #1 SMP Thu May 11 22:06:03 PDT 2017 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
donn@invader2:~$ sudo reboot -f
[sudo] password for donn:
Rebooting.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
sd 0:0:0:0: [sda] Synchronizing SCSI cache
IP: napi_hash_del+0x14/0x70
PGD 440fc0067
PUD 45edc5067
PMD 0

Oops: 0002 [#1] SMP
Modules linked in: xt_nat ixgbevf ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay iptable_raw nls_utf8 nls_cp437 vfat fat kvm_intel kvm uio_pci_dma i2c_i801 autofs4 dm_mod ixgbe mdio
CPU: 5 PID: 8064 Comm: goes Not tainted 4.11.0-platina-mk1-amd64 #1
Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS coreboot-unknown 07/27/2017
task: ffff88046b162340 task.stack: ffffc90006c70000
RIP: 0010:napi_hash_del+0x14/0x70
RSP: 0018:ffffc90006c73b98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff88045efb84f0 RDI: ffffffff819a674c
RBP: ffffc90006c73ba0 R08: 0000000000000002 R09: 0000000000000000
R10: ffffc90006c73b60 R11: ffff88046b13de00 R12: 0000000000000001
R13: ffff88045efb8800 R14: 0000000000000000 R15: 0000000000000010
FS:  00007fa802ffd700(0000) GS:ffff88047fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 000000045f19a000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ixgbevf_free_q_vectors+0x45/0x70 [ixgbevf]
 ixgbevf_clear_interrupt_scheme+0x9b/0xc0 [ixgbevf]
 ixgbevf_remove+0x44/0xb0 [ixgbevf]
 pci_device_remove+0x34/0xb0
 device_release_driver_internal+0x142/0x1f0
 device_release_driver+0xd/0x10
 pci_stop_bus_device+0x6b/0x80
 pci_stop_and_remove_bus_device+0xd/0x20
 pci_iov_remove_virtfn+0x9b/0x130
 ? pci_get_subsys+0x30/0x40
 pci_disable_sriov+0x37/0x110
 ixgbe_disable_sriov+0xc5/0x210 [ixgbe]
 ixgbe_pci_sriov_configure+0xeb/0x140 [ixgbe]
 sriov_numvfs_store+0x13f/0x190
 dev_attr_store+0x13/0x20
 sysfs_kf_write+0x32/0x40
 kernfs_fop_write+0x102/0x180
 __vfs_write+0x23/0x120
 ? __alloc_fd+0x3a/0x160
 vfs_write+0xaf/0x180
 SyS_write+0x41/0xb0
 entry_SYSCALL_64_fastpath+0x13/0x94
RIP: 0033:0x4885e4
RSP: 002b:000000c4210f9630 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004885e4
RDX: 0000000000000002 RSI: 000000c420282280 RDI: 0000000000000007
RBP: 000000c4210f9788 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000126 R14: 00000000000010f0 R15: 0000000000000200
Code: 89 53 20 48 89 08 48 89 4a 08 c6 05 46 2a 58 00 00 5b 5d c3 0f 1f 00 55 48 89 e5 53 48 89 fb 48 c7 c7 4c 67 9a 81 e8 cc 30 0d 00 <f0> 0f ba 73 10 04 72 0c 31 c0 c6 05 17 2a 58 00 00 5b 5d c3 48
RIP: napi_hash_del+0x14/0x70 RSP: ffffc90006c73b98
CR2: 0000000000000020
---[ end trace d0b781331bdbac25 ]---
INFO: rcu_sched self-detected stall on CPU
        0-...: (5249 ticks this GP) idle=54d/140000000000001/0 softirq=15739/15739 fqs=2624
         (t=5250 jiffies g=3807 c=3806 q=993)
NMI backtrace for cpu 0
CPU: 0 PID: 10974 Comm: reboot Tainted: G      D         4.11.0-platina-mk1-amd64 #1
Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS coreboot-unknown 07/27/2017
Call Trace:
 <IRQ>
 dump_stack+0x4d/0x65
 nmi_cpu_backtrace+0x9b/0xa0
 ? irq_force_complete_move+0xf0/0xf0
 nmi_trigger_cpumask_backtrace+0x8f/0xc0
 arch_trigger_cpumask_backtrace+0x14/0x20
 rcu_dump_cpu_stacks+0x8f/0xca
 rcu_check_callbacks+0x651/0x7b0
 ? update_wall_time+0x448/0x770
 update_process_times+0x2a/0x50
 tick_sched_timer+0x48/0x160
 __hrtimer_run_queues+0x9c/0x110
 hrtimer_interrupt+0xa3/0x190
 local_apic_timer_interrupt+0x33/0x60
 smp_apic_timer_interrupt+0x33/0x50
 apic_timer_interrupt+0x86/0x90
RIP: 0010:queued_spin_lock_slowpath+0x15d/0x180
RSP: 0018:ffffc9000a37fcc0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
RAX: 0000000000000101 RBX: ffff88046b6d9050 RCX: 0000000000000101
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffff819a674c
RBP: ffffc9000a37fcc0 R08: 0000000000000001 R09: 00000000000ffff8
R10: ffffc9000a37fc48 R11: 0000000000000004 R12: ffff88046d27e800
R13: ffff88046d1c3000 R14: 0000000000000002 R15: ffffc9000a37fd87
 </IRQ>
 _raw_spin_lock+0x1b/0x20
 napi_hash_del+0x14/0x70
 netif_napi_del+0xd/0x70
 igb_reset_q_vector+0x4f/0x60
 igb_free_q_vectors+0x3d/0x80
 __igb_shutdown+0x5f/0x1d0
 igb_shutdown+0x17/0x50
 pci_device_shutdown+0x31/0x70
 device_shutdown+0xc9/0x180
 kernel_restart_prepare+0x31/0x40
 kernel_restart+0xd/0x60
 SyS_reboot+0xf4/0x1d0
 ? kmem_cache_alloc+0xf9/0x110
 ? __alloc_fd+0x3a/0x160
 ? vfs_writev+0x37/0x50
 ? __fdget_pos+0x12/0x50
 ? vfs_writev+0x37/0x50
 ? do_writev+0x49/0xb0
 entry_SYSCALL_64_fastpath+0x13/0x94
RIP: 0033:0x7fad98183b46
RSP: 002b:00007ffeb465db78 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
RAX: ffffffffffffffda RBX: 00007ffeb465d640 RCX: 00007fad98183b46
RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
RBP: 00007ffeb465d8b0 R08: 00007ffeb465d250 R09: 00007ffeb465daa0
R10: 0000000000000002 R11: 0000000000000206 R12: 0000563899513742
R13: 00007ffeb465d7b8 R14: 0000000000000001 R15: 0000000000000014
INFO: rcu_sched self-detected stall on CPU
        0-...: (20946 ticks this GP) idle=54d/140000000000001/0 softirq=15739/15739 fqs=10463
         (t=21003 jiffies g=3807 c=3806 q=3251)
NMI backtrace for cpu 0
CPU: 0 PID: 10974 Comm: reboot Tainted: G      D         4.11.0-platina-mk1-amd64 #1
Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS coreboot-unknown 07/27/2017
Call Trace:
<repeats>

dlobete commented 7 years ago

Commenting on the original headline: this error indicates that another instance of vnet already has vfio open, so the second instance cannot start and panics.