Closed gowrishankarm closed 6 years ago
------- Comment From bssrikanth@in.ibm.com 2017-11-02 05:21:08 EDT------- Discussed with Gowri, this issue is always recreatable with ftrace function on.. and after a while host is hung.. hard reset of host is required in order to recover from this issue..
------- Comment From satheera@in.ibm.com 2017-11-24 02:10:54 EDT------- (In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #9)
> > > @Satheesh,
> > >
> > > Any updates on this?
> >
> > Naveen,
> > do you think
> > https://patchwork.kernel.org/patch/10042881/ would help in this case.
> > do you know if your fix made it to 4.14-rc4?
>
> From the report above, this looks to be a ppc64le host. The patch above is
> for BE, so it is probably not applicable in this case. I am not able to
> conclude anything from the provided traces.
>
> Satheesh,
> Could you please test v4.14 to see if this problem can be reproduced there?
> If so, it would also be good to see if reverting commit
> 6b847d795cf4ab3e574f4fcf7193fe245908a195 helps. That commit changed which
> functions get traced, so it might have an impact here.
I am able to hit with latest devel(4.14) 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le
Steps to reproduce:
Run the script below on the host and wait at the "press ENTER key to cancel" step:
mkdir -p /debug
mount -t debugfs nodev /debug 2>&1
echo '*' >/debug/tracing/set_ftrace_filter
echo function >/debug/tracing/current_tracer
echo 1 >/debug/tracing/tracing_on
read -p "press ENTER key to cancel .." var
echo 0 >/debug/tracing/tracing_on
cat /debug/tracing/trace > /tmp/tracing.out$$
echo "/tmp/tracing.out$$ is created .."
Run a VM start/stop loop in another terminal; the hard lockup is hit immediately:
for i in {1..20};do virsh destroy vm1;virsh start vm1;virsh domstate vm1;sleep 2;done
Message from syslogd@localhost at Nov 24 02:03:07 ...
kernel:Watchdog CPU:56 detected Hard LOCKUP other CPUS:24
error: Failed to destroy domain vm1
error: Failed to terminate process 5481 with SIGKILL: Device or resource busy
error: Domain is already active
in shutdown
error: Failed to destroy domain vm1
error: Failed to terminate process 5481 with SIGKILL: Device or resource busy
Regards, -Satheesh
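The tracing steps in the reproduction script above can also be wrapped in two small functions. This is only a sketch: `TRACING_DIR`, `start_function_trace` and `stop_function_trace` are names made up for illustration, and it assumes debugfs is already mounted so that the tracing files exist under `TRACING_DIR` (the original script uses /debug/tracing).

```shell
#!/bin/bash
# Sketch only: TRACING_DIR and both function names are assumptions for
# illustration; the original script hard-codes /debug/tracing.
TRACING_DIR=${TRACING_DIR:-/debug/tracing}

# Enable the function tracer for every kernel function.
start_function_trace() {
    echo '*' > "$TRACING_DIR/set_ftrace_filter"
    echo function > "$TRACING_DIR/current_tracer"
    echo 1 > "$TRACING_DIR/tracing_on"
}

# Stop tracing and save the trace buffer to the file given as $1.
stop_function_trace() {
    local out=$1
    echo 0 > "$TRACING_DIR/tracing_on"
    cat "$TRACING_DIR/trace" > "$out"
}
```

A possible use is `start_function_trace`, then the virsh loop in another terminal, then `stop_function_trace /tmp/tracing.out$$`. Keeping the directory in a variable makes it easy to point the same steps at /sys/kernel/debug/tracing on hosts where debugfs is mounted there.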
------- Comment From naveen.n.rao@in.ibm.com 2017-11-24 05:33:38 EDT-------
> > Satheesh,
> > Could you please test v4.14 to see if this problem can be reproduced there?
> > If so, it would also be good to see if reverting commit
> > 6b847d795cf4ab3e574f4fcf7193fe245908a195 helps. That commit changed which
> > functions get traced, so it might have an impact here.
>
> I am able to hit with latest devel(4.14)
> 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le
Thanks for confirming. It is not clear from your message if you tried by reverting the patch I mentioned above. There also seems to have been many watchdog fixes that went into v4.15, so it will be good to test with those for more reliable traces. I can take a look if you can spare the test machine for a day next week?
------- Comment From gowrishankar.m@in.ibm.com 2017-11-02 04:10:28 EDT------- https://github.com/open-power-host-os/linux/issues/22
------- Comment From brsriniv@in.ibm.com 2017-11-02 05:54:46 EDT------- It is unclear if the behavior is being reported on the guest or the host. We do notice devices resetting in the logs, so if these messages are from the guest, what is happening on the host? The lockups seem to coincide with the devices being locked up.
Can we please get sosreport from the host and the guest? There is not really enough data here to proceed on this.
------- Comment From bssrikanth@in.ibm.com 2017-11-02 05:57:23 EDT------- (In reply to comment #4)
> It is unclear if the behavior is being reported on the guest or the host. We
> do notice devices resetting in the logs, so if these messages are from the
> guest, what is happening on the host? The lockups seem to coincide with the
> devices being locked up.
> Can we please get sosreport from the host and the guest? There is not really
> enough data here to proceed on this.
From what I heard from Gowri and description of this issue.. hard lockup happens while booting guest with ftrace.. @Gowri can you please help with confirming above and providing sosreports?
------- Comment From bssrikanth@in.ibm.com 2017-11-02 06:00:02 EDT------- (In reply to comment #5) > @Gowri can you please help with confirming above and providing sosreports?
Forgot to mention: @Gowri sosreport running as-is on hostos has issues... instead run sosreport -n powerpc
...
------- Comment From gowrishankar.m@in.ibm.com 2017-11-02 09:01:42 EDT------- Reported trace was from host kernel, as we enable ftrace in host and then start guest.
Btw I am not quite sure how it would be possible to run sosreport as I could not execute any command once hard lockup detected and I find current ssh login as well as ipmi console login stops responding momentarily.
@Srikant, hope your team would be able to assist providing any other needed info as server is not with me (I spotted it while debugging some other bug).
------- Comment From bssrikanth@in.ibm.com 2017-11-02 09:09:30 EDT------- (In reply to comment #7)
> Reported trace was from host kernel, as we enable ftrace in host and then
> start guest.
> Btw I am not quite sure how it would be possible to run sosreport as I could
> not execute any command once hard lockup detected and I find current ssh
> login as well as ipmi console login stops responding momentarily.
> @Srikant, hope your team would be able to assist providing any other needed
> info as server is not with me (I spotted it while debugging some other bug).
Sure Gowri. Since we cannot run sosreport after hitting issue, we will try to reset host and capture logs.. @Satheesh would help here in getting required logs..
------- Comment From brsriniv@in.ibm.com 2017-11-21 23:43:36 EDT------- @Satheesh,
Any updates on this?
------- Comment From srikar.dronamraju@in.ibm.com 2017-11-22 00:53:30 EDT------- (In reply to comment #9) > @Satheesh, > > Any updates on this?
Naveen, do you think https://patchwork.kernel.org/patch/10042881/ would help in this case. do you know if your fix made it to 4.14-rc4?
------- Comment From naveen.n.rao@in.ibm.com 2017-11-23 11:38:01 EDT------- (In reply to comment #10)
> (In reply to comment #9)
> > @Satheesh,
> > Any updates on this?
>
> Naveen,
> do you think
> https://patchwork.kernel.org/patch/10042881/ would help in this case.
> do you know if your fix made it to 4.14-rc4?
From the report above, this looks to be a ppc64le host. The patch above is for BE, so it is probably not applicable in this case. I am not able to conclude anything from the provided traces.
Satheesh, Could you please test v4.14 to see if this problem can be reproduced there? If so, it would also be good to see if reverting commit 6b847d795cf4ab3e574f4fcf7193fe245908a195 helps. That commit changed which functions get traced, so it might have an impact here.
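One way to prepare a tree for the revert test suggested above is sketched below. The function name and the `test-revert` branch name are hypothetical, and the thread does not say how the revert was actually carried out; the sketch only uses standard `git checkout -b` and `git revert --no-edit`.

```shell
#!/bin/bash
# Sketch only: the function name and the "test-revert" branch name are
# hypothetical; the thread does not describe how the revert was done.
revert_on_test_branch() {
    local repo=$1 base=$2 commit=$3
    # Start a scratch branch at the base release so the revert stays isolated.
    git -C "$repo" checkout -q -b test-revert "$base" || return 1
    # Create a new commit that undoes the suspect change.
    git -C "$repo" revert --no-edit "$commit"
}
```

For the case discussed here, the call would look like `revert_on_test_branch /path/to/linux v4.14 6b847d795cf4ab3e574f4fcf7193fe245908a195` (the path is a placeholder), followed by rebuilding and booting the resulting kernel before repeating the reproduction steps.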
------- Comment From satheera@in.ibm.com 2017-11-24 01:14:09 EDT------- Created attachment 122559 host cal trace
------- Comment From satheera@in.ibm.com 2017-11-28 08:12:02 EDT------- (In reply to comment #14)
> > > Satheesh,
> > > Could you please test v4.14 to see if this problem can be reproduced there?
> > > If so, it would also be good to see if reverting commit
> > > 6b847d795cf4ab3e574f4fcf7193fe245908a195 helps. That commit changed which
> > > functions get traced, so it might have an impact here.
> >
> > I am able to hit with latest devel(4.14)
> > 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le
>
> Thanks for confirming. It is not clear from your message if you tried by
> reverting the patch I mentioned above. There also seems to have been many
> watchdog fixes that went into v4.15, so it will be good to test with those
> for more reliable traces. I can take a look if you can spare the test
> machine for a day next week?
I had tried with the latest hostos devel branch, but have yet to try your suggestion of reverting that commit. I am planning to try it now and will update once I have the results.
Regards, -Satheesh.
------- Comment From satheera@in.ibm.com 2017-11-28 08:51:21 EDT------- Hi Naveen,
I am able to hit the hard lockup even after reverting commit 6b847d795cf4ab3e574f4fcf7193fe245908a195. Test system: ltc-test-ci1.aus.stglabs.ibm.com password: passw0rd
[ 226.151759] virbr0: topology change detected, propagating
[17582052814.647948] Delta way too big! 17582052588050456478 ts=17582052814647941894 write stamp = 226597485416
[ 244.398673] Watchdog CPU:24 detected Hard LOCKUP other CPUS:0
[ 286.645022] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 286.645131] 0-...: (1 GPs behind) idle=a82/140000000000000/0 softirq=3930/3930 fqs=2906
[ 286.645217] (detected by 32, t=6002 jiffies, g=2328, c=2327, q=6125)
[ 286.645322] Sending NMI from CPU 32 to CPUs 0:
[ 297.706160] rcu_sched kthread starved for 1104 jiffies! g2328 c2327 f0x0 RCU_GP_DOING_FQS(4) ->state=0x0 ->cpu=64
[ 297.706259] rcu_sched I 0 9 2 0x00000800
[ 297.706337] Call Trace:
[ 297.706383] [c0000007f864f8c0] [c000000000063b68] ftrace_call+0x4/0xbc (unreliable)
[ 297.706496] [c0000007f864fa90] [c00000000001b038] __switch_to+0x2f8/0x440
[ 297.706589] [c0000007f864faf0] [c000000000b09088] __schedule+0x2a8/0x9e0
[ 297.706679] [c0000007f864fbc0] [c000000000b09808] schedule+0x48/0xc0
[ 297.706770] [c0000007f864fbf0] [c000000000b0e790] schedule_timeout+0x1f0/0x4d0
[ 297.706875] [c0000007f864fce0] [c00000000018dc0c] rcu_gp_kthread+0x4fc/0xa60
[ 297.706979] [c0000007f864fdc0] [c00000000012beb8] kthread+0x168/0x1b0
[ 297.707071] [c0000007f864fe30] [c00000000000bc60] ret_from_kernel_thread+0x5c/0x7c
[ 394.696906] Watchdog CPU:16 detected Hard LOCKUP other CPUS:40
[ 394.697040] Watchdog CPU:40 Hard LOCKUP
[ 394.697106] Modules linked in: vhost_net vhost tap act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables i2c_opal ses i2c_core enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace kvm_hv sunrpc kvm_pr kvm xfs libcrc32c tg3 ptp pps_core
[ 394.698496] CPU: 40 PID: 2789 Comm: python2 Not tainted 4.14.0+ #3
[ 394.698579] task: c0000007ca8b9080 task.stack: c00000079e24c000
[ 394.698661] NIP: c0000000001b91e8 LR: c0000000001b9188 CTR: c00000000008ff10
[ 394.698756] REGS: c00000079e24f780 TRAP: 0501 Not tainted (4.14.0+)
[ 394.698838] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44244824 XER: 20000000
[ 394.699127] CFAR: c0000000001b91f0 SOFTE: 1
[ 394.699127] GPR00: c0000000001b9188 c00000079e24fa00 c000000001403900 0000000000000000
[ 394.699127] GPR04: 0000000000000400 0000000000000000 0000000000000000 c00000000fd60000
[ 394.699127] GPR08: 0000000000000001 0000000000000003 c0000007ff528ee0 0000000000000010
[ 394.699127] GPR12: c00000000008ff00 c00000000fd7a400
[ 394.699684] NIP [c0000000001b91e8] smp_call_function_many+0x388/0x420
[ 394.699769] LR [c0000000001b9188] smp_call_function_many+0x328/0x420
[ 394.699849] Call Trace:
[ 394.699894] [c00000079e24fa00] [c0000000001b9188] smp_call_function_many+0x328/0x420 (unreliable)
[ 394.700029] [c00000079e24fa70] [c000000000070ff8] pmdp_invalidate+0x98/0xe0
[ 394.700139] [c00000079e24faa0] [c00000000033b41c] change_huge_pmd+0x8c/0x390
[ 394.700325] [c00000079e24fb10] [c0000000002e6c3c] change_protection+0xb1c/0xfb0
[ 394.700512] [c00000079e24fca0] [c0000000003164f8] change_prot_numa+0x38/0xb0
[ 394.700698] [c00000079e24fcd0] [c000000000144510] task_numa_work+0x2f0/0x400
[ 394.700884] [c00000079e24fda0] [c000000000129100] task_work_run+0x140/0x1a0
[ 394.701045] [c00000079e24fe00] [c00000000001ca90] do_notify_resume+0xf0/0x100
[ 394.701233] [c00000079e24fe30] [c00000000000bec4] ret_from_except_lite+0x70/0x74
[ 394.701414] Instruction dump:
[ 394.701519] 812a0018 792807e1 41820034 4800001c 60000000 60000000 60000000 60000000
[ 394.701786] 60000000 60420000 7c210b78 7c421378 <812a0018> 792807e1 4082fff0 7c2004ac
Regards, -Satheesh
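To pull just the watchdog reports out of a long console log like the one above, a small filter can help. This is a sketch that assumes the message format seen in this log ("Watchdog CPU:N detected Hard LOCKUP other CPUS:M"), reading N as the CPU that raised the report and M as the CPU list it found stuck; the function name `lockup_reports` is made up for illustration.

```shell
#!/bin/bash
# Sketch only: lockup_reports is a hypothetical helper. It reads a kernel
# log on stdin and prints one line per watchdog report that names other CPUs.
lockup_reports() {
    # grep -o isolates each report even if several share one input line.
    grep -o 'Watchdog CPU:[0-9]* detected Hard LOCKUP other CPUS:[0-9,-]*' |
        sed 's/Watchdog CPU:\([0-9]*\) detected Hard LOCKUP other CPUS:\(.*\)/detector=\1 stuck=\2/'
}
```

For example, `dmesg | lockup_reports` or `lockup_reports < console.log` (the file name is a placeholder). Lines such as "Watchdog CPU:40 Hard LOCKUP", where a CPU reports itself, are deliberately not matched by this sketch.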
------- Comment From nevdull@us.ibm.com 2018-01-10 19:58:24 EDT------- Naveen, Satheesh - this seems to have stalled. Is there a clear 'next action' here? This is, after all, listed as a Ship Issue.
------- Comment From nevdull@us.ibm.com 2018-01-30 18:04:09 EDT------- Downgrading to a normal, in the absence of any new information from Naveen or Satheesh.
It sounds like from comment 7 that we need a host running ftrace and then when we start a guest we will see the host develop problems. Correct?
Can I have access to a host that I can use to investigate this further? ltc-test-ci1.aus.stglabs.ibm.com no longer seems available.
------- Comment From naveen.n.rao@in.ibm.com 2018-01-31 03:11:34 EDT------- Rick, Yes, sorry, I should have put out an update here. I discussed this with Satheesh in the context of https://bugzilla.linux.ibm.com/show_bug.cgi?id=164049 on Monday. I have started looking into this and should hopefully have some update this week.
------- Comment From nevdull@us.ibm.com 2018-02-01 16:54:18 EDT------- Added dependency to 164049. Depending on what Naveen and Satheesh discover, may even be a dup.
------- Comment From nevdull@us.ibm.com 2018-02-22 17:23:49 EDT------- Naveen, any update on this?
------- Comment From naveen.n.rao@in.ibm.com 2018-03-12 04:51:36 EDT------- I missed putting an update here and only did so on https://bugzilla.linux.ibm.com/show_bug.cgi?id=164049 . I am working on an upstream fix for this.
------- Comment From seg@us.ibm.com 2018-08-31 13:32:05 EDT------- Just declaring this fixed.