open-power / op-test

Testing Firmware for OpenPOWER systems
Apache License 2.0

EMStress Thread-1 hangs indefinitely #573

Open kyle-ibm opened 4 years ago

kyle-ibm commented 4 years ago

EMStress test hangs indefinitely. Initial debug points to Thread 1 test (CPU Governor change test) hanging because no return value was detected after a couple of minutes.

SUT OS: RHEL7.6alt 4.14.0-115.16.1.el7a.ppc64le
op-test LCB OS: Ubuntu 16.04.6 4.14.0-115.16.1.el7a.ppc64le

command: ./op-test -c SUT1 --run testcases.EMStress.RuntimeEMStress

~/op-test/test-reports/test-run-20200108014848$ ls -alh
total 1.5M
drwxrwxr-x  2 kloh kloh 4.0K Jan  8 01:49 .
drwxrwxr-x 19 kloh kloh 4.0K Jan  8 01:48 ..
-rw-rw-r--  1 kloh kloh  938 Jan  8 03:42 20200107174848837130.main.log
-rw-rw-r--  1 kloh kloh 620K Jan  8 05:20 20200107174848837593.debug.log
-rw-rw-r--  1 kloh kloh 254K Jan  8 04:48 20200108014848.log
-rw-rw-r--  1 kloh kloh 192K Jan  8 01:51 20200108014923-Thread-1.log    <==thread 1 hung after just 3 min
-rw-rw-r--  1 kloh kloh 3.8K Jan  8 05:20 20200108014927-Thread-2.log
-rw-rw-r--  1 kloh kloh  25K Jan  8 04:49 20200108014927-Thread-3.log
-rw-rw-r--  1 kloh kloh 163K Jan  8 05:09 20200108014928-Thread-4.log
-rw-rw-r--  1 kloh kloh 144K Jan  8 04:49 20200108014928-Thread-5.log

The tail of the Thread-1 log shows no return value after the last command:

$tail ~/op-test/test-reports/test-run-20200108014848/*Thread-1*
[console-expect]#for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > $j; done
for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > $j; done
bash: echo: write error: Invalid argument
bash: echo: write error: Invalid argument
..
..
bash: echo: write error: Invalid argument
bash: echo: write error: Invalid argument
[console-expect]#echo $?
echo $?
0
[console-expect]#for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > $j; done
for j in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > $j; done
bash: echo: write error: Invalid argument
bash: echo: write error: Invalid argument
..
..
bash: echo: write error: Invalid argument
bash: echo: write error: Invalid argument
kloh@openpowerlcb:~$
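
The check described above (did the last command in a thread log ever get its `echo $?` result?) can be automated. The following is a minimal sketch, assuming the untimestamped log layout shown in this tail, where each command line starts with `[console-expect]#`. The function name `last_command_status` is hypothetical and is not part of op-test.

```python
import re

PROMPT = "[console-expect]#"


def last_command_status(log_lines):
    """Return the exit status logged for the last command, or None if absent.

    Assumes every command is followed by an `echo $?` prompt line and that
    the status appears on a line of its own, as in the log excerpt above.
    """
    # Indices of all lines that begin with the console prompt.
    prompts = [i for i, line in enumerate(log_lines) if line.startswith(PROMPT)]
    if not prompts:
        return None
    last = prompts[-1]
    # If the final prompt is not the status query, the command never returned.
    if log_lines[last][len(PROMPT):].strip() != "echo $?":
        return None
    # The first all-digit line after the query is taken as the status.
    for line in log_lines[last + 1:]:
        if re.fullmatch(r"\d+", line.strip()):
            return int(line.strip())
    return None
```

Running this over each `*Thread-N.log` file would flag any thread whose final command has no recorded status, which is the symptom seen for Thread-1 here.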

kyle-ibm commented 4 years ago

If I stop the op-test script with Ctrl-C, the traceback points to the OpSSHThreadLinearVar1 class in OpTestThread.py:


^CException in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/kloh/op-test/common/OpTestThread.py", line 66, in run
    self.name, self.cmd_list, self.sleep_time, self.execution_time, self.ignore_fail)
  File "/home/kloh/op-test/common/OpTestThread.py", line 76, in inband_child_thread
    self.c.run_command(cmd)
  File "/home/kloh/op-test/common/OpTestSSH.py", line 225, in run_command
    return self.util.run_command(self, command, timeout, retry)
  File "/home/kloh/op-test/common/OpTestUtil.py", line 1611, in run_command
    output = self.try_command(term_obj, command, timeout)
  File "/home/kloh/op-test/common/OpTestUtil.py", line 1632, in try_command
    pty.sendline(command)
  File "/home/kloh/.local/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 577, in sendline
    return self.send(s + self.linesep)
  File "/home/kloh/.local/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 565, in send
    self._log(s, 'send')
  File "/home/kloh/.local/lib/python3.6/site-packages/pexpect/spawnbase.py", line 127, in _log
    self.logfile.flush()
BrokenPipeError: [Errno 32] Broken pipe
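
One general way to keep a worker thread from blocking forever on a command that never returns is to bound every command with a deadline. The sketch below illustrates the idea with the standard library only. It is a local stand-in, not op-test's SSH/pexpect path, and `run_with_deadline` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_deadline(argv, timeout_s):
    """Run argv and return its exit status, or None if the deadline passes.

    subprocess.run() kills the child process before raising TimeoutExpired,
    so a stuck command cannot hold the caller indefinitely.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return done.returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # A child that sleeps longer than the deadline is reported as None.
    status = run_with_deadline(
        [sys.executable, "-c", "import time; time.sleep(5)"], 0.5
    )
    print("timed out" if status is None else "exit status %d" % status)
```

A caller can treat `None` as a test failure and move on, instead of waiting for a return value that never arrives.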

kyle-ibm commented 4 years ago

Also, the test passes if I comment out the Thread-1 test in the script.

gautshen commented 4 years ago

On the system, what is the output of

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

and

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
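
Both values can also be collected in one step. This is a minimal sketch that reads the two sysfs files named above from a given cpufreq directory. The helper name `read_governor_info` is hypothetical, and the directory is passed in as a parameter so the function can be tried against any path.

```python
import os


def read_governor_info(cpufreq_dir):
    """Return (available_governors, current_governor) for one CPU.

    cpufreq_dir is a directory such as
    /sys/devices/system/cpu/cpu0/cpufreq on the system under test.
    """
    def read(name):
        with open(os.path.join(cpufreq_dir, name)) as handle:
            return handle.read().strip()

    # The available governors are listed on one line, separated by spaces.
    available = read("scaling_available_governors").split()
    current = read("scaling_governor")
    return available, current
```

On the SUT this would be called as `read_governor_info("/sys/devices/system/cpu/cpu0/cpufreq")`. Checking that the requested governor is in the first list before writing it separates an unsupported governor from other causes of a write error.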

Oracle-Chen commented 4 years ago

Hi gautshen,

After running the EMStress test, the two commands give the following output:

[2020-04-07 16:18:24] [console-expect]#cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
[2020-04-07 16:18:51] conservative ondemand userspace powersave performance schedutil
[2020-04-07 16:18:51] [console-expect]#cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
[2020-04-07 16:19:12] userspace

EMStress.RuntimeEMStress_SUT3.log

Peiyu-Jhong commented 4 years ago

We reran this test with the latest op-test version, and it still fails.

The log message is below:

[console-expect]#ERROR (12709.448s)
Log file: /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151745-Thread-1.log
logcmd: tee /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151745-Thread-1.log| sed -u -e 's/\r$//g'|cat -v
<subprocess.Popen object at 0x7fafdc6c7a90>
Log file: <_io.TextIOWrapper name=10 encoding='utf-8'>
Log file: /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151749-Thread-2.log
logcmd: tee /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151749-Thread-2.log| sed -u -e 's/\r$//g'|cat -v
<subprocess.Popen object at 0x7fafd8505f60>
Log file: <_io.TextIOWrapper name=16 encoding='utf-8'>
Log file: /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151749-Thread-3.log
logcmd: tee /home/ooo/Pnor_test/0313_op-test/test-reports/test-run-20200416151712/20200416151749-Thread-3.log| sed -u -e 's/\r$//g'|cat -v
<subprocess.Popen object at 0x7fafd8518588>
Log file: <_io.TextIOWrapper name=21 encoding='utf-8'>

======================================================================
ERROR [12709.448s]: runTest (testcases.EMStress.RuntimeEMStress)

Traceback (most recent call last):
  File "/home/ooo/Pnor_test/0313_op-test/testcases/EMStress.py", line 137, in runTest
    for core in range(1, num_avail_cores + 1):
TypeError: 'float' object cannot be interpreted as an integer

Ran 1 test in 12709.750s

FAILED (errors=1)

20200416_EMStress_fail.zip
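
The `TypeError` above means `num_avail_cores` was a float when it reached `range()`. How that value is computed is not shown in this thread. One common cause in Python 3 is the `/` operator, which always returns a float (`40 / 2` is `20.0`), whereas `//` keeps integers. The sketch below shows a guarded conversion; `core_range` is a hypothetical helper, not code from EMStress.py.

```python
def core_range(num_avail_cores):
    """Return range(1, n + 1), accepting a whole-number float such as 20.0.

    Raises ValueError for a fractional count instead of silently truncating.
    """
    count = int(num_avail_cores)
    if count != num_avail_cores:
        raise ValueError("core count is not a whole number: %r" % (num_avail_cores,))
    return range(1, count + 1)


if __name__ == "__main__":
    cores = 40 / 2                 # true division gives a float in Python 3
    print(type(cores).__name__)    # float
    print(list(core_range(cores))[:3])
```

If the value does come from a division, computing it with `//` at the source removes the need for any conversion.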

Gene-Lo commented 2 years ago

We ran this test again on RHEL 8.4, and it still fails.

《OP-Test Log》 test-run-20211211212625.zip

《SUT's Config》
[Kernel] 4.18.0-305.25.1.el8_4.ppc64le

[FW Config]
BMC: op940.22.mih-1-0-g41157d8d2e
Pnor: OP9_v2.4.1-4.31-prod

[HW Config]
CPU DD2.3 20 core 2
Micron Technology(MTA18ASF2G72PZ-2G9E1)16GiB x32
SAMSUNG PM985 (MZ1LB960HAJQ-00007) 960GB M.2 x1
PSU ACBEL 2000w 2
Slot1: 2-PORT 100Gb ROCE EN CONNECTX-5 GEN4 PCIe x16 LP CAPABLE ADAPTER