libnxz / power-gzip

POWER NX zlib compliant library

Segmentation fault observed while running selftests [ test_multithread_stress ] #152

Closed: Pavithra1602 closed this issue 2 years ago.

Pavithra1602 commented 2 years ago

--- Steps to recreate ----

# cd power-gzip/
# make check

--- Logs ----

make[3]: Entering directory '/root/power-gzip/test'
PASS: test_adler32
PASS: test_buf_error
PASS: test_crc32
PASS: test_inflatesyncpoint
[252369.993669] show_signal_msg: 2 callbacks suppressed
[252369.993684] lt-test_multith[7489]: segfault (11) at 7fff89240400 nip 7fff8ad35e94 lr 7fff8ad35ec0 code 1 in libnxz.so.0.0.63[7fff8ad30000+40000]
[252369.993733] lt-test_multith[7487]: segfault (11) at 7fff89240400 nip 7fff8ad3604c lr 7fff8ad36030 code 1
[252369.993737] lt-test_multith[7489]: code: fa2100d8 3b2003e8 3b010090 3ab536d8 3ac0015d fa4100e0 3af73560 3a733628 
[252369.993738]  in libnxz.so.0.0.63[7fff8ad30000+40000]
[252369.993742] lt-test_multith[7489]: code: 3a80013d 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e 
[252369.993747] 
[252369.993755] lt-test_multith[7487]: code: e8690072 4bffcf01 e8410018 7dc37378 4bffc7d5 e8410018 e9c100c0 e9e100c8 
[252369.993762] lt-test_multith[7487]: code: ea0100d0 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e 
../test-driver: line 107:  7109 Segmentation fault      (core dumped) "$@" > $log_file 2>&1
FAIL: test_multithread_stress
PASS: test_pid_reuse
SKIP: test_abi
PASS: test_stress.auto
PASS: test_deflate.auto
PASS: test_inflate.auto
PASS: test_dict.auto
PASS: test_stress.sw
PASS: test_deflate.sw
PASS: test_inflate.sw
PASS: test_dict.sw
PASS: test_stress.nx
PASS: test_deflate.nx
PASS: test_inflate.nx
PASS: test_dict.nx
PASS: test_stress.mix
PASS: test_deflate.mix
PASS: test_inflate.mix
PASS: test_dict.mix
PASS: test_stress.mix2
PASS: test_deflate.mix2
PASS: test_inflate.mix2
PASS: test_dict.mix2
PASS: test_zeroinput.nx
============================================================================
Testsuite summary for libnxz 0.63
============================================================================
# TOTAL: 28
tuliom commented 2 years ago

@Pavithra1602, could you give us more details, please?

> # cd power-gzip/
> # make check

I tried to execute these commands and I was not able to reproduce it. How are you running configure?

Which Linux distribution are you using? Which commit ID did you use? What happens when you run make valgrind-check?

Pavithra1602 commented 2 years ago

> I tried to execute these commands and I was not able to reproduce it. How are you running configure?

The issue is not consistent; I observed it once in 3 trials. I am running it with the "make check" command.

> Which Linux distribution are you using?

I have seen this issue on RHEL 8.6 and RHEL 9.

> Which commit ID did you use?

I am using the master branch.

> What happens when you run make valgrind-check?

[root@ltcden13-lp3 power-gzip]# make valgrind-check
make -C test check \
LOG_COMPILER="valgrind" LOG_FLAGS="--leak-check=full --suppressions=/root/power-gzip/test/valgrind.supp --error-exitcode=1"
make[1]: Entering directory '/root/power-gzip/test'
make test_adler32 test_buf_error test_crc32 test_inflatesyncpoint test_multithread_stress test_pid_reuse test_stress test_deflate test_inflate test_dict test_zeroinput test_abi
make[2]: Entering directory '/root/power-gzip/test'
make[2]: 'test_adler32' is up to date.
make[2]: 'test_buf_error' is up to date.
make[2]: 'test_crc32' is up to date.
make[2]: 'test_inflatesyncpoint' is up to date.
make[2]: 'test_multithread_stress' is up to date.
make[2]: 'test_pid_reuse' is up to date.
make[2]: 'test_stress' is up to date.
make[2]: 'test_deflate' is up to date.
make[2]: 'test_inflate' is up to date.
make[2]: 'test_dict' is up to date.
make[2]: 'test_zeroinput' is up to date.
make[2]: Nothing to be done for 'test_abi'.
make[2]: Leaving directory '/root/power-gzip/test'
make check-TESTS
make[2]: Entering directory '/root/power-gzip/test'
make[3]: Entering directory '/root/power-gzip/test'
FAIL: test_adler32
FAIL: test_buf_error
FAIL: test_crc32
FAIL: test_inflatesyncpoint
FAIL: test_multithread_stress
FAIL: test_pid_reuse
FAIL: test_abi
FAIL: test_stress.auto
FAIL: test_deflate.auto
FAIL: test_inflate.auto
FAIL: test_dict.auto
FAIL: test_stress.sw
FAIL: test_deflate.sw
FAIL: test_inflate.sw
FAIL: test_dict.sw
FAIL: test_stress.nx
FAIL: test_deflate.nx
FAIL: test_inflate.nx
FAIL: test_dict.nx
FAIL: test_stress.mix
FAIL: test_deflate.mix
FAIL: test_inflate.mix
FAIL: test_dict.mix
FAIL: test_stress.mix2
FAIL: test_deflate.mix2
FAIL: test_inflate.mix2
FAIL: test_dict.mix2
FAIL: test_zeroinput.nx
============================================================================
Testsuite summary for libnxz 0.63
============================================================================
# TOTAL: 28
# PASS: 0
# SKIP: 0
# XFAIL: 0
# FAIL: 28
# XPASS: 0
# ERROR: 0
============================================================================
See test/test-suite.log
Please report to https://github.com/libnxz/power-gzip
make[3]: [Makefile:910: test-suite.log] Error 1
make[3]: Leaving directory '/root/power-gzip/test'
make[2]: [Makefile:1018: check-TESTS] Error 2
make[2]: Leaving directory '/root/power-gzip/test'
make[1]: [Makefile:1281: check-am] Error 2
make[1]: Leaving directory '/root/power-gzip/test'
make: [Makefile:1007: valgrind-check] Error 2
[root@ltcden13-lp3 power-gzip]#

Pavithra1602 commented 2 years ago

The issue is observed only on Denali.

tuliom commented 2 years ago

> Issue is not consistent, I observed this once in 3 trials. Running it using "make check" command

@Pavithra1602, make check won't execute without running configure first, unless you have some leftovers from a previous build.

Could you test if the following fails too, please?

mkdir test-libnxz && cd test-libnxz
git clone https://github.com/libnxz/power-gzip.git 
mkdir build && cd build
../power-gzip/configure
make -j$(nproc)
make -j$(nproc) check
Pavithra1602 commented 2 years ago

> make -j$(nproc) check

The issue is observed even with the above steps.

[root@ltcden13-lp3 build]# make -j$(nproc) check
Making check in lib
make[1]: Entering directory '/root/test-libnxz/build/lib'
make[1]: Nothing to be done for 'check'.
make[1]: Leaving directory '/root/test-libnxz/build/lib'
Making check in test
make[1]: Entering directory '/root/test-libnxz/build/test'
make test_adler32 test_buf_error test_crc32 test_inflatesyncpoint test_multithread_stress test_pid_reuse test_stress test_deflate test_inflate test_dict test_zeroinput test_abi
make[2]: Entering directory '/root/test-libnxz/build/test'
make[2]: 'test_adler32' is up to date.
make[2]: 'test_buf_error' is up to date.
make[2]: 'test_crc32' is up to date.
make[2]: 'test_inflatesyncpoint' is up to date.
make[2]: 'test_multithread_stress' is up to date.
make[2]: 'test_pid_reuse' is up to date.
make[2]: 'test_stress' is up to date.
make[2]: 'test_deflate' is up to date.
make[2]: 'test_inflate' is up to date.
make[2]: 'test_dict' is up to date.
make[2]: 'test_zeroinput' is up to date.
make[2]: Nothing to be done for '../../power-gzip/test/test_abi'.
make[2]: Leaving directory '/root/test-libnxz/build/test'
make check-TESTS
make[2]: Entering directory '/root/test-libnxz/build/test'
make[3]: Entering directory '/root/test-libnxz/build/test'
PASS: test_crc32
SKIP: test_abi
PASS: test_adler32
PASS: test_buf_error
PASS: test_inflatesyncpoint
PASS: test_dict.auto
PASS: test_dict.nx
PASS: test_stress.sw
PASS: test_stress.nx
PASS: test_dict.mix
PASS: test_stress.mix
PASS: test_stress.auto
PASS: test_dict.sw
PASS: test_dict.mix2
PASS: test_zeroinput.nx
PASS: test_stress.mix2
PASS: test_deflate.mix
PASS: test_deflate.auto
PASS: test_deflate.nx
PASS: test_deflate.sw
PASS: test_deflate.mix2
PASS: test_inflate.mix2
PASS: test_inflate.sw
PASS: test_inflate.nx
PASS: test_inflate.auto
PASS: test_inflate.mix
PASS: test_pid_reuse
[ 3545.210153] lt-test_multith[14752]: segfault (11) at 7fff8aa70400 nip 7fff90915e94 lr 7fff90915ec0 code 1
[ 3545.210154] lt-test_multith[14766]: segfault (11) at 7fff8aa70400 nip 7fff90915e94 lr 7fff90915ec0 code 1
[ 3545.210155] lt-test_multith[14765]: segfault (11) at 7fff8aa70400 nip 7fff9091604c lr 7fff90916030 code 1
[ 3545.210153] lt-test_multith[14762]: segfault (11) at 7fff8aa70400 nip 7fff90915e94 lr 7fff90915ec0 code 1
[ 3545.210181] in libnxz.so.0.0.63[7fff90910000+40000]
[ 3545.210181] in libnxz.so.0.0.63[7fff90910000+40000]
[ 3545.210181] in libnxz.so.0.0.63[7fff90910000+40000]
[ 3545.210189]
[ 3545.210195] lt-test_multith[14765]: code: e8690072 4bffcf01 e8410018 7dc37378 4bffc7d5 e8410018 e9c100c0 e9e100c8
[ 3545.210195] lt-test_multith[14762]: code: fa2100d8 3b2003e8 3b010090 3ab536e8 3ac0015d fa4100e0 3af73560 3a733638
[ 3545.210200] lt-test_multith[14762]: code: 3a80013d 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e
[ 3545.210200] lt-test_multith[14765]: code: ea0100d0 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e
[ 3545.210202] in libnxz.so.0.0.63[7fff90910000+40000]
[ 3545.210209]
[ 3545.210219] lt-test_multith[14766]: code: fa2100d8 3b2003e8 3b010090 3ab536e8 3ac0015d fa4100e0 3af73560 3a733638
[ 3545.210228] lt-test_multith[14766]: code: 3a80013d 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e
[ 3545.210265] lt-test_multith[14752]: code: fa2100d8 3b2003e8 3b010090 3ab536e8 3ac0015d fa4100e0 3af73560 3a733638
[ 3545.210274] lt-test_multith[14752]: code: 3a80013d 7c0004ac 7c3dfe0c e93c0008 <7c3d4f0d> 7d380026 7c0004ac 5529273e
../../power-gzip/test-driver: line 107: 11198 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: test_multithread_stress
============================================================================
Testsuite summary for libnxz 0.63
============================================================================
# TOTAL: 28
# PASS: 26
# SKIP: 1
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See test/test-suite.log
Please report to https://github.com/libnxz/power-gzip

tuliom commented 2 years ago

With @Pavithra1602's help, I was able to reproduce this issue. AFAICS, one of the threads from test_multithread_stress fails in nx_wait_for_csb() with:

CSB still not valid after 60 seconds, giving up

Then a thread (it's unclear whether it is the same one or another) segfaults while executing the paste instruction in nxu_run_job():

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fffb7865e94 in vas_paste (offset=0, paste_address=0x7fffb6260400)
    at ../../power-gzip/inc_nx/copy-paste.h:81
81              asm volatile(PPC_PASTE(%1, %2)";"

After the first error, nx_wait_for_csb() returns -ETIMEDOUT. It's necessary to investigate what happens after this value is returned.
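The timeout described above can be illustrated with a minimal C sketch of a bounded poll on a completion flag. This is not libnxz's nx_wait_for_csb(): `struct fake_csb`, `wait_for_valid_bit()` and `submit_next_job()` are hypothetical names, the atomic flag merely stands in for the CSB valid bit, and the caller policy (returning the error instead of going on to submit another job) is an assumption made for illustration, not a description of the fix in #159.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() and nanosleep() */
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a coprocessor status block (CSB). Here a
 * test or another thread sets `valid` to simulate job completion. */
struct fake_csb {
	atomic_bool valid;
	int completion_code;
};

/* Milliseconds elapsed between two CLOCK_MONOTONIC timestamps. */
static long long elapsed_ms(const struct timespec *start,
			    const struct timespec *now)
{
	long long ns = (long long)(now->tv_sec - start->tv_sec) * 1000000000LL
		       + (now->tv_nsec - start->tv_nsec);
	return ns / 1000000LL;
}

/* Hypothetical bounded poll: returns 0 once csb->valid is set, or
 * -ETIMEDOUT if it is still unset after timeout_ms milliseconds. */
static int wait_for_valid_bit(struct fake_csb *csb, long long timeout_ms)
{
	struct timespec start, now;
	const struct timespec pause = { 0, 100000 }; /* 100 us back-off */

	assert(csb != NULL);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (;;) {
		/* Acquire load pairs with a release store by whoever sets the flag. */
		if (atomic_load_explicit(&csb->valid, memory_order_acquire))
			return 0;
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (elapsed_ms(&start, &now) >= timeout_ms)
			return -ETIMEDOUT;
		nanosleep(&pause, NULL);
	}
}

/* Hypothetical caller: if the previous job never completed, propagate
 * -ETIMEDOUT instead of submitting another job (assumed policy). */
static int submit_next_job(struct fake_csb *prev, long long timeout_ms)
{
	int rc = wait_for_valid_bit(prev, timeout_ms);

	if (rc < 0)
		return rc;
	/* ... building and submitting the next job is omitted here ... */
	return 0;
}
```

With a flag that is already set, `wait_for_valid_bit()` returns 0 immediately; with a flag that is never set, it returns -ETIMEDOUT once `timeout_ms` has elapsed, and `submit_next_job()` passes that error straight back to its caller.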

rzinsly commented 2 years ago

This was fixed by #159.