pmem / pmdk

Persistent Memory Development Kit
https://pmem.io

ppc64el obj_basic_integration/TEST5 crashed on valgrind (debian + ubuntu; ppc64el) #6079

Closed bryceharrington closed 1 month ago

bryceharrington commented 6 months ago

ISSUE: ppc64el obj_basic_integration/TEST5 crashed on valgrind (debian + ubuntu; ppc64el)

Environment Information

Please provide a reproduction of the bug:

Both Debian and Ubuntu are failing to build on the ppc64el architecture, where it used to build successfully at least a few months ago. I am guessing it started appearing after rebuilding against a newer linux-libc-dev?

How often bug is revealed: (always, often, rare): always

Actual Behavior

In Debian:
https://buildd.debian.org/status/fetch.php?pkg=pmdk&arch=ppc64el&ver=1.13.1-1.1%2Bb1&stamp=1708597682&raw=0
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064559

In Ubuntu:
https://launchpadlibrarian.net/724116691/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build1_BUILDING.txt.gz
https://launchpadlibrarian.net/724821331/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build2_BUILDING.txt.gz
https://bugs.launchpad.net/ubuntu/+source/pmdk/+bug/2061913

Details

obj_basic_integration/TEST5 crashed (signal 4). err5.log below.

obj_basic_integration/TEST5 err5.log {ut_backtrace.c:175 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:176 ut_sighandler} obj_basic_integration/TEST5: Signal 4, backtrace:
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 0: ./obj_basic_integration(+0xc9f8) [0x18c9f8]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 1: ./obj_basic_integration(+0xcb8c) [0x18cb8c]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:178 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log

Last 30 lines of memcheck5.log below (whole file has 48 lines).

obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4915EB7: util_pool_create_uuids (set.c:2521)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x49160FB: util_pool_create (set.c:2563)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941183: pmemobj_createU (obj.c:1164)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941643: pmemobj_create (obj.c:1244)
obj_basic_integration/TEST5 memcheck5.log ==89952== Your program just tried to execute an instruction that Valgrind
obj_basic_integration/TEST5 memcheck5.log ==89952== did not recognise. There are two possible reasons for this.
obj_basic_integration/TEST5 memcheck5.log ==89952== 1. Your program has a bug and erroneously jumped to a non-code
obj_basic_integration/TEST5 memcheck5.log ==89952== location. If you are running Memcheck and you just saw a
obj_basic_integration/TEST5 memcheck5.log ==89952== warning about a bad jump, it's probably your program's fault.
obj_basic_integration/TEST5 memcheck5.log ==89952== 2. The instruction is legitimate but Valgrind doesn't handle it,
obj_basic_integration/TEST5 memcheck5.log ==89952== i.e. it's Valgrind's fault. If you think this is the case or
obj_basic_integration/TEST5 memcheck5.log ==89952== you are not sure, please let us know and we'll try to fix it.
obj_basic_integration/TEST5 memcheck5.log ==89952== Either way, Valgrind will now raise a SIGILL signal which will
obj_basic_integration/TEST5 memcheck5.log ==89952== probably kill your program.
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== HEAP SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== in use at exit: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== total heap usage: 193 allocs, 154 frees, 433,659 bytes allocated
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== LEAK SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== definitely lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== indirectly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== possibly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== still reachable: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== suppressed: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== Reachable blocks (those to which a pointer was found) are not shown.
obj_basic_integration/TEST5 memcheck5.log ==89952== To see them, rerun with: --leak-check=full --show-leak-kinds=all
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== For lists of detected and suppressed errors, rerun with: -s
obj_basic_integration/TEST5 memcheck5.log ==89952== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There are also some instances of valgrind crashes:

pmempool_feature/TEST4: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1396902 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck4.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature4😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep4.log
pmempool_feature/TEST4 crashed (signal 4). grep4.log below.

RUNTESTS: stopping: pmempool_feature/TEST4 failed, TEST=check FS=any BUILD=debug
pmempool_feature/TEST5: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1397154 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck5.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature5😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep5.log
pmempool_feature/TEST5 crashed (signal 4). grep5.log below.
pmempool_feature/TEST5 grep5.log query SHUTDOWN_STATE result is 1

1

Last 30 lines of memcheck5.log below (whole file has 65 lines).

pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address 0x4B59240
pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush (init.c:53)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B519C7: pmem_flush (pmem.c:229)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B51A6B: pmem_persist (pmem.c:240)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CA93: util_persist (util_pmem.h:27)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CBA7: util_persist_auto (util_pmem.h:40)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492DDC3: set_hdr (feature.c:256)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E143: feature_set (feature.c:325)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E967: disable_shutdown_state (feature.c:500)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492EF2F: pmempool_feature_disableU (feature.c:662)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492F1AB: pmempool_feature_disable (feature.c:738)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: feature_perform (feature.c:110)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: pmempool_feature_func (feature.c:206)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x18A45B: main (pmempool.c:271)
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== HEAP SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== in use at exit: 52,839 bytes in 21 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== total heap usage: 64 allocs, 43 frees, 108,953 bytes allocated
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== LEAK SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== definitely lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== indirectly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== possibly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== still reachable: 50,479 bytes in 16 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== suppressed: 2,360 bytes in 5 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== Reachable blocks (those to which a pointer was found) are not shown.
pmempool_feature/TEST5 memcheck5.log ==1397154== To see them, rerun with: --leak-check=full --show-leak-kinds=all
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== For lists of detected and suppressed errors, rerun with: -s
pmempool_feature/TEST5 memcheck5.log ==1397154== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

janekmi commented 6 months ago

Hi. Thanks for the report. Sadly, we do not support ppc64. But since you are suggesting the issue might be related to the latest linux-libc-dev the question is: have you tried to build it on amd64 and using the same software components?

pbalcer commented 6 months ago

This is either a glibc or valgrind issue. See this message from the attached log:

Your program just tried to execute an instruction that Valgrind did not recognise. There are two possible reasons for this.

  1. Your program has a bug and erroneously jumped to a non-code location. If you are running Memcheck and you just saw a warning about a bad jump, it's probably your program's fault.
  2. The instruction is legitimate but Valgrind doesn't handle it, i.e. it's Valgrind's fault. If you think this is the case or you are not sure, please let us know and we'll try to fix it.

Either way, Valgrind will now raise a SIGILL signal which will probably kill your program.

Illegal opcode at address 0x4B59240

The instruction in question is this one: https://github.com/pmem/pmdk/blob/master/src/libpmem2/ppc64/init.c#L53 So it's most likely the latter. It's odd that it showed up after updating libc though. Maybe there's now some other instruction there?

As @janekmi mentioned, Intel does not provide support for the PPC backend of PMDK. See this README section for details.
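The triage step described above (find the "Illegal opcode" message, then read the innermost frame to locate the offending source line) can be sketched as a small script. This is only an illustration: `find_illegal_opcode` is a hypothetical helper, not part of PMDK or Valgrind, and it assumes the frame lines keep the `at 0x...: function (file:line)` shape seen in the memcheck excerpts in this report.

```python
import re

# Frame lines as they appear in the memcheck excerpts, e.g.
# "==1397154== at 0x4B59240: ppc_flush (init.c:53)"
FRAME_RE = re.compile(r"==\d+==\s+(?:at|by) (0x[0-9A-Fa-f]+): (\S+) \(([^)]+)\)")
ILLEGAL_RE = re.compile(r"==\d+==\s+Illegal opcode at address (0x[0-9A-Fa-f]+)")


def find_illegal_opcode(log_lines):
    """Return (address, function, location) for the first illegal opcode.

    Returns None when the log has no "Illegal opcode" message. Function and
    location are None when no stack frame follows the message.
    """
    lines = iter(log_lines)
    for line in lines:
        hit = ILLEGAL_RE.search(line)
        if hit is None:
            continue
        address = hit.group(1)
        # The innermost frame is the first frame line after the message.
        for follow in lines:
            frame = FRAME_RE.search(follow)
            if frame is not None:
                return address, frame.group(2), frame.group(3)
        return address, None, None
    return None


# Lines copied from the pmempool_feature/TEST5 excerpt in this report.
sample = [
    "pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address 0x4B59240",
    "pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush (init.c:53)",
    "pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B519C7: pmem_flush (pmem.c:229)",
]
print(find_illegal_opcode(sample))
```

On the excerpt from this report the helper points at `ppc_flush` in `init.c:53`, which is the line linked above.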

bryceharrington commented 5 months ago

Thanks for pointing to __DCBF as the likely instruction causing the issue.

have you tried to build it on amd64 and using the same software components?

Indeed; we build for a number of architectures for Ubuntu, and the failure on ppc64el was holding back those updated builds of pmdk in the Ubuntu 24.04 LTS. Only ppc64el hit this particular issue. We were undertaking a mass re-build of the entire archive for some distro-wide security fixes and performance improvements, and I think this might have been the first time pmdk got a rebuild since a new libc introduction, which is why I suspected that. Debian hit the issue earlier than us, but also updated libc before us.

And as I mentioned above, we could not ascertain if it is down to one cause or several. My gut says there may be additional missing instructions, but I did not acquire proof one way or the other. We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness (your advice/opinion on this point would be valued).

We also noted your documented limitations on support for this platform in the README (thanks for having that officially in writing), and took that into account as well in determining what to do on our end. We also are constrained in hardware access for this architecture for debugging purposes, as well as time and know-how limitations. We considered dropping support for the architecture ourselves for pmdk, but worried that would simply move the problem to dependencies, and instead have disabled the testsuite in our CI for ppc64el and listed it as a Known Issue in the 24.04 release notes.

Ideally, we'd like to supply a stronger resolution to this going forward (especially if this will regress pmdk ppc64el users), even if it means dropping the architecture as supported in Ubuntu. If you don't have inclination to investigate, that is probably the right long term solution here. However if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained.

pbalcer commented 5 months ago

Is your build system using PMDK's fork of valgrind (https://github.com/pmem/valgrind)? If it does, then it's possible that a new libc version is issuing instructions that the forked valgrind does not support. So that'd be an issue on PMDK's side. The fix is simply to rebase the valgrind fork to the latest upstream version. This is a problem we've encountered a few times in the past.

If it doesn't, then the bug is in valgrind and it should add support for the necessary instructions (so upstreaming parts of this patch). But if that's the case, how did this work before?

We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness

The primary use of valgrind in PMDK's test suite is to verify the correctness of its algorithms. However, end users may still encounter issues such as the one you've reported if they themselves run applications linked with libpmem under valgrind, if the valgrind version they are using does not support all the necessary instructions.

We also are constrained in hardware access for this architecture for debugging purposes, as well as time and know-how limitations.

PMDK's CI environment does not include any PPC system at this point in time. Given the state of the project, we are unlikely to invest to acquire one.

even if it means dropping the architecture as supported in Ubuntu

That would be my recommendation. For all intents and purposes, upstream PMDK does not offer non-experimental support for platforms other than x86-64. However, simply disabling valgrind checks in the CI is also a reasonable option. 99.9% of the code is shared between all platforms (https://github.com/pmem/pmdk/tree/master/src/libpmem2/ppc64 this directory contains most of what differs, it's all fairly simple). So all the core algorithms are tested regardless on x86 builds.

if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained

PMDK maintenance is currently done almost exclusively by Intel, and with very limited resources. We can help to some small extent, but ultimately you might want to reach out to IBM to ask whether having official PMDK packages in Ubuntu for their platforms is something they still care about.

bryceharrington commented 5 months ago

Is your build system using PMDK's fork of valgrind (https://github.com/pmem/valgrind)? If it does, then it's possible that a new libc version is issuing instructions that the forked valgrind does not support. So that'd be an issue on PMDK's side. The fix is simply to rebase the valgrind fork to the latest upstream version. This is a problem we've encountered a few times in the past.

It doesn't look like it to me, although I do note the presence of the valgrind headers in src/core. But as a build-dependency, valgrind 3.15 or newer is being required. Both pmdk and valgrind got rebuilt in the archives within the last month, so I'm doubtful this is simply ABI compatibility, particularly given it occurring only on the one architecture.
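As a rough aid for checking that build-dependency, the sketch below parses a Valgrind version string and compares it with the 3.15 minimum mentioned above. It assumes the `valgrind-X.Y.Z` form that `valgrind --version` prints for upstream releases; `parse_valgrind_version` and `meets_minimum` are hypothetical helper names, and a forked build may report a different string.

```python
import re

MIN_VALGRIND = (3, 15)  # minimum named in the comment above


def parse_valgrind_version(text):
    """Parse a string such as 'valgrind-3.22.0' into a tuple of ints."""
    match = re.search(r"valgrind-(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def meets_minimum(text, minimum=MIN_VALGRIND):
    """Compare only as many version components as the minimum specifies."""
    return parse_valgrind_version(text)[: len(minimum)] >= minimum


for candidate in ("valgrind-3.14.0", "valgrind-3.15.0", "valgrind-3.22.0"):
    print(candidate, meets_minimum(candidate))
```

Passing this check says nothing about whether a given Valgrind build decodes every instruction libpmem issues on ppc64el; it only confirms the packaging constraint.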

We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness

The primary use of valgrind in PMDK's test suite is to verify the correctness of its algorithms. However, end users may still encounter issues such as the one you've reported if they themselves run applications linked with libpmem under valgrind, if the valgrind version they are using does not support all the necessary instructions.

That's good to note, thanks. If potential issues would be limited to use of valgrind, then anyone developing on ppc64el in Ubuntu 24.04 would presumably have some options to work around the issue; hopefully for those cases the 24.04 release notes will be enough of a clue. If reports of tangible problems affecting users crop up we can re-evaluate, but for now it sounds like we should continue to provide the package on this architecture in hopes that it helps more than harms.

even if it means dropping the architecture as supported in Ubuntu

That would be my recommendation. For all intents and purposes, upstream PMDK does not offer non-experimental support for platforms other than x86-64. However, simply disabling valgrind checks in the CI is also a reasonable option. 99.9% of the code is shared between all platforms (https://github.com/pmem/pmdk/tree/master/src/libpmem2/ppc64 this directory contains most of what differs, it's all fairly simple). So all the core algorithms are tested regardless on x86 builds.

if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained

PMDK maintenance is currently done almost exclusively by Intel, and with very limited resources. We can help to some small extent, but ultimately you might want to reach out to IBM to ask whether having official PMDK packages in Ubuntu for their platforms is something they still care about.

That's a good suggestion, we'll reach out to our contacts before deciding what to do on 24.10 and going forward. Thanks.

janekmi commented 1 month ago

I am closing for now. @bryceharrington please re-open if necessary. Thanks!

peter-bergner commented 1 week ago

Last 30 lines of memcheck5.log below (whole file has 65 lines).
pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address 0x4B59240
pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush (init.c:53)

I think the problem here is that valgrind is being too smart and recognizing that the "dcbf r0,RB,6" version of the instruction is a Power10 version of dcbf where the extra L operand was added. When I execute a simple binary with "dcbf r0,RB,6" on a Power10 system (assuming RB points to some real memory), it executes fine and valgrind has no problem with it. If I take the same binary and execute it on a Power9 system, then it again executes fine, but valgrind flags the dcbf instruction as illegal. The commit that added the __DCBF usage mentions that ISA 3.1 says the L=6 version of the instruction acts like the L=0 version on older processors. I couldn't find where ISA 3.1 says that, so could the author of that commit point to where in ISA 3.1 that statement appears?

If it is true that L=6 should act like L=0 on older than Power10 cpus, then valgrind shouldn't flag the instruction as illegal when run on those older cpus. I'll verify L=6 is ok on Power9 and earlier with our hardware team and will talk with my valgrind developer about a fix if that is the case.
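To make the L operand concrete, here is a minimal sketch of how the two forms of the instruction differ at the bit level. It assumes the X-form layout with a 3-bit L field shifted left by 21, RA shifted by 16 and RB shifted by 11, on top of the base word 0x7C0000AC (primary opcode 31, extended opcode 86). `encode_dcbf` is a hypothetical helper, and r9 is an arbitrary stand-in for RB.

```python
DCBF_BASE = 0x7C0000AC  # primary opcode 31, extended opcode 86


def encode_dcbf(ra, rb, l):
    """Build the 32-bit instruction word for `dcbf RA,RB,L`."""
    if not (0 <= ra < 32 and 0 <= rb < 32):
        raise ValueError("register numbers must be in 0..31")
    if not 0 <= l < 8:
        raise ValueError("L must fit in three bits")
    return DCBF_BASE | (l << 21) | (ra << 16) | (rb << 11)


def decode_l(word):
    """Extract the assumed 3-bit L field from an instruction word."""
    return (word >> 21) & 0x7


word_l0 = encode_dcbf(0, 9, 0)  # plain dcbf, RB = r9 chosen arbitrarily
word_l6 = encode_dcbf(0, 9, 6)  # the L=6 form discussed above

print(f"L=0: {word_l0:#010x}")
print(f"L=6: {word_l6:#010x}")
print(f"differing bits: {word_l0 ^ word_l6:#010x}")
```

Under these assumptions the two words differ only in the L field, so a decoder that rejects unfamiliar L values will flag the L=6 word even though the rest of the encoding is an ordinary dcbf.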