trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Test STKDoc_tests_stk_mesh_doc_tests_MPI_4 unit test StkMeshHowTo.useAutomaticGeneratedAura randomly failing/segfaulting in PR build gnu-8.5.0-openmpi-4.1.6-openmp since 2024-06-26 #13244

Open bartlettroscoe opened 4 months ago

bartlettroscoe commented 4 months ago

CC: @alanw0, @sebrowne, @achauphan

Next Action Status

## Description

As shown in [this query](https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&begin=2024-01-01&end=2024-07-16&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=STKDoc_tests_stk_mesh_doc_tests_MPI_4&field2=status&compare2=62&value2=Passed&field3=testoutput&compare3=95&value3=Segmentation%20fault) (click "Shown Matching Output" in upper right) the test:

* `STKDoc_tests_stk_mesh_doc_tests_MPI_4`

in the unique GenConfig build:

* `rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables`

started randomly failing/segfaulting on testing day 2024-06-26.

The specific set of CDash builds impacted were:

* `PR-13164-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-20`
* `PR-13165-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-17`
* `PR-13191-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-51`
* `PR-13197-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-73`
* `PR-13206-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-119`
* `PR-13212-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-98`

When the test segfaults, it looks like:

```
*** Starting test StkMeshHowTo.useNoAura
[ OK ] StkMeshHowTo.useNoAura (0 ms)
*** Starting test StkMeshHowTo.useAutomaticGeneratedAura
[ OK ] StkMeshHowTo.useAutomaticGeneratedAura (0 ms)
*** Starting test StkMeshHowTo.use_generate_new_ids
[ascic0194:3690832] *** Process received signal ***
[ascic0194:3690832] Signal: Segmentation fault (11)
[ascic0194:3690832] Signal code: Address not mapped (1)
[ascic0194:3690832] Failing at address: (nil)
[ascic0194:3690832] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7efe1cf64cf0]
[ascic0194:3690832] [ 1] /lib64/libc.so.6(__libc_malloc+0x146)[0x7efe1cc288c6]
```

## Current Status on CDash

Run the [above query](https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&begin=2024-01-01&end=2024-07-16&filtercount=3&showfilters=1&filtercombine=and&field1=testname&compare1=61&value1=STKDoc_tests_stk_mesh_doc_tests_MPI_4&field2=status&compare2=62&value2=Passed&field3=testoutput&compare3=95&value3=Segmentation%20fault), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

See:

* https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

bartlettroscoe commented 4 months ago

@achauphan and @sebrowne, did the frameworks monitoring of the randomly failing tests not pick up this random test failure?

I only decided to run this query after one of the PRs I was reviewing showed this failure. But this test had failed/segfaulted randomly five other times before since the end of last month (and no one bothered to post an issue for this?).

alanw0 commented 4 months ago

Thanks for the notification. That's troubling... I haven't seen that test fail in recent memory, and haven't known it to exhibit non-deterministic or random behavior. I'll look into it and try to resolve what's happening.

sebrowne commented 3 months ago

@bartlettroscoe I looked back through the history of the tool’s messages and it has not flagged that test at all. Remember, all we’re currently flagging are tests that failed, then passed on the same SHA1.

EDIT: It flagged it from last week, but not prior to that.
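
The screening rule described in this comment (flag a test that failed and then passed on the same SHA1) can be sketched roughly as below. This is only an illustration, not the framework team's actual tool: the `TestRun` record and the `flagFailedThenPassed` function are hypothetical names, and the runs are assumed to be supplied in chronological order.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one test execution; not a real CDash data type.
struct TestRun {
  std::string sha1;      // commit the build was run against
  std::string testName;  // name of the test
  bool passed;           // true if the test passed in this run
};

// Returns the names of tests that failed and later passed on the same SHA1.
// Assumes `runs` is ordered from oldest to newest.
std::set<std::string> flagFailedThenPassed(const std::vector<TestRun>& runs) {
  std::set<std::pair<std::string, std::string>> failedSoFar;
  std::set<std::string> flagged;
  for (const TestRun& run : runs) {
    const auto key = std::make_pair(run.sha1, run.testName);
    if (!run.passed) {
      failedSoFar.insert(key);        // remember the failure for this SHA1
    } else if (failedSoFar.count(key) > 0) {
      flagged.insert(run.testName);   // same SHA1 passed after failing
    }
  }
  return flagged;
}
```

Under this rule, a test that segfaults once on a given SHA1 and is never re-run on that SHA1 is not flagged, which would be consistent with the earlier failures of this test going unreported.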

bartlettroscoe commented 3 months ago

> @bartlettroscoe I looked back through the history of the tool’s messages and it has not flagged that test at all. Remember, all we’re currently flagging are tests that failed, then passed on the same SHA1.

Not surprising. The current screening approach will miss a lot of actual random failures.

> EDIT: It flagged it from last week, but not prior to that.

The next step is to run a query looking for that same test failure with similar output where that test is the only test failing in that build. That was the case with these particular test failures. You could write an automated tool to do this.
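
A minimal sketch of that kind of screen, assuming the build results have already been fetched into memory. The `BuildResult` record and `buildsWhereOnlyFailure` function are hypothetical names for illustration; this does not call CDash, and it does not attempt the "similar output" matching.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of one build; not a real CDash data type.
struct BuildResult {
  std::string buildName;                 // e.g. a PR build name
  std::vector<std::string> failedTests;  // names of tests that failed in it
};

// Returns the names of builds in which `testName` is the only failing test.
std::vector<std::string> buildsWhereOnlyFailure(
    const std::vector<BuildResult>& builds, const std::string& testName) {
  std::vector<std::string> matches;
  for (const BuildResult& build : builds) {
    if (build.failedTests.size() == 1 &&
        build.failedTests.front() == testName) {
      matches.push_back(build.buildName);
    }
  }
  return matches;
}
```

A build where this test is the lone failure is a stronger hint of a random failure than one where many tests fail together, since the latter more likely reflects a problem with the change under test.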

alanw0 commented 3 months ago

I've identified some undefined behavior associated with using something similar to `&vec[0]` on an empty vector, which can dereference a null pointer. Disappointingly, that often doesn't cause a seg-fault, but it can. In any case I will try to get an STK update into Trilinos as soon as I can.
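
For readers unfamiliar with the pattern, here is a small sketch of the general issue. It is not the STK code in question: `sum_buffer` and `sum_safe` are hypothetical helpers. Calling `operator[]` with index 0 on an empty `std::vector` is undefined behavior, whereas `data()` is valid on an empty vector as long as the returned pointer is not dereferenced.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums `count` ints starting at `ptr`.
// It never dereferences `ptr` when `count` is zero.
int sum_buffer(const int* ptr, std::size_t count) {
  int total = 0;
  for (std::size_t i = 0; i < count; ++i) {
    total += ptr[i];
  }
  return total;
}

// Risky pattern (left commented out on purpose): `&values[0]` evaluates
// values[0], which is undefined behavior when `values` is empty.
//
//   int sum_unsafe(const std::vector<int>& values) {
//     return sum_buffer(&values[0], values.size());
//   }

// Safer pattern: data() may be called on an empty vector; the pointer is
// simply never dereferenced because size() is zero.
int sum_safe(const std::vector<int>& values) {
  return sum_buffer(values.data(), values.size());
}
```

With libstdc++, building with `-D_GLIBCXX_ASSERTIONS` turns the out-of-range `operator[]` call into an abort, which can make this class of bug fail deterministically instead of randomly.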

alanw0 commented 3 months ago

This should be addressed by #13288. That pull request turned this test off. An upcoming STK update will fix the actual undefined behavior that is causing the test to be flaky.