Open nmhamster opened 5 years ago
I should have added that all the communications here are on-node and host-only; this specific code does not use CUDA or any kernel/code offloads.
@nmhamster the segfault originates in MueLu_Maxwell3D.exe (frame 6) calling 'atoi', probably on a corrupted string pointer (maybe as a result of using UCX). Is it possible to track in the application why the pointer is corrupted?
4 /lib64/libc.so.6(strtoll+0x2c) [0x3fff7ad5675c]
5 /lib64/libc.so.6(atoi+0x20) [0x3fff7ad51c30]
6 /ascldap/users/sdhammo/git/trilinos-github-repo/build-muelu-gcc-720-cuda-101105/packages/muelu/test/maxwell/MueLu_Maxwell3D.exe() [0x156bc38c]
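One way to start tracking this on the application side might be to swap the bare 'atoi' call for a checked wrapper that logs the pointer it was handed. The sketch below is only an illustration: checked_atoi is a hypothetical helper (it is not part of Trilinos, MueLu, or UCX), and it is built on strtol from the C standard library. It catches NULL pointers and non-numeric or out-of-range text. A non-NULL pointer into freed or overwritten memory can still fault, but the logged address at least shows what the caller passed in.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption for illustration, not an existing
 * Trilinos or UCX function): a checked stand-in for atoi().
 * Returns 0 and stores the parsed value in *out on success, -1 otherwise. */
static int checked_atoi(const char *text, int *out)
{
    char *end = NULL;
    long value;

    /* Log the raw pointer first, so a bad address is visible in the output
     * even if the dereference inside strtol() later faults. */
    fprintf(stderr, "checked_atoi: text=%p\n", (const void *)text);

    if (text == NULL || out == NULL) {
        fprintf(stderr, "checked_atoi: NULL pointer passed in\n");
        return -1;
    }

    errno = 0;
    value = strtol(text, &end, 10);

    /* No digits were consumed: the buffer does not hold a number. */
    if (end == text) {
        fprintf(stderr, "checked_atoi: no digits found\n");
        return -1;
    }

    /* Reject values that do not fit in an int. */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        fprintf(stderr, "checked_atoi: value out of range\n");
        return -1;
    }

    *out = (int)value;
    return 0;
}
```

Like 'atoi', this accepts leading whitespace and ignores trailing text after the digits, so it should be a drop-in replacement at the call site in frame 6 apart from the extra error return.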
Hi UCX Developers,
We are building our Trilinos tests against UCX 1.5.1 with OpenMPI 4.0.1, also configured for CUDA 10.1, on our POWER8 test systems. We have a new bug in one of our tests that looks like it is UCX related. The code in this sequence opens a file, reads it in small chunks, and broadcasts each chunk to all MPI ranks, which then process it locally to develop local partitions of a global (parallel) matrix. This code has been in our test suite for upwards of 11 years and has run well across all sorts of machines.
(See: https://github.com/trilinos/Trilinos/issues/5033)
When run with UCX enabled for OpenMPI the job crashes with the following:
If we completely disable UCX:
We get correct behavior.
For reference, our environment is configured to disable UCX memory caching (UCX_MEMTYPE_CACHE=n) because of the responses to UCX Issue #3550.
Checking the debug reference to libucs.so.0, I get the following:
Which is the segmentation-fault handler firing. I don't seem to get other references to UCX libraries in the stack trace for this particular crash. Are there other memory operations in UCX we should try to disable to see if we can track this down at all?
Thanks for your help.