ovis-hpc / ldms

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.
https://github.com/ovis-hpc/ovis-wiki/wiki

ldms_ls crash on wrong hostname instead of message. #897

Closed: baallan closed this issue 3 days ago

baallan commented 2 years ago

ldms_ls segfaults with the gdb trace below when given the arguments shown in the gdb run command; it behaves the same without gdb. By default, an unspecified -h argument means localhost. The 'proper' usage is for the user to specify the host option as the local IB interface name. The desired behavior is for ldms_ls not to crash, and instead to issue a helpful message, when an RDMA connection is attempted to a non-RDMA target.
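As an illustration only (this is not LDMS code, and `check_host` is a hypothetical helper), the kind of pre-connect validation that would turn the crash into a helpful message might look like the sketch below: resolve the -h argument up front and fail with a readable diagnostic before the name is ever handed to the RDMA transport.

```c
/* Hypothetical pre-connect hostname check; a sketch, not the ldms_ls
 * implementation. Resolves the host argument with getaddrinfo() and
 * prints a helpful error instead of proceeding with a bad name. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

static int check_host(const char *host)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc) {
        /* This is where a friendly message would replace the segfault. */
        fprintf(stderr, "ldms_ls: cannot resolve host '%s': %s\n",
                host, gai_strerror(rc));
        return -1;
    }
    freeaddrinfo(res);
    return 0;
}

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "localhost";
    return check_host(host) ? 1 : 0;
}
```

Note this only catches unresolvable names; a name that resolves but has no RDMA-capable interface (the hostx vs. hostx-ib0 case) would still need a check at connection time.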

    gdb ldms_ls
    (gdb) run -x rdma -a ovis -p 411 -v -A conf=/path/ldmsauth.conf
    Starting program: /opt/ovis.aarch64/current/sbin/ldms_ls -x rdma -a ovis -p 411 -v -A conf=/path/ldmsauth.conf
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x4000208ef200 (LWP 139016)]
    [New Thread 0x40002148f200 (LWP 139020)]
    Connection failed/rejected.

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x40002148f200 (LWP 139020)]
    0x0000400020b9821c in mlx5_free_db () from /lib64/libmlx5-rdmav2.so

    (gdb) bt
    #0  0x0000400020b9821c in mlx5_free_db () from /lib64/libmlx5-rdmav2.so
    #1  0x0000400020bad00c in mlx5_destroy_cq () from /lib64/libmlx5-rdmav2.so
    #2  0x000040002093e6d4 in ibv_destroy_cq () from /lib64/libibverbs.so.1
    #3  0x00004000000d47a0 in __rdma_teardown_conn (ep=0x44b3e0) at zap_rdma.c:462
    #4  z_rdma_destroy (zep=0x44b3e0) at zap_rdma.c:516
    #5  0x00004000005125bc in __destroy_ep (zep=<optimized out>) at zap.c:426
    #6  0x00004000000d7294 in _ref_put (name=<optimized out>, func=<optimized out>, line=2581, r=0x44b3e8) at ./../../ovis_ref/ref.h:70
    #7  z_rdma_handle_cm_event (thr=0x452a30, ctxt=0x452aa0) at zap_rdma.c:2581
    #8  z_rdma_io_thread_proc (arg=0x452a30) at zap_rdma.c:2693
    #9  0x0000400000637c48 in start_thread () from /lib64/libpthread.so.0
    #10 0x000040000042f600 in thread_start () from /lib64/libc.so.6

narategithub commented 2 years ago

@baallan Do I have access to this system? I can't reproduce this on OGC systems (we have mlx4, mthca, cxgb4). The commit ID I tested with is 78649a535d5d9b69ce7b0891440c80248de6071d. ldms_ls correctly reported

/opt/ovis/sbin/ldms_ls: Unknown host

with a long help string.

baallan commented 2 years ago

@tom95858 I don't know that you have root access on stria. I've set up a test daemon on stria-login1 that demonstrates the problem. When you have time to look, let me know.

Test case reminders to myself:

    module purge
    module use /projects/ovis/modules/stria
    module load ldms
    /opt/ovis.aarch64/INSTALLS/ovis-1bbc956/sbin/ldmsd -c /opt/ovis.aarch64/current/etc/sysconfig/ldms.d/ibtest.conf -v ERROR -n workaround@ -F
    ldms_ls -x rdma -h stln1-ib0 -a munge -p 415   # works
    ldms_ls -x rdma -h stln1 -a munge -p 415       # Segmentation fault

Build was with -g -O2 and not all the usual Red Hat security hardening flags.

tom95858 commented 2 years ago

I believe this is an issue with the MLX5/OFA install on STRIA. I don't think it has anything to do with LDMS.


-- Thomas Tucker, President, Open Grid Computing, Inc.

tom95858 commented 1 year ago

@baallan please close if fixed.

baallan commented 1 day ago

As of libibverbs-41.0-1.el8.aarch64 on TOSS4 (aka RHEL 8) this is still a problem. Stepping through the binary with gdb, it appears to be a bug (a missing check) in https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/cma.c, which fetches and dereferences a null pointer from struct index_map ucma_idm.

 Invalid read of size 8
    at 0x55E8E48: rdma_get_cm_event (in /usr/lib64/librdmacm.so.1.3.41.0)
    by 0x557748B: z_rdma_handle_cm_event (zap_rdma.c:2565)
    by 0x557748B: z_rdma_io_thread_proc (zap_rdma.c:2702)
    by 0x50F78B7: start_thread (in /usr/lib64/libpthread-2.28.so)
    by 0x4BD3AFB: thread_start (in /usr/lib64/libc-2.28.so)
  Address 0x0 is not stack'd, malloc'd or (recently) free'd

A proposed workaround for related symptoms is to run ibacm, but our systems do not run it. https://forums.developer.nvidia.com/t/rdma-cm-connection-setup-issues/207800

So the issue can stay closed, but I'm adding this in case we ever want to look into fixing rdma-core upstream. The ldms_ls/ldmsd symptom caused by this bug: if the hostname for an RDMA connection is misdefined (e.g., hostx instead of hostx-ib0, or the hostname defaults to localhost or equivalent), a segfault results during startup/reconfiguration.