hpc / quo-vadis

A cross-stack coordination layer to dynamically map runtime components to hardware resources
BSD 3-Clause "New" or "Revised" License

Example showcasing the issue with threads and ZMQ Sockets #163

Closed: GuillaumeMercier closed this 1 month ago

GuillaumeMercier commented 1 month ago

This PR demonstrates issue https://github.com/hpc/quo-vadis/issues/162

The program to test is test-pthread-split

My command line (with my Open MPI install) is: mpiexec --tag-output --report-bindings -np 1 -host localhost --map-by core:OVERSUBSCRIBE --bind-to core ./test-pthread-split

The current output is:

mpiexec --tag-output --report-bindings -np 1  -host localhost --map-by core:OVERSUBSCRIBE  --bind-to core   ./test-pthread-split 
[1,0]<stderr>: [Palamede:1091043] Rank 0 bound to package[0][core:0]
[1,0]<stdout>: # Starting Hybrid MPI + Pthreads test
[1,0]<stdout>: =================> mutex ctxt init done @0x75039481b780
[1,0]<stdout>: [1091048] mpi_job_scope taskid is 0
[1,0]<stdout>: [1091048] mpi_job_scope ntasks is 1
[1,0]<stdout>: [0] Number of NUMANodes in mpi_scope is 1
[1,0]<stdout>: [1091048] mpi_numa_scope taskid is 0
[1,0]<stdout>: [1091048] mpi_numa_scope ntasks is 1
[1,0]<stdout>: [1091048] Current cpubind before qv_bind_push() is 0,4
[1,0]<stdout>: [1091048] New cpubind after qv_bind_push() is 0-7
[1,0]<stdout>: [0] NUMA id is 0
[1,0]<stdout>: [0] Number of Cores in mpi_numa_scope is 4
[1,0]<stdout>: [0] Number of PUs in mpi_numa_scope is 8
[1,0]<stdout>: [0] Number of threads : 8
[1,0]<stdout>: Subscope[0] ptr  = 0x5c5fc05b76d0
[1,0]<stdout>: Subscope[1] ptr  = 0x5c5fc05b78e0
[1,0]<stdout>: Subscope[2] ptr  = 0x5c5fc05ba480
[1,0]<stdout>: Subscope[3] ptr  = 0x5c5fc05bc200
[1,0]<stdout>: Subscope[4] ptr  = 0x5c5fc05bdf80
[1,0]<stdout>: Subscope[5] ptr  = 0x5c5fc05bfd00
[1,0]<stdout>: Subscope[6] ptr  = 0x5c5fc05c1a80
[1,0]<stdout>: Subscope[7] ptr  = 0x5c5fc05c3800
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: Thread running on 0,4
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ==== Bind Push error 
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ==== Bind Push error 
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)
[1,0]<stdout>: Thread running on 2,6
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ==== Bind Push error 
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ==== Bind Push error 
[1,0]<stdout>: ================ lock taken @0x5c5fc0591ce8
[1,0]<stdout>: ================ lock freed @0x5c5fc0591ce8
[1,0]<stdout>: ==== Bind Push error 
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)
[1,0]<stdout>: Thread running on 1,5
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ===================== Coucou
[1,0]<stdout>: ===================== Coucou 2
[1,0]<stdout>: ===================== Coucou 3
[1,0]<stdout>: ===================== Coucou 4
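
For context, errno 156384763 is ZeroMQ's EFSM (ZMQ_HAUSNUMERO + 51, "operation cannot be accomplished in current state"). A REQ socket returns EFSM when its strict send/recv alternation is broken, which is what happens when several threads drive the same socket concurrently, since ZeroMQ sockets are not thread-safe. The sketch below reproduces that failure mode in isolation; it is an illustration only, not quo-vadis code, and the endpoint, message, and thread count are arbitrary.

```c
/* Illustration only: several threads sharing one ZMQ REQ socket.
 * REQ sockets enforce a strict send -> recv state machine and are not
 * thread-safe, so interleaved zmq_send()/zmq_recv() calls from different
 * threads eventually fail with EFSM (errno 156384763).
 * A REP service listening at the endpoint is assumed; without one the
 * recv calls simply time out instead of returning replies. */
#include <pthread.h>
#include <stdio.h>
#include <zmq.h>

static void *shared_req = NULL;   /* one REQ socket shared by all threads */

static void *worker(void *arg) {
    (void)arg;
    char buf[16];
    /* Without external serialization, two threads can both be in the
     * "send" state, or one can receive a reply meant for the other. */
    if (zmq_send(shared_req, "ping", 4, 0) == -1 ||
        zmq_recv(shared_req, buf, sizeof(buf), 0) == -1) {
        fprintf(stderr, "zmq error: %s (errno=%d)\n",
                zmq_strerror(zmq_errno()), zmq_errno());
    }
    return NULL;
}

int main(void) {
    void *ctx = zmq_ctx_new();
    shared_req = zmq_socket(ctx, ZMQ_REQ);
    int timeout = 1000;           /* ms, so the demo cannot hang forever */
    zmq_setsockopt(shared_req, ZMQ_RCVTIMEO, &timeout, sizeof(timeout));
    zmq_connect(shared_req, "tcp://127.0.0.1:5555");  /* arbitrary endpoint */

    pthread_t tids[8];
    for (int i = 0; i < 8; i++) pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++) pthread_join(tids[i], NULL);

    zmq_close(shared_req);
    zmq_ctx_term(ctx);
    return 0;
}
```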

When the call to sleep() in the program is uncommented, the output becomes:

[1,0]<stderr>: [Palamede:1112008] Rank 0 bound to package[0][core:0]
[1,0]<stdout>: # Starting Hybrid MPI + Pthreads test
[1,0]<stdout>: =================> mutex ctxt init done @0x71e92941b780
[1,0]<stdout>: [1112011] mpi_job_scope taskid is 0
[1,0]<stdout>: [1112011] mpi_job_scope ntasks is 1
[1,0]<stdout>: [0] Number of NUMANodes in mpi_scope is 1
[1,0]<stdout>: [1112011] mpi_numa_scope taskid is 0
[1,0]<stdout>: [1112011] mpi_numa_scope ntasks is 1
[1,0]<stdout>: [1112011] Current cpubind before qv_bind_push() is 0,4
[1,0]<stdout>: [1112011] New cpubind after qv_bind_push() is 0-7
[1,0]<stdout>: [0] NUMA id is 0
[1,0]<stdout>: [0] Number of Cores in mpi_numa_scope is 4
[1,0]<stdout>: [0] Number of PUs in mpi_numa_scope is 8
[1,0]<stdout>: [0] Number of threads : 8
[1,0]<stdout>: Subscope[0] ptr  = 0x5ae46b71f6d0
[1,0]<stdout>: Subscope[1] ptr  = 0x5ae46b71f8e0
[1,0]<stdout>: Subscope[2] ptr  = 0x5ae46b722480
[1,0]<stdout>: Subscope[3] ptr  = 0x5ae46b724200
[1,0]<stdout>: Subscope[4] ptr  = 0x5ae46b725f80
[1,0]<stdout>: Subscope[5] ptr  = 0x5ae46b727d00
[1,0]<stdout>: Subscope[6] ptr  = 0x5ae46b729a80
[1,0]<stdout>: Subscope[7] ptr  = 0x5ae46b72b800
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 0,4
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 1,5
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 2,6
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 3,7
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 0,4
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 1,5
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 2,6
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ================ lock taken @0x5ae46b6f9ce8
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ================ lock freed @0x5ae46b6f9ce8
[1,0]<stdout>: Thread running on 3,7
[1,0]<stdout>: Thread finished with '(null)'
[1,0]<stdout>: ===================== Coucou
[1,0]<stdout>: ===================== Coucou 2
[1,0]<stdout>: ===================== Coucou 3
[1,0]<stdout>: ===================== Coucou 4
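
The sleep presumably only staggers the threads enough that their request/reply exchanges on the shared RMI socket no longer interleave, which is why the second run completes cleanly; it hides the race rather than fixing it. A more robust pattern is to serialize each complete send/recv pair behind a mutex, or to give each thread its own socket, since ZeroMQ sockets are meant to be used from a single thread. The sketch below shows the mutex variant; the rmi_request() helper and its arguments are hypothetical, not the quo-vadis API.

```c
/* Sketch: serializing a shared REQ socket across threads.  Each complete
 * request/reply pair is guarded by one mutex, so the socket's strict
 * send -> recv alternation is never violated.  rmi_request() and "sock"
 * are hypothetical names, not quo-vadis code. */
#include <pthread.h>
#include <string.h>
#include <zmq.h>

static pthread_mutex_t rmi_lock = PTHREAD_MUTEX_INITIALIZER;

static int rmi_request(void *sock, const char *req, char *rep, size_t replen) {
    int rc = 0;
    pthread_mutex_lock(&rmi_lock);
    if (zmq_send(sock, req, strlen(req), 0) == -1 ||
        zmq_recv(sock, rep, replen, 0) == -1) {
        rc = -1;
    }
    pthread_mutex_unlock(&rmi_lock);
    return rc;
}
```
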
GuillaumeMercier commented 1 month ago

@samuelkgutierrez: I'd be more comfortable if this was put in a branch and not master, actually.

GuillaumeMercier commented 1 month ago

@samuelkgutierrez: all these checks are pretty annoying (to say the least). I don't know how to fix the last one. The assert doesn't seem to be detected.

samuelkgutierrez commented 1 month ago

> @samuelkgutierrez: all these checks are pretty annoying (to say the least). I don't know how to fix the last one. The assert doesn't seem to be detected.

To fix: read ret2, or wrap it in QVI_UNUSED.
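
For reference, this kind of warning typically appears when ret2 is consumed only by an assert(): under NDEBUG the assert expands to nothing, so the compiler sees a set-but-unused variable. A minimal sketch of both fixes follows, assuming QVI_UNUSED is a cast-to-void style macro (the definition shown is an assumption, not copied from the code base):

```c
/* Sketch only: silencing a set-but-unused warning when the value is
 * consumed solely by an assert().  The QVI_UNUSED definition here is an
 * assumption (cast to void); use the project's own macro in real code. */
#include <assert.h>

#ifndef QVI_UNUSED
#define QVI_UNUSED(x) ((void)(x))
#endif

static int some_call(void) { return 0; }   /* stand-in for the real call */

void example(void) {
    const int ret2 = some_call();
    assert(ret2 == 0);    /* compiled away when NDEBUG is defined */
    QVI_UNUSED(ret2);     /* explicit "read", so the warning disappears
                             even in release builds */
}
```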

samuelkgutierrez commented 1 month ago

> @samuelkgutierrez: I'd be more comfortable if this was put in a branch and not master, actually.

I think you have that ability. Try pushing it to a branch. If you can't, I'll see what I can do.

GuillaumeMercier commented 1 month ago

@samuelkgutierrez: I fixed the warning/error issue and created a new branch. I have therefore closed this PR, but the description is still valid.