Open dalcinl opened 3 years ago
Working fine for me.
@tonycurtis What UCX version are you using? What system you tested on?
This is my package list from system repositories:
$ rpm -qa | egrep 'ucx|pmix'
pmix-3.2.3-1.fc33.x86_64
ucx-1.10.1-1.fc33.x86_64
pmix-devel-3.2.3-1.fc33.x86_64
ucx-devel-1.10.1-1.fc33.x86_64
ucx-cma-1.10.1-1.fc33.x86_64
pmix-pmi-3.2.3-1.fc33.x86_64
pmix-pmi-devel-3.2.3-1.fc33.x86_64
as well as mpiexec for launcher (Open MPI 4.0.5, also from system packages), and this is my osss-ucx configure line:
./configure --prefix=/home/devel/shmem/osss --with-ucx --with-pmix --enable-debug --enable-logging
Any tips on how to further debug the issue on my side?
UCX git latest, 1.10.1. Tested on 2 different ARM Infiniband clusters, plus Fedora x86 VM (with knem transport). Also PMIx 3.2.3.
I appear to have eventually triggered the behavior you see. My quiet() is still using a nearly-deprecated UCX call: will try updating.
The cluster I was using went down over the weekend for some reason, just managed to find somewhere else that demonstrates this behavior. I suspect my lock code is aging badly with relaxed memory on e.g. ARM. Will take more investigation/rethink.
Please note that I'm running the reproducer in a rather unremarkable x64_64 desktop machine, Linux 5.12.7, no knem.
Yeah, it's behaving differently on 2 ARM systems too. The lock code comes from an earlier shmem implementation, may need to modernize. memory consistency is becoming much more difficult.
Have a go now. I think I've fixed it (make sure to be using the "main" branch)
Now have problems with shmem_test_lock
, run the following with at least two PEs. It is almost the same code, but instead of shmem_set_lock()
, I'm using while (shmem_test_lock);
:
#include <shmem.h>
#include <stdio.h>
int main(void) {
static int count = 0;
shmem_init();
#if 0
static long _lock = 0;
long *lock = &_lock;
#else
long *lock = shmem_malloc(sizeof(long));
*lock = 0;
shmem_barrier_all();
#endif
int mype = shmem_my_pe();
while (shmem_test_lock(lock));
int val = shmem_g(&count, 0); /* get count value on PE 0 */
printf("%d: count is %d\n", mype, val);
val++; /* incrementing and updating count on PE 0 */
shmem_p(&count, val, 0);
shmem_clear_lock(lock); /* ensures count update completes before clearing the lock */
shmem_finalize();
return 0;
}
I'm running Fedora 33 with system packages openmpi-4.0.5, ucx-1.10.1, gcc-10.3.1
Reproducer (slight modification of the OpenSHMEM 1.5 spec document):
Note: If you switch
#if 0
->#if 1
, then things work as expected: