Open carns opened 1 year ago
I'll try an environment that uses release versions of argobots and mercury to make sure there isn't some interaction there.
The only two ways I could see this happening are:
1) `ABT_thread_yield()` after `margo_forward`.
2) `t1 = ABT_get_wtime()` line, so `finalize_and_wait` doesn't have to wait for the RPC to complete. I find this unlikely since `margo_thread_sleep` in the RPC should yield to other ULTs, in particular the main ULT.

Confirmed that the failure only happens with mercury@master (and is still present with current @master). The argobots version is irrelevant.
1) seems plausible; perhaps something changed in mercury that altered the timing.
I'm not sure if `ABT_thread_yield()` would be sufficient to guarantee the timing we want in the permutations that use a dedicated progress thread. It will only yield to other ULTs eligible to run on the same ES (of which there may be none), right?
The test case is a "self" RPC, though, so we could possibly just synchronize between the RPC fn and the main fn before the former sleeps, to make sure that it is underway before the latter tries to finalize?
That's a good option, yes.
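
For illustration, here is a minimal sketch of the handshake suggested above. It uses POSIX threads as a stand-in for Argobots ULTs so that it compiles on its own; it is not the actual test code. The names `handshake`, `handler_fn`, `run_handshake_test`, and `SLEEP_MS` are hypothetical, and `pthread_join` merely plays the role of `finalize_and_wait`. In the real test an Argobots primitive such as an `ABT_eventual` would presumably replace the mutex and condition variable.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define SLEEP_MS 200 /* hypothetical handler sleep time */

/* Hypothetical handshake state shared by the handler and the main function. */
struct handshake {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int             started;
};

static struct handshake hs
    = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0};

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Stand-in for the RPC handler: announce that it is running, then sleep. */
static void *handler_fn(void *arg)
{
    struct timespec d = {0, SLEEP_MS * 1000000L};
    (void)arg;

    pthread_mutex_lock(&hs.mtx);
    hs.started = 1;
    pthread_cond_signal(&hs.cv);
    pthread_mutex_unlock(&hs.mtx);

    nanosleep(&d, NULL);
    return NULL;
}

/* Returns the seconds spent waiting for the handler after the handshake,
 * or -1.0 if the handler thread could not be created. */
static double run_handshake_test(void)
{
    pthread_t tid;
    double    t1, t2;

    hs.started = 0;
    if (pthread_create(&tid, NULL, handler_fn, NULL) != 0) return -1.0;

    /* Block until the handler is known to be underway. */
    pthread_mutex_lock(&hs.mtx);
    while (!hs.started) pthread_cond_wait(&hs.cv, &hs.mtx);
    pthread_mutex_unlock(&hs.mtx);

    t1 = now_sec();
    pthread_join(tid, NULL); /* stand-in for finalize_and_wait */
    t2 = now_sec();
    return t2 - t1;
}
```

With this ordering, the time returned by `run_handshake_test()` should be close to the handler's sleep time, because the timer only starts once the handler has signalled that it is running. A timing assertion on that value should still leave some slack for scheduling delays.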
Whoops. We misdiagnosed the problem. The RPC handler ULT was actually starting in time, but the `margo_forward()` takes a full 1 second, as if the `disable_response()` call didn't work.
I'm seeing a lot of failures that look like this in origin/main:
I believe the assertion is meant to confirm that `finalize_and_wait()` doesn't complete until pending RPCs are done. I'm not sure why I would see this problem now, though.

My spack environment looks like this: