Verified that with all nonblocking support commented out in `critter`, ctf `cholinv` works just fine.
Next step: verify correctness if `MPI_Isend` and `MPI_Irecv` are kept (with the `MPI_Request` map commented out), but the `MPI_Wait` variants are commented out.
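For reference, the nonblocking interception being toggled here looks roughly like this (a sketch only; the map name and the per-request bookkeeping fields are placeholders, not critter's actual code):

```cpp
#include <mpi.h>
#include <map>

struct nonblocking_info { int nbytes; double start_time; };
static std::map<MPI_Request, nonblocking_info> request_map;  // the "MPI_Request map"

int MPI_Isend(const void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request* request) {
  int type_size;
  PMPI_Type_size(datatype, &type_size);
  double t0 = PMPI_Wtime();
  int ret = PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
  // Note: keying the map by the request *value* is a form of treating
  // MPI_Request by value, which comes up again later in this thread.
  request_map[*request] = nonblocking_info{count * type_size, t0};
  return ret;
}
```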
Verified that with the changes described above, the program is still correct. This tells me that the error has to do entirely with `critter`'s handling of the `MPI_Wait` variants. Note that with any `MPI_Waitall`, the order in which the requests complete should not matter, and we exploit this via a loop over `MPI_Waitany`. So I don't think the problem is in our handling of that.
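Concretely, the pattern in question is something like this (a sketch; error handling and the critter bookkeeping are elided):

```cpp
#include <mpi.h>

// Complete an intercepted MPI_Waitall one request at a time via MPI_Waitany,
// which is valid because MPI_Waitall guarantees nothing about the order in
// which its requests complete.
int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[]) {
  for (int i = 0; i < count; ++i) {
    int idx;
    MPI_Status st;
    int ret = PMPI_Waitany(count, requests, &idx, &st);
    if (ret != MPI_SUCCESS) return ret;
    if (idx == MPI_UNDEFINED) break;                 // nothing left to complete
    if (statuses != MPI_STATUSES_IGNORE) statuses[idx] = st;
    // ... per-request critter bookkeeping would go here ...
  }
  return MPI_SUCCESS;
}
```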
OK, I have finally found the issue: `MPI_Bcast` is the problem. I have commented out the `start` and `stop` for the interception of that routine only, and `cholinv` is correct! Non-blocking support is not the problem.
I have no idea why `MPI_Bcast` would be the problem though, because it works with `cacqr2`. Perhaps corruption of some buffers in `stop`?
After more debugging, I have isolated the issue, but it still makes no sense to me. `compute_all_crit` suffers from an issue seemingly only when called from `_critter_bcast::stop()`. The problem is in `get_crit_data`, and I have taken that loop over all critters and broken it down into 19 individual `get_crit_data` calls with fixed indices. This also fails, but doesn't fail if I print the size of `old_cs` before doing so.

This makes the entire thing ridiculous and makes me think it's a race condition. However, there should be only a single thread running per process (unless CTF is doing something weird internally).
With `critter`, `ctf`, and `ctf_ex` all compiled with `-g -O0`, the bug is still present.
I separated `compute_all_crit` into two separate routines: one for `MPI_Bcast` and one for all others. I have found that if I replace the communicator with `MPI_COMM_SELF`, I get correctness.

I am not sure what to make of this. I have tested `MPI_Bcast` using `cacqr2`, so the problem has to be in how CTF calls `MPI_Bcast`, right?

Could it have to do with CTF using too many communicators? That probably depends on the recursion depth, and on how CTF deals with communicators internally.
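One possible reading, assuming (and this is only an assumption about critter's internals) that `compute_all_crit` reduces per-process path data with a collective over the communicator of the intercepted call: on `MPI_COMM_SELF` that reduction involves no inter-process communication, so it cannot interleave with whatever CTF is doing on its own communicators. A hypothetical shape of such a reduction:

```cpp
#include <mpi.h>

// Guess at the reduction inside compute_all_crit (not critter's actual code):
// with cm == MPI_COMM_SELF this degenerates to a local copy and no messages
// are exchanged.
void compute_all_crit_sketch(MPI_Comm cm, const double* local_costs,
                             double* crit_costs, int num_costs) {
  PMPI_Allreduce(local_costs, crit_costs, num_costs, MPI_DOUBLE, MPI_MAX, cm);
}
```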
If I change the ordering of `PMPI_Bcast` and `stop()` in the interception of `MPI_Bcast` in critter.h, I get correctness. Also, if I put `PMPI_Bcast` before both `start` and `stop`, I get correctness. It's only when `PMPI_Bcast` goes in the middle that I get a correctness issue. It really doesn't make any sense.
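For reference, the placements being compared look roughly like this (the wrapper shape and the start/stop signatures here are assumptions, not critter's exact code):

```cpp
#include <mpi.h>

struct bcast_tracker {                 // stand-in for critter's per-routine tracker
  void start(int count, MPI_Datatype dt, MPI_Comm cm) { /* record size, timer, comm */ }
  void stop()                                          { /* update path data */ }
};
static bcast_tracker _critter_bcast;

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
  // Failing ordering: the real broadcast sits between start() and stop().
  _critter_bcast.start(count, datatype, comm);
  int ret = PMPI_Bcast(buffer, count, datatype, root, comm);
  _critter_bcast.stop();
  // Reportedly correct orderings:
  //   start(); stop(); PMPI_Bcast(...);      or
  //   PMPI_Bcast(...); start(); stop();
  return ret;
}
```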
So now what does this mean? That `buf` gets corrupted in `stop()`? Can't be, because then the send buffer would get corrupted before `PMPI_Bcast`.
I tried changing the name `buf` to `bbuf`, but I still get the problem.
Next steps: let's run a critter job with `MPI_Bcast` interception relegated to putting `critter::start` and `critter::end` before `PMPI_Bcast`. The goal is to see whether we get any assertion errors on larger problem sizes with more processes and further recursion depths.
First test: we have a hang in the very first test of ctf `cholinv`: (8192,3,-2). Is it due to the negative in the test file? Is it due to recursing too deep? Due to the ScaLAPACK Cholesky? It didn't throw an SPD error, so what the hell happened?
Figured this isolated issue out: I had compiled a different version of critter with ctf than with ctf_ex.
The first large tests did not help though: I get a hang in each at 1, 8, and 64 nodes.
In addition, the files have a negative integer key in them that makes it impossible for me to access them. I will address this by creating a special custom function `stringify` instead of `str`.
We also need to add a special case to the instructions file so that the base case dimensions do not get too small relative to the number of processes.
For sanity's sake, even though the test above is not completely analyzed, let's get rid of `critter::start` and `critter::stop` in the `MPI_Bcast` interception and see if we can get actual results.
Let's just create a log:

- `critter::start` and `critter::stop` in `ctf_ex::cholinv`
- `critter::start` and `critter::stop` in `ctf_ex::cholinv`, but only `PMPI` inside `critter` itself (Critter+CTF build must match that used to build `ctf_ex::cholinv`)
- `Allgather`, `Allgatherv`, `Alltoall`, and `Alltoallv`
- `compute_all_crit` commented out in the `critter::stop` that ctf `cholinv` calls directly
- `compute_all_avg` commented out in the `critter::stop` that ctf `cholinv` calls directly
- `critter::track` branch so that interception goes straight to `PMPI`
- `MPI_Bcast`: `compute_all_crit` and `compute_all_avg` in `critter::stop`
- `MPI_Bcast`, the code for which is still removed entirely

Problems identified after running the tests above:
- `n=8192` on a single node with `ppn=64` seems to fail mysteriously for base case parameters `-1,0,1,2`. Note that although it produces output and doesn't hang for bc parameter `-2`, each iteration takes about 120 seconds. This seems to occur in test 1 and test 2 above, so it happens regardless of whether `critter` is being used to intercept MPI calls.
- Using `critter` without any interception is causing numerical error. I see that the reason for this is probably that we call `compute_all_crit` and `compute_all_avg` inside `critter::stop`.
- `critter`'s nonblocking support is not causing problems, nor is `compute_all_avg` or `compute_all_crit`. At this point, I want to get rid of all critter interception and try again.
- `PMPI` (perhaps those in tests 3, 4 above)
- `MPI_Bcast`, since in test 9 I left it as the only one removed, and the 64-node test ran with no error.
- `MPI_Bcast` interception directives are left in, never mind if they are used at all.
- `MPI_Bcast`
I just found and (hopefully) fixed a clear error in `compute_all_avg`. I do not think this could have caused the original loss of SPD issue, but I think it will perhaps fix some of the issues.
Still getting the same strange error of .25 for each variant at 64 nodes, and 0 error at 8 nodes.
The take-away from all these tests is the following: even the mere presence of the preprocessor directive intercepting `MPI_Bcast` is enough to cause a residual error of .25 on 64 nodes. The residual error on 8 nodes is 0. `MPI_Bcast` is the only MPI routine used by ctf `cholinv` that encounters this problem. My thinking is that the problem must be coming from within `ctf`.
Bug fixed. One thing to note is to never treat `MPI_Request` objects by value. That was a huge problem I had with critter that I had no idea was causing problems.
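To spell out the pitfall (a minimal illustration, not critter's code): the `MPI_Wait*` routines take `MPI_Request*` precisely because they modify the handle, setting it to `MPI_REQUEST_NULL` on completion, so bookkeeping that copies requests by value ends up waiting on copies while the caller's handles go stale.

```cpp
#include <mpi.h>
#include <vector>

// BUG: iterating by value waits on copies; the caller's handles are never
// reset to MPI_REQUEST_NULL and now refer to already-freed requests.
void wait_on_copies_bug(std::vector<MPI_Request>& user_reqs) {
  for (MPI_Request r : user_reqs)
    PMPI_Wait(&r, MPI_STATUS_IGNORE);
}

// OK: iterating by reference lets MPI update the caller's handles in place.
void wait_in_place_ok(std::vector<MPI_Request>& user_reqs) {
  for (MPI_Request& r : user_reqs)
    PMPI_Wait(&r, MPI_STATUS_IGNORE);
}
```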
A bug has been verified in ctf_ex_debug/main.cxx in which a simple matrix multiplication is incorrect if `critter::start` and `critter::stop` guards are placed around the matrix multiplication. The first step in debugging would be to comment out the nonblocking support in critter.h and rebuild CTF. If the bug then goes away, there is clearly a problem with our overlapping support.
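For context, the failing reproducer has roughly this shape (a sketch: the matrix sizes and random fills are illustrative, and `critter::start()`/`critter::stop()` are assumed to take no arguments; this is not the literal contents of ctf_ex_debug/main.cxx):

```cpp
#include <mpi.h>
#include <ctf.hpp>
#include "critter.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  {
    CTF::World dw(MPI_COMM_WORLD);
    int n = 64;
    CTF::Matrix<double> A(n, n, NS, dw), B(n, n, NS, dw), C(n, n, NS, dw);
    A.fill_random(-1., 1.);
    B.fill_random(-1., 1.);

    critter::start();                // guards around the multiplication
    C["ij"] = A["ik"] * B["kj"];     // result reported incorrect when guarded
    critter::stop();
  }
  MPI_Finalize();
  return 0;
}
```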