huttered40 / critter

Critical path analysis of MPI parallel programs
BSD 2-Clause "Simplified" License

Fix bug in ctf_ex cholinv #28

Closed · huttered40 closed this 4 years ago

huttered40 commented 4 years ago

A bug has been verified in ctf_ex_debug/main.cxx in which simple matrix multiplication is incorrect if critter::start and critter::stop guards are placed around the matrix multiplication.
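
Roughly, the guard pattern in question (a minimal sketch only; the exact critter::start/critter::stop signatures and the matrix-multiplication call are placeholders, not the actual ctf_ex_debug/main.cxx):

```cpp
#include <mpi.h>
#include "critter.h"

// Placeholder for the CTF matrix multiplication performed in ctf_ex_debug.
void run_matmul(){ /* ... */ }

int main(int argc, char** argv){
  MPI_Init(&argc, &argv);
  critter::start();   // guard: begin critter's tracking
  run_matmul();       // result is only incorrect when these guards are present
  critter::stop();    // guard: end tracking
  MPI_Finalize();
  return 0;
}
```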

First step in debugging would be to comment out the nonblocking support in critter.h and rebuild CTF. If the bug then disappears, clearly there is a problem with our overlapping support.

huttered40 commented 4 years ago

Verified that with all nonblocking support commented out in critter, ctf cholinv works just fine.

Next step: verify correctness if MPI_Isend and MPI_Irecv are kept (with the MPI_Request map commented out) but the MPI_Wait variants are commented out.

huttered40 commented 4 years ago

Verified that with the changes described above, the program is still correct. This tells me that the error lies entirely in critter's handling of the MPI_Wait variants. Note that with any MPI_Waitall, the order in which the requests complete should not matter, and we exploit this via a loop over MPI_Waitany. So I don't think the problem is in our handling of that.
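
For reference, the Waitall handling in question is roughly of this shape (a sketch, not critter's actual code; the function name and the placement of the bookkeeping are illustrative):

```cpp
#include <mpi.h>

// Complete an array of requests one at a time via MPI_Waitany, relying on the
// fact that completion order should not matter for MPI_Waitall semantics.
int waitall_via_waitany(int count, MPI_Request reqs[], MPI_Status stats[]){
  for (int i = 0; i < count; ++i){
    int idx;
    MPI_Status st;
    int ret = PMPI_Waitany(count, reqs, &idx, &st);  // completed entry becomes MPI_REQUEST_NULL
    if (ret != MPI_SUCCESS) return ret;
    if (stats != MPI_STATUSES_IGNORE) stats[idx] = st;
    // per-request critter bookkeeping (timers, critical-path data) would go here
  }
  return MPI_SUCCESS;
}
```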

huttered40 commented 4 years ago

Ok I have finally found the issue:

MPI_Bcast is the problem. I have commented out the start and stop for the interception of that routine only, and cholinv is correct! Non-blocking support is not the problem.

I have no idea why MPI_Bcast would be the problem though, because it works with cacqr2. Perhaps corruption of some buffers in stop?

huttered40 commented 4 years ago

After more debugging, I have isolated the issue, but it still makes no sense to me.

compute_all_crit suffers from an issue seemingly only when called from _critter_bcast::stop(). The problem is in get_crit_data, and I have taken that loop over all critters and broken it down into 19 individual get_crit_data calls with fixed indices. This also fails, but doesn't fail if I print the size of old_cs before doing so.

This makes the entire thing ridiculous and makes me think it's a race condition. However, there should be only a single thread running per process (unless ctf is doing something weird internally).
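
One way to probe the memory-corruption suspicion (a sketch; old_cs is assumed here to be a std::vector, which may not match its real type):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bounds-checked access can surface an out-of-range read/write that the
// "print the size first" workaround happens to mask.
template <typename T>
const T& checked_get(const std::vector<T>& old_cs, std::size_t i){
  assert(i < old_cs.size());  // fail loudly instead of silently reading garbage
  return old_cs.at(i);        // .at() additionally throws std::out_of_range
}
```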

huttered40 commented 4 years ago

With critter, ctf, and ctf_ex all compiled with -g -O0, the bug is still present.

huttered40 commented 4 years ago

I split compute_all_crit into two separate routines: one for MPI_Bcast and one for all other routines. I have found that if I replace the communicator with MPI_COMM_SELF, I get correct results.

I am not sure what to make of this. I have tested MPI_Bcast using cacqr2, so the problem has to be in how CTF calls MPI_Bcast, right?

Could it have to do with CTF using too many communicators? That would probably depend on the recursion depth and on how CTF deals with communicators internally.
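
Roughly what the MPI_COMM_SELF experiment amounts to (a very rough sketch; the real body of the MPI_Bcast variant of compute_all_crit is not shown here, so the Allreduce is only a placeholder for whatever collective it performs):

```cpp
#include <mpi.h>
#include <vector>

void compute_crit_bcast_experiment(std::vector<double>& crit_data, MPI_Comm cm){
  (void)cm;  // the original code reduces over cm; see the swap below
  std::vector<double> result(crit_data.size());
  // Using MPI_COMM_SELF (a one-process communicator, so the reduction
  // degenerates to a local copy) is what restored correctness in cholinv.
  PMPI_Allreduce(crit_data.data(), result.data(), (int)crit_data.size(),
                 MPI_DOUBLE, MPI_MAX, MPI_COMM_SELF /* was: cm */);
  crit_data = result;
}
```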

huttered40 commented 4 years ago

If I change the ordering of PMPI_Bcast and stop() in the interception of MPI_Bcast in critter.h, I get correctness. Also, if I put PMPI_Bcast before both start and stop, I get correctness. It's only when PMPI_Bcast goes in the middle that I get a correctness issue. It really doesn't make any sense.

So now what does this mean? That buf gets corrupted in stop()? It can't be, because then, in the orderings where stop() runs before PMPI_Bcast, the send buffer would get corrupted before the broadcast and those cases would fail too.
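
For reference, the orderings being compared, as a sketch (the wrapper and helper names are stand-ins; only the orderings themselves come from the observations above):

```cpp
#include <mpi.h>

static void start_tracking_bcast(){ /* stand-in for _critter_bcast's start-side work */ }
static void stop_tracking_bcast(){  /* stand-in for stop(): compute_all_crit, etc.    */ }

int bcast_wrapper(void* buf, int count, MPI_Datatype dt, int root, MPI_Comm comm){
  start_tracking_bcast();                             // (1)
  int ret = PMPI_Bcast(buf, count, dt, root, comm);   // (2) the real broadcast
  stop_tracking_bcast();                              // (3)
  // Observed: the 1-2-3 ordering above breaks cholinv, while 1-3-2 (both
  // guards before PMPI_Bcast) and 2-1-3 (PMPI_Bcast first) are both correct.
  return ret;
}
```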

huttered40 commented 4 years ago

I tried changing the variable name buf to bbuf, but I still get the problem.

huttered40 commented 4 years ago

Next steps: let's run a critter job with the MPI_Bcast interception changed so that critter::start and critter::stop are both placed before PMPI_Bcast.

The goal for this will be to see whether we get any assertion errors on larger problem sizes with more processes and deeper recursion.

huttered40 commented 4 years ago

First test: we have a hang in the very first test of ctf cholinv: (8192,3,-2). Is it due to the negative in the test file? Is it due to recursing too deep? Due to Scalapack Cholesky? It didn't throw an SPD error, so what the hell happened?

huttered40 commented 4 years ago

Figured this isolated issue out: I had compiled a different version of critter with ctf than with ctf_ex.

huttered40 commented 4 years ago

The first large tests did not help though: I get hangs in each at 1, 8, and 64 nodes.

In addition, the files have a negative integer key in them that makes it impossible for me to access them. I will address this by creating a special custom stringify function instead of using str.
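
A sketch of what such a stringify could look like (the "n" prefix is just one possible convention for avoiding a leading '-', not necessarily the one adopted):

```cpp
#include <string>

// Map a possibly negative integer key to a filesystem-friendly token.
std::string stringify(long long key){
  if (key < 0) return "n" + std::to_string(-key);  // e.g. -2 -> "n2"
  return std::to_string(key);
}
```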

We also need to add a special case to the instructions file so that the base-case dimensions do not get too small relative to the number of processes.

For sanity's sake, even though the test above is not completely analyzed, let's get rid of critter::start and critter::stop in the MPI_Bcast interception and see if we can get actual results.

huttered40 commented 4 years ago

Let's just create a log:

huttered40 commented 4 years ago

Problems identified after running the tests above:

huttered40 commented 4 years ago

I just found and (hopefully) fixed a clear error in compute_all_avg. I do not think this could have caused the original loss-of-SPD issue, but I think it will perhaps fix some of the issues.

huttered40 commented 4 years ago

I still get the same strange error of .25 for each variant at 64 nodes, and 0 error at 8 nodes.

huttered40 commented 4 years ago

The take-away from all these tests is the following:

Even the mere presence of the preprocessor directive intercepting MPI_Bcast is enough to cause a residual error of .25 on 64 nodes, while the residual error on 8 nodes is 0. MPI_Bcast is the only MPI routine used by ctf cholinv that encounters this problem. My thinking is that the problem must be coming from within ctf.
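
Roughly, the preprocessor interception amounts to something like this (the wrapper name is hypothetical and the real critter macro differs; this is only to show why its mere presence reroutes every MPI_Bcast call in CTF):

```cpp
#include <mpi.h>

int critter_bcast_wrapper(void* buf, int count, MPI_Datatype dt,
                          int root, MPI_Comm comm){
  return PMPI_Bcast(buf, count, dt, root, comm);   // plus critter bookkeeping
}

// Any translation unit that includes this header now calls the wrapper
// wherever it wrote MPI_Bcast.
#define MPI_Bcast(buf, count, dt, root, comm) \
  critter_bcast_wrapper(buf, count, dt, root, comm)
```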

huttered40 commented 4 years ago

Bug fixed. One thing to note: never treat MPI_Requests by value. That was a huge problem I had with critter that I had no idea was causing problems.
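
As a rough illustration of the pitfall (not the actual fix): once a request completes, its old handle value is invalid, so any copy taken earlier, e.g. a map key or a handle copied into a wrapper, goes stale.

```cpp
#include <mpi.h>

void wrapper_wait(MPI_Request* user_req, MPI_Status* status){
  // Good: operate on the caller's handle, so MPI resets it to MPI_REQUEST_NULL.
  PMPI_Wait(user_req, status);

  // Bad (the by-value trap):
  //   MPI_Request copy = *user_req;
  //   PMPI_Wait(&copy, status);
  // The request completes, but *user_req still holds the old handle value, and
  // any bookkeeping keyed on that value (e.g. a std::map<MPI_Request, ...>) goes stale.
}
```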