huttered40 / critter

Critical path analysis of MPI parallel programs
BSD 2-Clause "Simplified" License

Fix bug in ctf_ex cholinv #28

Closed · huttered40 closed this 4 years ago

huttered40 commented 4 years ago

A bug has been verified in ctf_ex_debug/main.cxx in which simple matrix multiplication is incorrect if critter::start and critter::stop guards are placed around the matrix multiplication.
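
Roughly, the guard pattern in question (a minimal sketch only; the exact critter::start/critter::stop signatures and the matrix-multiplication call are placeholders, not the actual ctf_ex_debug/main.cxx):

```cpp
#include <mpi.h>
#include "critter.h"

// Placeholder for the CTF matrix multiplication performed in ctf_ex_debug.
void run_matmul(){ /* ... */ }

int main(int argc, char** argv){
  MPI_Init(&argc, &argv);
  critter::start();   // guard: begin critter's tracking
  run_matmul();       // result is only incorrect when these guards are present
  critter::stop();    // guard: end tracking
  MPI_Finalize();
  return 0;
}
```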

First step in debugging would be to comment out the nonblocking support in critter.h and rebuild CTF. If the bug then disappears, clearly there is a problem with our overlapping support.

huttered40 commented 4 years ago

Verified that with all nonblocking support commented out in critter, ctf cholinv works just fine.

Next step: verify correctness if MPI_Isend and MPI_Irecv are kept (with the MPI_Request map commented out) but the MPI_Wait variants are commented out.

huttered40 commented 4 years ago

Verified that with the changes described above, the program is still correct. This tells me that the error lies entirely in critter's handling of the MPI_Wait variants. Note that with any MPI_Waitall, the order in which the requests complete should not matter, and we exploit this via a loop over MPI_Waitany. So I don't think the problem is in our handling of that.
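
For reference, the Waitall handling in question is roughly of this shape (a sketch, not critter's actual code; the function name and the placement of the bookkeeping are illustrative):

```cpp
#include <mpi.h>

// Complete an array of requests one at a time via MPI_Waitany, relying on the
// fact that completion order should not matter for MPI_Waitall semantics.
int waitall_via_waitany(int count, MPI_Request reqs[], MPI_Status stats[]){
  for (int i = 0; i < count; ++i){
    int idx;
    MPI_Status st;
    int ret = PMPI_Waitany(count, reqs, &idx, &st);  // completed entry becomes MPI_REQUEST_NULL
    if (ret != MPI_SUCCESS) return ret;
    if (stats != MPI_STATUSES_IGNORE) stats[idx] = st;
    // per-request critter bookkeeping (timers, critical-path data) would go here
  }
  return MPI_SUCCESS;
}
```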

huttered40 commented 4 years ago

Ok I have finally found the issue:

MPI_Bcast is the problem. I have commented out the start and stop for the interception of that routine only, and cholinv is correct! Non-blocking support is not the problem.

I have no idea why MPI_Bcast would be the problem though, because it works with cacqr2. Perhaps corruption of some buffers in stop?

huttered40 commented 4 years ago

After more debugging, I have isolated the issue, but it still makes no sense to me.

compute_all_crit suffers from an issue seemingly only when called from _critter_bcast::stop(). The problem is in get_crit_data, and I have taken that loop over all critters and broken it down into 19 individual get_crit_data calls with fixed indices. This also fails, but doesn't fail if I print the size of old_cs before doing so.

This makes the entire thing ridiculous and makes me think it's a race condition. However, there should be only a single thread running per process (unless ctf is doing something weird internally).
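
One way to probe the memory-corruption suspicion (a sketch; old_cs is assumed here to be a std::vector, which may not match its real type):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bounds-checked access can surface an out-of-range read/write that the
// "print the size first" workaround happens to mask.
template <typename T>
const T& checked_get(const std::vector<T>& old_cs, std::size_t i){
  assert(i < old_cs.size());  // fail loudly instead of silently reading garbage
  return old_cs.at(i);        // .at() additionally throws std::out_of_range
}
```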

huttered40 commented 4 years ago

With critter, ctf, and ctf_ex all compiled with -g -O0, the bug is still present.

huttered40 commented 4 years ago

I split compute_all_crit into two separate routines: one for MPI_Bcast and one for all other routines. I have found that if I replace the communicator with MPI_COMM_SELF, I get correct results.

I am not sure what to make of this. I have tested MPI_Bcast using cacqr2, so the problem has to be in how CTF calls MPI_Bcast, right?

Could it have to do with CTF using too many communicators? That would probably depend on the recursion depth and on how CTF deals with communicators internally.
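
Roughly what the MPI_COMM_SELF experiment amounts to (a very rough sketch; the real body of the MPI_Bcast variant of compute_all_crit is not shown here, so the Allreduce is only a placeholder for whatever collective it performs):

```cpp
#include <mpi.h>
#include <vector>

void compute_crit_bcast_experiment(std::vector<double>& crit_data, MPI_Comm cm){
  (void)cm;  // the original code reduces over cm; see the swap below
  std::vector<double> result(crit_data.size());
  // Using MPI_COMM_SELF (a one-process communicator, so the reduction
  // degenerates to a local copy) is what restored correctness in cholinv.
  PMPI_Allreduce(crit_data.data(), result.data(), (int)crit_data.size(),
                 MPI_DOUBLE, MPI_MAX, MPI_COMM_SELF /* was: cm */);
  crit_data = result;
}
```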

huttered40 commented 4 years ago

If I change the ordering of PMPI_Bcast and stop() in the interception of MPI_Bcast in critter.h, I get correctness. Also, if I put PMPI_Bcast before both start and stop, I get correctness. It's only when PMPI_Bcast goes in the middle that I get a correctness issue. It really doesn't make any sense.

So now what does this mean? That buf gets corrupted in stop()? It can't be, because then, in the orderings where stop() runs before PMPI_Bcast, the send buffer would get corrupted before the broadcast and those cases would fail too.
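
For reference, the orderings being compared, as a sketch (the wrapper and helper names are stand-ins; only the orderings themselves come from the observations above):

```cpp
#include <mpi.h>

static void start_tracking_bcast(){ /* stand-in for _critter_bcast's start-side work */ }
static void stop_tracking_bcast(){  /* stand-in for stop(): compute_all_crit, etc.    */ }

int bcast_wrapper(void* buf, int count, MPI_Datatype dt, int root, MPI_Comm comm){
  start_tracking_bcast();                             // (1)
  int ret = PMPI_Bcast(buf, count, dt, root, comm);   // (2) the real broadcast
  stop_tracking_bcast();                              // (3)
  // Observed: the 1-2-3 ordering above breaks cholinv, while 1-3-2 (both
  // guards before PMPI_Bcast) and 2-1-3 (PMPI_Bcast first) are both correct.
  return ret;
}
```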

huttered40 commented 4 years ago

I tried changing the variable name buf to bbuf, but I still get the problem.

huttered40 commented 4 years ago

Next steps: let's run a critter job with the MPI_Bcast interception changed so that critter::start and critter::stop are both placed before PMPI_Bcast.

The goal for this will be to see whether we get any assertion errors on larger problem sizes with more processes and deeper recursion.

huttered40 commented 4 years ago

First test: we have a hang in the very first test of ctf cholinv: (8192,3,-2). Is it due to the negative in the test file? Is it due to recursing too deep? Due to Scalapack Cholesky? It didn't throw an SPD error, so what the hell happened?

huttered40 commented 4 years ago

Figured this isolated issue out: I had compiled a different version of critter with ctf than with ctf_ex.

huttered40 commented 4 years ago

The first large tests did not help though: I get hangs in each at 1, 8, and 64 nodes.

In addition, the files have a negative integer key in them that makes it impossible for me to access them. I will address this by creating a special custom stringify function instead of using str.
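
A sketch of what such a stringify could look like (the "n" prefix is just one possible convention for avoiding a leading '-', not necessarily the one adopted):

```cpp
#include <string>

// Map a possibly negative integer key to a filesystem-friendly token.
std::string stringify(long long key){
  if (key < 0) return "n" + std::to_string(-key);  // e.g. -2 -> "n2"
  return std::to_string(key);
}
```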

We also need to add a special case to the instructions file so that the base-case dimensions do not get too small relative to the number of processes.

For sanity's sake, even though the test above is not completely analyzed, let's get rid of critter::start and critter::stop in the MPI_Bcast interception and see if we can get actual results.

huttered40 commented 4 years ago

Let's just create a log:

huttered40 commented 4 years ago

Problems identified after running the tests above:

huttered40 commented 4 years ago

I just found and (hopefully) fixed a clear error in compute_all_avg. I do not think this could have caused the original loss-of-SPD issue, but I think it will perhaps fix some of the issues.

huttered40 commented 4 years ago

I still get the same strange error of .25 for each variant at 64 nodes, and 0 error at 8 nodes.

huttered40 commented 4 years ago

The take-away from all these tests is the following:

Even the mere presence of the preprocessor directive intercepting MPI_Bcast is enough to cause a residual error of .25 on 64 nodes, while the residual error on 8 nodes is 0. MPI_Bcast is the only MPI routine used by ctf cholinv that encounters this problem. My thinking is that the problem must be coming from within ctf.
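
Roughly, the preprocessor interception amounts to something like this (the wrapper name is hypothetical and the real critter macro differs; this is only to show why its mere presence reroutes every MPI_Bcast call in CTF):

```cpp
#include <mpi.h>

int critter_bcast_wrapper(void* buf, int count, MPI_Datatype dt,
                          int root, MPI_Comm comm){
  return PMPI_Bcast(buf, count, dt, root, comm);   // plus critter bookkeeping
}

// Any translation unit that includes this header now calls the wrapper
// wherever it wrote MPI_Bcast.
#define MPI_Bcast(buf, count, dt, root, comm) \
  critter_bcast_wrapper(buf, count, dt, root, comm)
```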

huttered40 commented 4 years ago

Bug fixed. One thing to note: never treat MPI_Requests by value. That was a huge problem I had with critter that I had no idea was causing problems.
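
As a rough illustration of the pitfall (not the actual fix): once a request completes, its old handle value is invalid, so any copy taken earlier, e.g. a map key or a handle copied into a wrapper, goes stale.

```cpp
#include <mpi.h>

void wrapper_wait(MPI_Request* user_req, MPI_Status* status){
  // Good: operate on the caller's handle, so MPI resets it to MPI_REQUEST_NULL.
  PMPI_Wait(user_req, status);

  // Bad (the by-value trap):
  //   MPI_Request copy = *user_req;
  //   PMPI_Wait(&copy, status);
  // The request completes, but *user_req still holds the old handle value, and
  // any bookkeeping keyed on that value (e.g. a std::map<MPI_Request, ...>) goes stale.
}
```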