@trilinos/amesos @trilinos/amesos2 @srajama1
The poster would like to express a sense of urgency about this particular issue. The work-around is not satisfactory and the customers are getting more and more unhappy.
In particular, Tyler used the phrase "L2 Milestone FY17 Q4." I'm getting details about solvers now.
@krcb @MicheldeMessieres : Will you have some time to help out here? I believe you are already using Amesos and UMFPACK for testing. It is totally fine if you are not able to do it.
Discussions regarding @MicheldeMessieres and my availability are in progress.
@srajama1 @mhoemmen This could easily be because the status of the UMFPACK solver is not being broadcast to the other processors (I just looked at the code). KLU used to have this issue before the MPI broadcasts were inserted from processor 0 after the calls to symbolic fact., numeric fact., and solve. Also, if this UMFPACK solver is being used in conjunction with Ifpack as a preconditioner (block Jacobi), I still think there is an issue with singular subdomain solves. These are just the ideas that pop up in my head.
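For context, a minimal sketch of the broadcast pattern described above, assuming a hypothetical `status` integer returned by the rank-0 factorization; the actual Amesos_Umfpack member names and call sites may differ:

```cpp
#include <mpi.h>

// Sketch of the fix pattern described above: UMFPACK factors the matrix only on
// rank 0 (it is a serial solver), so the return status from that rank must be
// broadcast before any rank decides how to proceed.  "status" and "comm" are
// placeholders, not actual Amesos_Umfpack members.
int BroadcastFactorizationStatus(int status, MPI_Comm comm)
{
  // Only rank 0 holds the real UMFPACK return code; share it with all ranks.
  MPI_Bcast(&status, 1, MPI_INT, 0, comm);

  // Every rank now sees the same status and can abort collectively instead of
  // leaving the other ranks waiting in a later collective call (the hang).
  return status;
}
```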
@hkthorn: Thanks for the pointers. We are still waiting to get some input from @tgvoskuilen.
I was not using it with a preconditioner. I did try the same case with KLU and it did not hang. Let me know if you need other information from me.
@tgvoskuilen Does KLU perform well enough for you to use as a work-around, or do you really need UMFPACK fixed? @hkthorn 's comment suggests that the cause of this issue could be UMFPACK disliking your matrix. If that's the cause, then her suggested fix would just give you better error reporting without hangs; it wouldn't actually let you factor the matrix with UMFPACK.
So far KLU has been performing well enough to use for our problem as a work-around. Solve times are comparable to UMFPACK and it hasn't experienced the parallel hang.
I don't think it's an incompatibility between UMFPACK and my matrix though, since if I change the number of processors (or solve on 1 processor) UMFPACK handles it fine.
@tgvoskuilen : Thanks for the info. Can you send me, in an e-mail (not on GitHub), the milestone this is related to and some details on the problem you are trying to solve?
I thought there was some sort of pull request relating to this issue. Was there?
No, two different things. Amesos vs Amesos2. This one is probably next in the queue.
@tgvoskuilen I tested PR #1436 (which includes the fix suggested by @hkthorn ) with the matrices provided. With my driver code and the provided matrices it passed on a system with 16-core Haswell nodes, specifically the case with 12 MPI processes. However, I was not able to replicate the issue with the Trilinos develop branch without the PR fix. The PR has been accepted and merged into the develop branch; to confirm this is resolved, would you be able to test and see whether the issue persists, or alternatively could you send me a driver code that replicates the issue so I can test with it?
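For reference, a minimal sketch of the kind of driver used for such a test, assuming the captured matrix is available in Matrix Market format; the file name, right-hand side, and parameter choices below are placeholders, and error checking is omitted:

```cpp
#include <mpi.h>
#include "Epetra_MpiComm.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"
#include "Epetra_LinearProblem.h"
#include "EpetraExt_CrsMatrixIn.h"
#include "Amesos.h"

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  // Read the captured matrix; "hang_case.mtx" is a placeholder file name.
  Epetra_CrsMatrix* A = 0;
  EpetraExt::MatrixMarketFileToCrsMatrix("hang_case.mtx", comm, A);

  // Build a simple right-hand side and solution vector on the matrix's row map.
  Epetra_Vector x(A->RowMap());
  Epetra_Vector b(A->RowMap());
  b.PutScalar(1.0);

  Epetra_LinearProblem problem(A, &x, &b);

  // Create the UMFPACK solver through the Amesos factory and run the
  // symbolic/numeric factorization and solve phases where the hang was seen.
  Amesos factory;
  Amesos_BaseSolver* solver = factory.Create("Umfpack", problem);
  solver->SymbolicFactorization();
  solver->NumericFactorization();
  solver->Solve();

  delete solver;
  delete A;
  MPI_Finalize();
  return 0;
}
```

Run on the same MPI process count that triggered the hang (12 in the case above) so the row distribution matches the failing configuration.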
This issue should stay open until nightly tests with umfpack are added, issue #1255 @trilinos/framework @jwillenbring @srajama1
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
This issue was closed due to inactivity for 395 days.
I have been experiencing intermittent parallel hangs on multi-processor jobs using umfpack (solver block below). I've run it in the debugger and it's just sitting at two different MPI calls within the umfpack solver code. I captured matrix files for the case where it hangs and can provide them too.
```
begin trilinos equation solver direct
  Verbose
  solution method = amesos-umfpack
  fei output level = matrix_files
end
```