In this issue, I summarize the lessons that I learned while developing MPIPETScDistributePLaplacianTests.jl, a parallel MPI program for the solution of the p-Laplacian nonlinear PDE. As agreed, the effort has been put into leveraging the NLsolve.jl Julia package. In the development process, as a result of marathon debugging sessions, I was able to explore quite deeply into the call path within this package that is triggered when setting up the nonlinear solver in Gridap.jl as nls = NLSolver(show_trace=true, method=:newton). All observations refer to the current version of NLsolve.jl, which at the time of writing is v4.4.0.
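For reference, the following is a minimal sketch (not the actual driver code) of how this solver object is plugged into a Gridap.jl driver; op is assumed to be the FEOperator encoding the p-Laplacian residual and Jacobian, whose construction is omitted here:

```julia
using Gridap

# Sketch only: `op` is assumed to be the FEOperator for the p-Laplacian problem.
nls = NLSolver(show_trace=true, method=:newton)  # keyword args are forwarded to NLsolve.jl
solver = FESolver(nls)
uh = solve(solver, op)  # triggers the NLsolve.jl call path discussed below
```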
Although I did not explore NLsolve.jl in full, and thus the observations below may not necessarily apply to other parts of the package, I am not sure whether this package has been carefully developed with HPC/performance in mind, or whether it is rather thought of as a toy/academic software effort. In particular, I have the following concerns:
It performs a deep copy of the Jacobian matrix and residual vector in its own data structures right at the beginning of the nonlinear problem solution process. See here.
Depending on the combination of input parameters, the Newton solver ends up performing unnecessary, costly computations. See this issue that I opened in NLsolve.jl. This is particularly severe with PETSc.jl as, currently, transpose(A) does not return a lazy transpose of A, but explicitly builds the transposed matrix. (This can be worked around in PETSc.jl, though; see the sketch after this list of concerns.)
Perhaps with fault tolerance in mind, the Newton solver performs, at each iteration, an O(n) check on the current iterate in search of NaNs. See here. It would be great if this sort of check could be optionally activated/deactivated.
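Regarding the transpose(A) point above, a possible workaround in PETSc.jl would be to return Julia's lazy LinearAlgebra.Transpose wrapper instead of eagerly assembling the transposed matrix, and to specialize products on the wrapper so that they call PETSc's transpose matrix-vector kernel. The following is only a sketch of the idea (I did not check the exact name of the low-level binding, so that part is left as a comment):

```julia
using LinearAlgebra
using PETSc

# Sketch: make transpose(A) an O(1) operation by returning the lazy wrapper
# instead of explicitly building the transposed PETSc matrix.
Base.transpose(A::PETSc.Mat{Float64}) = LinearAlgebra.Transpose(A)

# To complete the workaround, products involving the wrapper, e.g.,
#   LinearAlgebra.mul!(y, At::Transpose{Float64,<:PETSc.Mat{Float64}}, x)
# would be specialized to apply PETSc's transpose matrix-vector product
# (MatMultTranspose in the C API) to parent(At), so that Aᵀ is never formed.
```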
Apart from these concerns, NLsolve.jl makes extensive use of generic Julia interfaces, mainly the AbstractArray and broadcasting interfaces. While some of the methods in these interfaces are already efficiently supported by PETSc.jl, others are either not supported at all or not supported efficiently by the generic AbstractArray fallbacks. In order to have MPIPETScDistributePLaplacianTests.jl working, I had to overload a number of different methods for PETSc.Vec{Float64}. In the sequel, I enumerate what I had to overload and why; a simplified sketch of some of these overloads is given right after the list. At present, these method definitions are written at the driver level, but obviously we will have to move them to the appropriate packages. As I mentioned earlier, I did not go through the whole NLsolve.jl package, so it is likely that other methods would have to be defined as well if we switch to other solvers in NLsolve.jl.
The minimal broadcasting-related methods required to efficiently support this line. See lines start-end.
Defined in Base module: copyto!(...), maximum(...), and any(...) for PETSc.Vec{Float64} vectors. Required to support this, this, and this line, resp.
Defined in LinearAlgebra module: mul! and rmul!. Required to support this, and this line, resp.
Defined in NLSolversBase module: x_of_nans. Required to support this line. I think this particular line could be supported by extending the broadcast machinery above with op == Number; see the following line.
Defined in NLsolve module: check_isfinite. Required to support this line.
Defined in Distances module: SqEuclidean. Required to support this line. See also here.
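To give a flavour of what these overloads look like, below is a self-contained, simplified sketch of some of them, written against a toy MPIVec stand-in (locally owned entries plus an MPI communicator) rather than the actual PETSc.Vec{Float64}; the real definitions in the driver go through the PETSc.jl API instead. The key point is that reductions such as maximum and any must combine the local results with an MPI collective, which the generic AbstractArray fallbacks cannot do efficiently:

```julia
using MPI
using LinearAlgebra
import NLSolversBase, NLsolve

# Toy stand-in for PETSc.Vec{Float64}: locally owned entries plus the communicator.
struct MPIVec
  data::Vector{Float64}
  comm::MPI.Comm
end

# Base methods used by NLsolve.jl: local computation followed by a global MPI reduction.
Base.copyto!(dst::MPIVec, src::MPIVec) = (copyto!(dst.data, src.data); dst)
Base.maximum(f, v::MPIVec) = MPI.Allreduce(maximum(f, v.data), MPI.MAX, v.comm)
Base.any(f, v::MPIVec) = MPI.Allreduce(any(f, v.data) ? 1 : 0, MPI.MAX, v.comm) == 1

# LinearAlgebra methods: in-place scaling acts on the local part only
# (mul! with the distributed Jacobian is analogous and delegates to the matrix type).
LinearAlgebra.rmul!(v::MPIVec, a::Number) = (rmul!(v.data, a); v)

# NLSolversBase/NLsolve internals that also need distributed-aware definitions.
NLSolversBase.x_of_nans(v::MPIVec) = MPIVec(fill(NaN, length(v.data)), v.comm)
function NLsolve.check_isfinite(v::MPIVec)
  # NLsolve.jl throws its own exception type here; a plain error suffices for the sketch.
  any(!isfinite, v) && error("non-finite entries detected in the current iterate")
  nothing
end
```

The SqEuclidean evaluation required by Distances.jl follows the same pattern: a reduction over the local entries combined with an Allreduce over the communicator.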
Finally, I also have the following concerns:
Once the Jacobian matrix has been built for the first time, further Jacobian assembly operations are such that cell matrix entries are assembled into the global matrix on a one-by-one basis; see here and here. The same observation applies to residual assembly. I think that this may hurt performance when dealing with PETSc matrices and vectors (I did not evaluate how severe this actually is). The highest-granularity function in the PETSc API lets one add a 2D dense matrix (typically a cell matrix) into the global matrix in one shot, so it would be more performant to inject the whole cell matrix at once; see the sketch below. Should we modify Gridap.jl such that the granularity of these operations is increased?
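To make the granularity point concrete, the following schematic contrasts the two insertion patterns (this is not actual Gridap.jl/PETSc.jl code; add_entry! and add_cell_matrix! are hypothetical names used only for illustration). In the one-shot variant, the whole cell matrix would map onto a single MatSetValues call with ADD_VALUES in the PETSc API:

```julia
# Schematic only: `A` is the global (distributed) matrix, `cellmat` the dense cell
# matrix, and `rows`/`cols` the global DOF ids of the current cell.
# `add_entry!` and `add_cell_matrix!` are hypothetical names used for illustration.

# Current pattern: one insertion call per cell-matrix entry.
function assemble_cell_entry_by_entry!(A, cellmat::Matrix{Float64}, rows, cols)
  for (j, gj) in enumerate(cols), (i, gi) in enumerate(rows)
    add_entry!(A, cellmat[i, j], gi, gj)  # for PETSc, one MatSetValues call per entry
  end
  A
end

# Higher-granularity pattern: inject the whole cell matrix in one shot.
function assemble_cell_in_one_shot!(A, cellmat::Matrix{Float64}, rows, cols)
  add_cell_matrix!(A, cellmat, rows, cols)  # for PETSc, a single MatSetValues call
  A
end
```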
A second concern is that the output generated by NLsolve.jl on screen (triggered with show_trace=true) is printed by all MPI tasks simultaneously, so that the output gets messed up. See, e.g., https://travis-ci.com/github/gridap/GridapDistributed.jl/jobs/359332310#L1008.
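A simple mitigation (just a sketch; I have not checked whether it interferes with anything else in NLsolve.jl) would be to enable the trace only on MPI rank 0, so that a single task prints the iteration log:

```julia
using MPI
using Gridap

MPI.Init()  # in the actual driver, MPI is already initialized elsewhere

# Print the NLsolve.jl iteration trace from a single MPI task only.
i_am_root = MPI.Comm_rank(MPI.COMM_WORLD) == 0
nls = NLSolver(show_trace=i_am_root, method=:newton)
```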