In this issue, I summarize the lessons that I learned while developing MPIPETScDistributePLaplacianTests.jl, a parallel MPI program for the solution of the p-Laplacian nonlinear PDE. As agreed, the effort has been put into leveraging the NLsolve.jl Julia package. In the development process, as a result of marathon debugging sessions, I was able to explore quite deeply into the call path within this package that is triggered when setting up the nonlinear solver in Gridap.jl as nls = NLSolver(show_trace=true, method=:newton). All observations refer to the current version of NLsolve.jl, which at the time of writing is v4.4.0.
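For reference, the following is a minimal sketch (not the actual driver code) of how this solver object is plugged into a Gridap.jl driver; op is assumed to be the FEOperator encoding the p-Laplacian residual and Jacobian, whose construction is omitted here:

```julia
using Gridap

# Sketch only: `op` is assumed to be the FEOperator for the p-Laplacian problem.
nls = NLSolver(show_trace=true, method=:newton)  # keyword args are forwarded to NLsolve.jl
solver = FESolver(nls)
uh = solve(solver, op)  # triggers the NLsolve.jl call path discussed below
```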
Although I did not explore NLsolve.jl in full, and thus the observations below may not necessarily apply to other parts of the package, I am not sure whether this package has been carefully developed with HPC/performance in mind, or whether it is rather thought of as a toy/academic software effort. In particular, I have the following concerns:
It performs a deep copy of the Jacobian matrix and residual vector in its own data structures right at the beginning of the nonlinear problem solution process. See here.
Depending on the combination of input parameters, the Newton solver ends up performing unnecessary, costly computations. See this issue that I opened in NLsolve.jl. This is particularly severe with PETSc.jl as, currently, transpose(A) does not return a lazy transpose of A, but explicitly builds the transposed matrix. (This can be worked around in PETSc.jl, though; see the sketch after this list of concerns.)
Perhaps with fault tolerance in mind, the Newton solver performs, at each iteration, an O(n) check on the current iterate in search of NaNs. See here. It would be great if this sort of check could be optionally activated/deactivated.
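Regarding the transpose(A) point above, a possible workaround in PETSc.jl would be to return Julia's lazy LinearAlgebra.Transpose wrapper instead of eagerly assembling the transposed matrix, and to specialize products on the wrapper so that they call PETSc's transpose matrix-vector kernel. The following is only a sketch of the idea (I did not check the exact name of the low-level binding, so that part is left as a comment):

```julia
using LinearAlgebra
using PETSc

# Sketch: make transpose(A) an O(1) operation by returning the lazy wrapper
# instead of explicitly building the transposed PETSc matrix.
Base.transpose(A::PETSc.Mat{Float64}) = LinearAlgebra.Transpose(A)

# To complete the workaround, products involving the wrapper, e.g.,
#   LinearAlgebra.mul!(y, At::Transpose{Float64,<:PETSc.Mat{Float64}}, x)
# would be specialized to apply PETSc's transpose matrix-vector product
# (MatMultTranspose in the C API) to parent(At), so that Aᵀ is never formed.
```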
Apart from these concerns, NLsolve.jl makes extensive use of generic Julia interfaces, mainly the AbstractArray and broadcasting interfaces. While some of the methods in these interfaces are already efficiently supported by PETSc.jl, others are either not supported at all or not supported efficiently by the generic AbstractArray fallbacks. In order to have MPIPETScDistributePLaplacianTests.jl working, I had to overload a number of different methods for PETSc.Vec{Float64}. In the sequel, I enumerate what I had to overload and why; a simplified sketch of some of these overloads is given right after the list. At present, these method definitions are written at the driver level, but obviously we will have to move them to the appropriate packages. As I mentioned earlier, I did not go through the whole NLsolve.jl package, so it is likely that other methods would have to be defined as well if we switch to other solvers in NLsolve.jl.
The minimal broadcasting-related methods required to efficiently support this line. See lines start-end.
Defined in Base module: copyto!(...), maximum(...), and any(...) for PETSc.Vec{Float64} vectors. Required to support this, this, and this line, resp.
Defined in LinearAlgebra module: mul! and rmul!. Required to support this, and this line, resp.
Defined in NLSolversBase module: x_of_nans. Required to support this line. I think this particular line could be supported by extending the broadcast machinery above with op == Number; see the following line.
Defined in NLsolve module: check_isfinite. Required to support this line.
Defined in Distances module: SqEuclidean. Required to support this line. See also here.
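To give a flavour of what these overloads look like, below is a self-contained, simplified sketch of some of them, written against a toy MPIVec stand-in (locally owned entries plus an MPI communicator) rather than the actual PETSc.Vec{Float64}; the real definitions in the driver go through the PETSc.jl API instead. The key point is that reductions such as maximum and any must combine the local results with an MPI collective, which the generic AbstractArray fallbacks cannot do efficiently:

```julia
using MPI
using LinearAlgebra
import NLSolversBase, NLsolve

# Toy stand-in for PETSc.Vec{Float64}: locally owned entries plus the communicator.
struct MPIVec
  data::Vector{Float64}
  comm::MPI.Comm
end

# Base methods used by NLsolve.jl: local computation followed by a global MPI reduction.
Base.copyto!(dst::MPIVec, src::MPIVec) = (copyto!(dst.data, src.data); dst)
Base.maximum(f, v::MPIVec) = MPI.Allreduce(maximum(f, v.data), MPI.MAX, v.comm)
Base.any(f, v::MPIVec) = MPI.Allreduce(any(f, v.data) ? 1 : 0, MPI.MAX, v.comm) == 1

# LinearAlgebra methods: in-place scaling acts on the local part only
# (mul! with the distributed Jacobian is analogous and delegates to the matrix type).
LinearAlgebra.rmul!(v::MPIVec, a::Number) = (rmul!(v.data, a); v)

# NLSolversBase/NLsolve internals that also need distributed-aware definitions.
NLSolversBase.x_of_nans(v::MPIVec) = MPIVec(fill(NaN, length(v.data)), v.comm)
function NLsolve.check_isfinite(v::MPIVec)
  # NLsolve.jl throws its own exception type here; a plain error suffices for the sketch.
  any(!isfinite, v) && error("non-finite entries detected in the current iterate")
  nothing
end
```

The SqEuclidean evaluation required by Distances.jl follows the same pattern: a reduction over the local entries combined with an Allreduce over the communicator.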
Finally, I also have the following concerns:
Once the Jacobian matrix has been built for the first time, further Jacobian assembly operations are such that cell matrix entries are assembled into the global matrix on a one-by-one basis; see here and here. The same observation applies to residual assembly. I think that this may hurt performance when dealing with PETSc matrices and vectors (I did not evaluate how severe this actually is). The highest-granularity function in the PETSc API lets one add a 2D dense matrix (typically a cell matrix) into the global matrix in one shot, so it would be more performant to inject the whole cell matrix at once; see the sketch below. Should we modify Gridap.jl such that the granularity of these operations is increased?
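To make the granularity point concrete, the following schematic contrasts the two insertion patterns (this is not actual Gridap.jl/PETSc.jl code; add_entry! and add_cell_matrix! are hypothetical names used only for illustration). In the one-shot variant, the whole cell matrix would map onto a single MatSetValues call with ADD_VALUES in the PETSc API:

```julia
# Schematic only: `A` is the global (distributed) matrix, `cellmat` the dense cell
# matrix, and `rows`/`cols` the global DOF ids of the current cell.
# `add_entry!` and `add_cell_matrix!` are hypothetical names used for illustration.

# Current pattern: one insertion call per cell-matrix entry.
function assemble_cell_entry_by_entry!(A, cellmat::Matrix{Float64}, rows, cols)
  for (j, gj) in enumerate(cols), (i, gi) in enumerate(rows)
    add_entry!(A, cellmat[i, j], gi, gj)  # for PETSc, one MatSetValues call per entry
  end
  A
end

# Higher-granularity pattern: inject the whole cell matrix in one shot.
function assemble_cell_in_one_shot!(A, cellmat::Matrix{Float64}, rows, cols)
  add_cell_matrix!(A, cellmat, rows, cols)  # for PETSc, a single MatSetValues call
  A
end
```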
A second concern is that the output generated by NLsolve.jl on screen (triggered with show_trace=true) is printed by all MPI tasks simultaneously, so that the output gets messed up. See, e.g., https://travis-ci.com/github/gridap/GridapDistributed.jl/jobs/359332310#L1008.
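A simple mitigation (just a sketch; I have not checked whether it interferes with anything else in NLsolve.jl) would be to enable the trace only on MPI rank 0, so that a single task prints the iteration log:

```julia
using MPI
using Gridap

MPI.Init()  # in the actual driver, MPI is already initialized elsewhere

# Print the NLsolve.jl iteration trace from a single MPI task only.
i_am_root = MPI.Comm_rank(MPI.COMM_WORLD) == 0
nls = NLSolver(show_trace=i_am_root, method=:newton)
```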