
Review Comments - Associate Editor #2

Open tbetcke opened 3 years ago

tbetcke commented 3 years ago

The technique described by the authors seems to entail a substantial duplication of code that might be avoidable using certain techniques. What trade-offs led the authors to their approach?

The main code duplication is within the OpenCL kernels. Some of it was mitigated through inline functions defined in header files and extensive use of #define macros, but this does not avoid all duplication. In the end, we decided that this was a price worth paying for being able to encapsulate expensive computational routines in fast OpenCL kernels.
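To illustrate the #define approach, here is a minimal sketch of how a single OpenCL source string can be specialised into several kernels via PyOpenCL build options. The kernel name, macro, and integrands are purely illustrative, not Bempp-cl's actual code:

```python
import numpy as np
import pyopencl as cl

# One illustrative kernel source, specialised at build time via -D options.
KERNEL_SRC = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void assemble(__global const double *x, __global double *result) {
    int i = get_global_id(0);
#ifdef HELMHOLTZ
    result[i] = cos(x[i]) / x[i];   /* oscillatory kernel */
#else
    result[i] = 1.0 / x[i];         /* Laplace-type kernel */
#endif
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Two specialisations compiled from the same source string.
laplace = cl.Program(ctx, KERNEL_SRC).build()
helmholtz = cl.Program(ctx, KERNEL_SRC).build(options=["-DHELMHOLTZ"])

x = np.linspace(1.0, 2.0, 64)
result = np.empty_like(x)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
r_buf = cl.Buffer(ctx, mf.WRITE_ONLY, result.nbytes)

helmholtz.assemble(queue, x.shape, None, x_buf, r_buf)
cl.enqueue_copy(queue, result, r_buf)
```

This keeps the shared quadrature and indexing logic in one place while the compiler generates a distinct optimised binary per operator.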

The redevelopment of the kernels in Numba came after OpenCL. It was mainly driven by curiosity about how well Numba could perform, but also by the wish to give an alternative to users who do not have a well-working OpenCL stack on their system (especially on Windows and Mac).

tbetcke commented 3 years ago

Modern FEM codes tend to employ atomic operations instead of graph coloring for improved performance. (Atomic FP operations, also in 64 bit, can be realized via the classical compare-and-swap technique.)

We are using the OpenCL 1.2 standard, whose core atomics provide compare-and-swap only for 32-bit values; 64-bit atomics require the cl_khr_int64_base_atomics extension. Otherwise, I agree that atomics are preferable.
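For reference, here is a sketch of the classical 32-bit compare-and-swap emulation the reviewer alludes to, wrapped in a minimal PyOpenCL harness (the function and kernel names are ours, not Bempp-cl's):

```python
import numpy as np
import pyopencl as cl

# Emulated float32 atomic add: OpenCL 1.2 core only guarantees 32-bit integer
# atomics (atomic_cmpxchg), so the float is reinterpreted as an unsigned int
# for the swap and the update is retried until no other thread intervened.
SRC = """
inline void atomic_add_f32(volatile __global float *addr, float val) {
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                            old_val.u, new_val.u) != old_val.u);
}

__kernel void accumulate(__global const float *vals,
                         volatile __global float *acc) {
    atomic_add_f32(acc, vals[get_global_id(0)]);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, SRC).build()

vals = np.ones(1024, dtype=np.float32)
acc = np.zeros(1, dtype=np.float32)
mf = cl.mem_flags
vals_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=vals)
acc_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=acc)

prg.accumulate(queue, vals.shape, None, vals_buf, acc_buf)
cl.enqueue_copy(queue, acc, acc_buf)  # acc[0] == 1024.0
```

The same trick for 64-bit floats needs atom_cmpxchg from the extension mentioned above, which is exactly why it is not available to us in core OpenCL 1.2.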

tbetcke commented 3 years ago

The authors state that SIMD vectorization is not used for the evaluation of singular integrals, on account of the far field representing the lion's share of the work. Under FMM acceleration, near-field and far-field should take approximately equal time. What were the obstacles to applying the optimization in this setting?

In the preprint https://arxiv.org/pdf/2103.01048.pdf we discuss the combination of Bempp-cl and Exafmm for large Poisson-Boltzmann problems; it contains figures on combined assembly and solver times, and the singular integrals were not a significant bottleneck there. We are currently preparing a more detailed paper on fast solver integration into Bempp, where this will be discussed further. Should the singular integrals become a bottleneck for certain problem sizes, we can either optimise them manually or shift the work onto the GPU without much effort. One note though: we do not SIMD-optimise by hand, but due to the structure of our singular integration routines the compiler may be able to exploit some automatic vectorisation potential.
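As an illustration of the kind of structure we mean, a hypothetical quadrature loop with a fixed trip count and a branch-free body is the pattern that LLVM-based compilers (including Numba's backend) can vectorise automatically:

```python
import numpy as np
from numba import njit

NQUAD = 16  # illustrative number of quadrature points, known at compile time

# Hypothetical inner loop of a singular quadrature rule. The fixed-length,
# branch-free reduction gives the compiler a chance to auto-vectorise it
# without any hand-written SIMD intrinsics.
@njit
def singular_contribution(weights, kernel_vals):
    acc = 0.0
    for q in range(NQUAD):
        acc += weights[q] * kernel_vals[q]
    return acc

w = np.random.rand(NQUAD)
k = np.random.rand(NQUAD)
print(singular_contribution(w, k))
```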

tbetcke commented 3 years ago

The comment on lower FP64 throughput on Nvidia hardware should be removed or rephrased as it only applies to gaming-targeted consumer hardware.

Many of our users don't have access to HPC-centre-type accelerators, so this is a relevant observation for them. We have, however, clarified in the text that slow double-precision performance is not an issue on data-centre accelerators.

tbetcke commented 3 years ago

The authors describe repeated host-device transfers as a particular performance bottleneck. These are likely avoidable if the data is kept on the device throughout. Why was this route not chosen? This question is particularly salient because it directly influences the authors' disqualification of GPUs for direct evaluation tasks.

The reason is problem size versus available memory on most accelerators. Dense discretisations require significant amounts of memory; barring data-centre-type accelerators, most cards do not have enough RAM to hold dense matrices for interesting problem sizes. Moreover, many practical computations need not just a single matrix but several (e.g. transmission problems), quickly eating up the RAM of all but the most expensive accelerators.
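To make the memory argument concrete, here is a back-of-envelope count, assuming a dense complex double-precision matrix (16 bytes per entry):

```python
# Device-memory footprint of one dense n x n complex128 BEM matrix.
for n in (50_000, 100_000, 200_000):
    gib = 16 * n**2 / 2**30
    print(f"n = {n:>7,}: {gib:8.1f} GiB")
# n =  50,000:     37.3 GiB
# n = 100,000:    149.0 GiB   (already beyond any consumer GPU)
# n = 200,000:    596.0 GiB
```

A transmission problem requiring several such operators multiplies these numbers accordingly.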

But keeping data on the device is an option that is likely coming soon as part of our push to optimise Bempp-cl for cluster computing on HPC systems with MPI.

tbetcke commented 3 years ago

Numba is described as a fallback technology, but no particular scenario is given in which the fallback would be required. The installation instructions for PyOpenCL (https://documen.tician.de/pyopencl/misc.html) appear to contain instructions to deploy a full OpenCL stack on nearly any machine.

Numba has two uses in our library, one essential and one as a fallback. The essential use is for all O(n) routines that involve iterating over the grid; here we make extensive use of Numba. For operator assembly it started as an experiment to see how a simple Numba implementation fares against OpenCL. The decision to keep it in the code was made mainly for Mac users: on Mac we frequently encountered crashes in POCL, and the Apple CPU OpenCL implementation proved problematic (workgroup sizes of more than one item caused problems for us on the CPU with the Apple OpenCL driver). We do not have Macs available to properly sort out these problems, and users are often reluctant to use our Docker images, as they make it harder to modify the code and link against external libraries (it is certainly possible, but most users don't have the necessary Docker knowledge). Hence, Mac users can fall back on Numba quite easily. On Windows this is less relevant, as the Intel OpenCL CPU driver for Windows works well.
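To give a flavour of the essential use, here is a minimal sketch (not Bempp-cl's actual routine) of the kind of O(n) grid iteration we delegate to Numba:

```python
import numpy as np
from numba import njit, prange

# Illustrative grid pass: compute all triangle areas in one O(n) sweep,
# trivially parallel across elements. vertices is (nvertices, 3) float64,
# elements is (nelements, 3) int -- a standard triangular surface mesh.
@njit(parallel=True)
def triangle_areas(vertices, elements):
    areas = np.empty(elements.shape[0])
    for i in prange(elements.shape[0]):
        v0 = vertices[elements[i, 0]]
        v1 = vertices[elements[i, 1]]
        v2 = vertices[elements[i, 2]]
        e1 = v1 - v0
        e2 = v2 - v0
        # Cross product written out by hand to stay in nopython mode.
        cx = e1[1] * e2[2] - e1[2] * e2[1]
        cy = e1[2] * e2[0] - e1[0] * e2[2]
        cz = e1[0] * e2[1] - e1[1] * e2[0]
        areas[i] = 0.5 * np.sqrt(cx * cx + cy * cy + cz * cz)
    return areas
```

Such routines are pure Python from the user's perspective and need no OpenCL driver at all, which is what makes the fallback viable.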

tbetcke commented 3 years ago

When the authors characterize OpenCL as a "second language", to what extent is this second language necessitated by the GPU as a "second environment"?

Good point. Personally, I like how libraries such as Numba-CUDA hide the complexity of the second device, but this certainly comes with its own drawbacks.
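For instance, a minimal Numba-CUDA kernel stays entirely in Python; the example below (illustrative, not from our code) relies on Numba's implicit host-device transfers:

```python
import numpy as np
from numba import cuda

# The kernel is an ordinary decorated Python function; the "second
# environment" is hidden behind the launch syntax.
@cuda.jit
def scale(x, alpha):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= alpha

x = np.arange(1024.0)
scale[4, 256](x, 2.0)  # Numba copies x to the device and back automatically
```

The drawback is the flip side of the convenience: the implicit transfers in the launch above are exactly the kind of hidden cost that explicit host code makes visible.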

tbetcke commented 3 years ago

Roofline plots for cost/performance.

A dedicated paper on the dense OpenCL algorithms is being planned. It will contain much more detailed performance comparisons than were possible within the format and time constraints of the special issue.