Open tbetcke opened 3 years ago
In the left column on page 6, what are "basis function multipliers" in lines 30-31?
Certain function spaces such as RWG spaces in Maxwell need additional multipliers that depend on the triangle. These are stored here.
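To illustrate, here is a minimal sketch of what such per-triangle multipliers could look like for an RWG-style space. The function name and layout are hypothetical, not the Bempp-cl API; it assumes the common convention that each edge-based basis function carries an edge-length factor with an orientation sign, precomputed once per triangle and passed to the assembly kernel:

```python
import numpy as np

def compute_multipliers(vertices, triangles, signs):
    """Return an (n_triangles, 3) array of edge-length multipliers,
    one per local basis function (edge opposite each local vertex)."""
    mults = np.empty((len(triangles), 3))
    for t, tri in enumerate(triangles):
        for local in range(3):
            # Edge opposite local vertex `local`.
            v1 = vertices[tri[(local + 1) % 3]]
            v2 = vertices[tri[(local + 2) % 3]]
            mults[t, local] = signs[t, local] * np.linalg.norm(v2 - v1)
    return mults

# One reference triangle with unit legs.
vertices = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
triangles = np.array([[0, 1, 2]])
signs = np.ones((1, 3))
print(compute_multipliers(vertices, triangles, signs))
```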
In the left column on page 8, maybe an example of how "classic loop parallelism" works in line 33 would be useful. I was wondering how assembling the combination of one test element with all trial elements could still lead to linear complexity.
Each test triangle is assembled together with all trial triangles. We then parallelize over the test triangles. This has been clarified in the text.
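A toy sketch of this pattern (illustrative only; the kernel and names are made up, not Bempp-cl code): each task assembles one test triangle against all trial triangles, i.e. one matrix row, and the outer loop over test triangles is what gets parallelized. Each task does O(n_trial) work, so with n_test tasks the total work stays O(n_test * n_trial), with no per-pair overhead growth:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def assemble_row(test_idx, test_centres, trial_centres):
    # Toy "kernel": 1 / (1 + distance) between element centres.
    d = np.linalg.norm(trial_centres - test_centres[test_idx], axis=1)
    return 1.0 / (1.0 + d)

def assemble(test_centres, trial_centres):
    # Classic loop parallelism: distribute the outer (test) loop
    # across workers; each worker fills a complete row.
    with ThreadPoolExecutor() as pool:
        rows = pool.map(
            lambda i: assemble_row(i, test_centres, trial_centres),
            range(len(test_centres)))
    return np.vstack(list(rows))

rng = np.random.default_rng(0)
A = assemble(rng.random((8, 3)), rng.random((10, 3)))
print(A.shape)  # (8, 10)
```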
In the right column on page 8, why is only AVX used in line 31? As far as I can see, the processor used for the benchmarks supports AVX512, as well.
We have used the default vector width from the OpenCL driver. AVX-512 is not enabled by default in either the Intel or the PoCL drivers, as it can lead to slower performance than AVX2 due to the significant reduction in CPU clock rate. It is also not supported at all on AMD. We have therefore decided to present tests with AVX2 only. In practice, we noticed only minor improvements with AVX-512.
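For concreteness, here is a small self-contained sketch of how trial elements get grouped to the SIMD vector width the driver reports (4 doubles per 256-bit AVX2 register, 8 per 512-bit AVX-512 register). The function is illustrative, not the Bempp-cl implementation:

```python
def simd_batches(n_trial, vector_width):
    """Group trial element indices into SIMD-width chunks,
    padding the final chunk with -1 so every lane has work."""
    batches = []
    for start in range(0, n_trial, vector_width):
        chunk = list(range(start, min(start + vector_width, n_trial)))
        chunk += [-1] * (vector_width - len(chunk))  # padded lanes
        batches.append(chunk)
    return batches

print(simd_batches(10, 4))  # AVX2: 3 batches, last one padded
print(simd_batches(10, 8))  # AVX-512: 2 batches, more padding
```

Wider registers mean fewer batches but more wasted padded lanes on small problems, one more reason the wider ISA does not automatically win.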
In the left column on page 11, I was surprised at the words "stack based functions" in lines 29-30. As far as I know, OpenCL assigns registers directly to variables and does not use a stack. The point still stands, of course, since registers are even more efficient than stack-based local variables, as long as there are enough of them available.
I think for the CPU this depends on the implementation and cannot automatically be assumed.
I suggest to include references to previous work on BEM with GPUs, e.g., the paper "GCA-H² matrix compression for electrostatic simulations" by S. Börm and S. Christophersen or the papers "Algorithmic patterns for H-matrices on many-core processors" and "A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters" by P. Zaspel.
These are excellent papers of which we are aware. We have not included these and other similar papers (e.g. the work by Rio Yokota's group on FMM and H-matrices on GPUs) since the focus of this publication is on software design for dense matrix assembly with Python and OpenCL. We are working on publications related to the coupling of Bempp-cl and ExaFMM, in which these references are more suitable.
In the left column on page 6, what happens if the global assembled matrix does not fit into, e.g., graphics memory when using a GPU in line 33? Is it possible to split the global matrix into submatrices that fit into graphics memory, and then transfer them to main memory? Given the high computational complexity, would it be possible to overlap memory transfers and computation?
Right now, in this case a memory exception is raised and returned to the Python interpreter. Submatrix assembly and swapping would be possible. However, efficiently hiding memory transfers behind computations requires that the computations take at least as long as the transfers. In the experiments we have done with simple Laplace operators, this was not the case: assembly is blazingly fast and memory transfer much slower. This may be different with more complex operators such as those from Maxwell.
The reason why we have not focused further on this is that it has little relevance for practical applications. On modern CPUs, the assembly of everything that can feasibly be done densely is fast enough. For anything larger we then want to use FMM or H/H^2-matrix techniques.
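The submatrix idea mentioned above could look roughly like the following sketch (names and the toy block function are hypothetical, not Bempp-cl code): split the dense matrix into row blocks that respect a device-memory budget, assemble each block on the device, and copy it back before starting the next:

```python
import numpy as np

def assemble_in_blocks(n_test, n_trial, budget_bytes, assemble_block):
    """Assemble a dense (n_test x n_trial) float64 matrix in row
    blocks that each fit within `budget_bytes` of device memory."""
    bytes_per_row = n_trial * 8  # float64 entries
    rows_per_block = max(1, budget_bytes // bytes_per_row)
    result = np.empty((n_test, n_trial))
    for start in range(0, n_test, rows_per_block):
        stop = min(start + rows_per_block, n_test)
        # In a real GPU pipeline this would launch the OpenCL kernel
        # for rows [start, stop) and copy the block back to the host.
        result[start:stop] = assemble_block(start, stop)
    return result

def toy_block(start, stop):
    return np.ones((stop - start, 100))

A = assemble_in_blocks(1000, 100, budget_bytes=80_000,
                       assemble_block=toy_block)
print(A.shape)  # (1000, 100)
```

Overlapping the copy-back with the next block's assembly (double buffering) would only pay off when assembly dominates the transfer time, which, as noted above, was not the case for the simple Laplace kernels we tested.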
[x] I suggest to include references to previous work on BEM with GPUs, e.g., the paper "GCA-H² matrix compression for electrostatic simulations" by S. Börm and S. Christophersen or the papers "Algorithmic patterns for H-matrices on many-core processors" and "A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters" by P. Zaspel.
[x] In the left column on page 3, I would expect "sets" instead of "set" in line 26.
[x] In the left column on page 3, the integral in line 50 should probably be defined on \tau_i and \tau_j instead of \Gamma.
[x] In the right column on page 3, may I suggest "the grid, the spaces, and the operator(s)" in line 10?
[x] In the right column on page 3, lines 17 to 20 appear to work only if the support of the basis functions consists only of \tau_i and \tau_j, since only quadrature points from these triangles are used. For more general basis functions, A_{ij} would only be an entry of the element stiffness matrix.
[x] In the right column on page 3, I suggest to include references for the "singularity-removing coordinate transformations" mentioned in line 27.
[x] In the left column on page 4, I suggest "dive deep" instead of "deep dive" in line 30.
[x] In the right column on page 4, I suggest "perform as efficiently as possible" instead of "perform as highly as possible" in line 27.
[x] In the left column on page 5, it should be "relationships" instead of "relataionships" in line 11.
[x] In the left column on page 5, I suggest to clarify that OpenCL kernels can not be written in C99 or C++, since important language features like recursion or the standard libraries are missing, while features like vector types have been added. It would be probably better to speak of OpenCL C.
[x] In the right column on page 5, may I suggest "from the host" instead of "from host" in line 21?
[x] In the left column on page 6, I did not understand what "test and trial connectivity information that stores the indices of each node for each triangle" means in lines 25-27. Do the nodes correspond to the basis functions? Is each triangle assigned a global index for each local index? I suggest to explain this aspect in greater detail.
[x] In the left column on page 6, what are "basis function multipliers" in lines 30-31?
[x] In the left column on page 6, what happens if the global assembled matrix does not fit into, e.g., graphics memory when using a GPU in line 33? Is it possible to split the global matrix into submatrices that fit into graphics memory, and then transfer them to main memory? Given the high computational complexity, would it be possible to overlap memory transfers and computation?
[x] In the left column on page 8, may I suggest "coloring techniques" instead of "color coding techniques" in line 6?
[x] In the left column on page 8, maybe an example of how "classic loop parallelism" works in line 33 would be useful. I was wondering how assembling the combination of one test element with all trial elements could still lead to linear complexity.
[x] In the right column on page 8, why is only AVX used in line 31? As far as I can see, the processor used for the benchmarks supports AVX512, as well.
[x] In the left column on page 11, I was surprised at the words "stack based functions" in lines 29-30. As far as I know, OpenCL assigns registers directly to variables and does not use a stack. The point still stands, of course, since registers are even more efficient than stack-based local variables, as long as there are enough of them available.