tbetcke / cise_bempp

CISE Bempp paper

Reviewer 1 #3

Open tbetcke opened 3 years ago

tbetcke commented 3 years ago

In the left column on page 6, what are "basis function multipliers" in lines 30-31?

Certain function spaces, such as the RWG spaces used for Maxwell problems, need additional multipliers that depend on the triangle. These are stored here.
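
To illustrate the idea, here is a minimal sketch (not the actual Bempp-cl data layout, and the array names are hypothetical) of per-triangle multipliers being applied to local basis function values, e.g. orientation signs for RWG-like edge functions:

```python
import numpy as np

# Hypothetical illustration: three local basis functions per triangle.
n_triangles = 100
local_basis_values = np.random.rand(n_triangles, 3)  # local basis evaluations
multipliers = np.random.choice([-1.0, 1.0], size=(n_triangles, 3))  # e.g. edge orientation signs

# The per-triangle multipliers are applied elementwise to the local values.
scaled_values = multipliers * local_basis_values
```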

tbetcke commented 3 years ago

In the left column on page 8, maybe an example of how "classic loop parallelism" works in line 33 would be useful. I was wondering how assembling the combination of one test element with all trial elements could still lead to linear complexity.

Each test triangle is assembled together with all trial triangles. We then parallelize over the test triangles, as in the sketch below. This has been clarified in the text.
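
A toy sketch of this loop structure (assuming Numba for the parallel outer loop; `interactions` is only a stand-in for the actual quadrature routine):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def assemble_dense(n_test, n_trial, interactions):
    """Outer loop over test triangles is parallelized; each iteration fills
    one row block of the dense matrix by visiting every trial triangle."""
    mat = np.zeros((n_test, n_trial))
    for i in prange(n_test):        # parallel loop over test triangles
        for j in range(n_trial):    # all trial triangles for this test triangle
            mat[i, j] = interactions[i, j]  # stand-in for the quadrature kernel
    return mat
```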

tbetcke commented 3 years ago

In the right column on page 8, why is only AVX used in line 31? As far as I can see, the processor used for the benchmarks supports AVX512, as well.

We have used the default vector width reported by the OpenCL driver. AVX-512 is not enabled by default in either the Intel OpenCL runtime or PoCL, as it can lead to slower performance than AVX2 due to the significant reduction in CPU clock rate, and it is not supported at all on AMD. We have therefore decided to only present tests with AVX2. In practice, we noticed only minor improvements with AVX-512.
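
For reference, the vector widths the driver reports can be inspected from Python; a small sketch using PyOpenCL's device info properties (this is illustrative only, not how Bempp-cl selects the width):

```python
import pyopencl as cl

# Print the preferred and native double-precision vector widths
# reported by each CPU OpenCL device.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        if device.type & cl.device_type.CPU:
            print(device.name)
            print("  preferred double width:", device.preferred_vector_width_double)
            print("  native double width:   ", device.native_vector_width_double)
```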

tbetcke commented 3 years ago

In the left column on page 11, I was surprised at the words "stack based functions" in lines 29-30. As far as I know, OpenCL assigns registers directly to variables and does not use a stack. The point still stands, of course, since registers are even more efficient than stack-based local variables, as long as there are enough of them available.

I think for the CPU this depends on the implementation and cannot automatically be assumed.

tbetcke commented 3 years ago

I suggest including references to previous work on BEM with GPUs, e.g., the paper "GCA-H² matrix compression for electrostatic simulations" by S. Börm and S. Christophersen, or the papers "Algorithmic patterns for H-matrices on many-core processors" and "A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters" by P. Zaspel.

These are excellent papers of which we are aware. We have not included these and other similar papers (e.g. the work by Rio Yokota's group on FMM and H-matrices on GPUs) since the focus of this publication is on software design for dense matrix assembly with Python and OpenCL. We are working on publications related to the coupling of Bempp-cl and Exafmm, in which these references are more suitable.

tbetcke commented 3 years ago

In the left column on page 6, what happens if the global assembled matrix does not fit into, e.g., graphics memory when using a GPU in line 33? Is it possible to split the global matrix into submatrices that fit into graphics memory, and then transfer them to main memory? Given the high computational complexity, would it be possible to overlap memory transfers and computation?

Right now, a memory exception is raised in this case and returned to the Python interpreter. Submatrix assembly and swapping would be possible. However, efficiently hiding memory transfers behind computation requires that the computations take at least as long as the transfers. In the experiments we have done with simple Laplace operators this was not the case: assembly is extremely fast and memory transfer much slower. This may be different with more complex operators such as those from Maxwell.
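
A minimal sketch of the submatrix idea (assuming a hypothetical device-side routine `assemble_block(start, stop)`; this is not Bempp-cl's API):

```python
import numpy as np

def assemble_in_blocks(n_test, n_trial, assemble_block, block_rows=1024):
    """Assemble the dense matrix in row blocks that fit into device memory,
    copying each finished block back into a preallocated host matrix."""
    result = np.empty((n_test, n_trial))
    for start in range(0, n_test, block_rows):
        stop = min(start + block_rows, n_test)
        result[start:stop, :] = assemble_block(start, stop)  # device -> host copy
    return result
```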

The reason why we have not focused further on this is that it has limited relevance for practical applications. On modern CPUs, the assembly of everything that can feasibly be done densely is fast enough. For anything larger we want to use FMM or H/H² matrix techniques.