tbetcke / cise_bempp

CISE Bempp paper

Reviewer 2 #4

Open tbetcke opened 3 years ago

tbetcke commented 3 years ago

"We could tune our code for better auto-vectorization in Numba, but this would give little benefit as we have highly optimized hand-tuned OpenCL kernels already." To me this statement casts a shadow on the interpretation of the benchmarking results, since it may not be "fair" to compare highly optimized OpenCL kernels to less well-optimized Numba kernels. I do understand that Numba is meant to be a backup implementation in Bempp, but then the authors should be careful to put these results in context. I ask that they provide some additional discussion on the "fairness" of this comparison and how the results should be interpreted.

This is a fair point and we have emphasised it more in the text. On the last page we have added the following passage:

We need to stress that we have performed very few optimisations specific to Numba, while significant optimisation has gone into the OpenCL codes. It is therefore entirely possible that the performance gap between Numba and OpenCL can be significantly reduced. However, our own anecdotal experience from other projects is that the more Numba code is optimised, the less Pythonic and the more C-like the functions become. So while Numba is a very powerful tool, it requires its own optimisation techniques, which differ from those for standard Python code.
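As a hypothetical illustration of that last point (these functions are not taken from Bempp), compare a "Pythonic" Numba kernel with a tuned variant of the same computation:

```python
import numpy as np
from numba import njit


@njit
def dot_pythonic(x, y):
    # Reads like ordinary NumPy code and compiles as-is.
    return np.sum(x * y)


@njit(fastmath=True)
def dot_tuned(x, y):
    # Optimisation-minded style: explicit loop, scalar accumulator, fastmath.
    # Structurally this is much closer to C than to idiomatic Python.
    acc = 0.0
    for i in range(x.shape[0]):
        acc += x[i] * y[i]
    return acc
```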

tbetcke commented 3 years ago

Since Numba provides support for both CUDA and ROCm, the authors should explicitly state that they are using Numba CUDA in the abstract and introduction. It is clear from the context, since they are using an NVIDIA GPU, but they need to say this to avoid confusion.

Just to clarify: we are not using the GPU functionality of Numba. All our Numba code is CPU code. We have added a corresponding comment to the Numba Assembly section. We have considered a Numba CUDA backend. However, this would currently bring no advantage over the existing OpenCL backend for GPUs, since our OpenCL kernels already run on Nvidia and AMD ROCm devices.
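To make the distinction concrete, here is a small sketch (not Bempp code) of the CPU compilation path we mean, contrasted with the Numba CUDA path we do not use:

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def scale_cpu(x, alpha):
    # Plain CPU code; Numba parallelises the loop over the available cores.
    out = np.empty_like(x)
    for i in prange(x.shape[0]):
        out[i] = alpha * x[i]
    return out


# The separate GPU path, which Bempp does not use, would instead be written as
#   from numba import cuda
#   @cuda.jit
#   def scale_gpu(x, alpha, out): ...
```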

tbetcke commented 3 years ago

Having briefly experimented with PyOpenCL/OpenCL myself, I was surprised at the authors' characterization that using it was easy. I struggled to get it configured properly with my NVIDIA vendor driver, for example. Maybe the authors did not have this experience but since OpenCL has a general reputation for being harder to use than CUDA, it would be useful for the authors to acknowledge this and state whether their experience was similar or different.

There are two things to distinguish here: 1) installation difficulties of OpenCL and 2) the complexity of using OpenCL. With regards to installation: I am using a Linux laptop with the Nvidia closed-source drivers and have never had any issues. The only small caveat is that if PyOpenCL is installed from conda, one needs to symlink the Nvidia ICD from /etc/OpenCL/vendors to the corresponding /etc/ directory inside the virtual environment so that PyOpenCL can see it. On Windows I also did not encounter issues; indeed, we have users who have used Bempp on Windows with an Nvidia card.
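As a quick sanity check (a generic PyOpenCL snippet, not part of Bempp), one can list the platforms and devices that PyOpenCL discovers after installation; if the ICD file is not visible, the Nvidia platform simply does not appear in this list:

```python
import pyopencl as cl

# Enumerate every OpenCL platform and device that PyOpenCL can see.
for platform in cl.get_platforms():
    print("Platform:", platform.name)
    for device in platform.get_devices():
        print("  Device:", device.name)
```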

With regards to the complexity of OpenCL itself, this is where the big advantages of Python bindings come in. While the OpenCL C library is very verbose, the PyOpenCL bindings automate most of the work and allow kernels to be launched with just a few lines of code. Since we only ever used OpenCL from Python, the verbosity of the C interface to OpenCL was never an issue for us.
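For context, a minimal PyOpenCL kernel launch looks roughly like the following (a generic sketch, not one of Bempp's assembly kernels):

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

# PyOpenCL hides the verbose platform/device/context boilerplate of the C API.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void axpy(const float alpha,
                   __global const float *x,
                   __global float *y) {
    int gid = get_global_id(0);
    y[gid] += alpha * x[gid];
}
"""
prg = cl.Program(ctx, kernel_src).build()

x = cl_array.to_device(queue, np.arange(16, dtype=np.float32))
y = cl_array.zeros(queue, 16, dtype=np.float32)

# Kernel launch: queue, global size, local size, then the kernel arguments.
prg.axpy(queue, x.shape, None, np.float32(2.0), x.data, y.data)
print(y.get())
```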

Debugging OpenCL can be more of a problem, but I am quite used to debugging kernels with simple printf statements, and over time one becomes quite proficient at it.
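As an illustration of that workflow (again a toy example rather than Bempp code), a printf placed inside a kernel and launched through PyOpenCL is often all that is needed:

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# A toy kernel with a printf statement; its output appears on the host
# terminal, which is often enough to track down indexing mistakes.
debug_src = """
__kernel void inspect(__global const float *x) {
    int gid = get_global_id(0);
    if (gid < 4)
        printf("work-item %d sees x[%d] = %f\\n", gid, gid, x[gid]);
}
"""
prg = cl.Program(ctx, debug_src).build()

x = np.linspace(0.0, 1.0, 16, dtype=np.float32)
x_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                  hostbuf=x)
prg.inspect(queue, x.shape, None, x_buf)
queue.finish()
```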

tbetcke commented 3 years ago

This is a more minor comment but since OpenCL was chosen for portability, I would have been very interested to see benchmarking results on AMD GPUs too. If this is future work it is worth mentioning.

We do not have access to modern AMD hardware. I have a workstation with two AMD Polaris-generation GPUs, but my Nvidia card is a modern Quadro RTX 3000, which in all our tests significantly outperforms the older AMD Polaris generation. It would therefore be unfair to present such comparison benchmarks here; we would mainly be demonstrating the age difference between GPU generations from different vendors.

We are working on a dedicated paper on the OpenCL algorithms, which will contain much more detailed benchmarks on different cards with respect to their peak performance. The focus of the current paper is on demonstrating the potential of mixed Python/OpenCL development rather than on detailed performance benchmarking of specific algorithms.