omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Disable subnormals inside @parallel blocks #65

Closed smartalecH closed 1 year ago

smartalecH commented 1 year ago

I'm able to disable subnormals (e.g., set_zero_subnormals(true)) within @parallel_indices blocks, but not within @parallel blocks. Is there an easy way to get around this? Thanks!
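For reference, here is a rough sketch of the kind of kernel I have in mind (the kernel, variable names, and values are made up for illustration):

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

@parallel function diffusion_step!(T2, T, dt, _dx2, _dy2)
    set_zero_subnormals(true)  # adding this extra statement is what @parallel rejects
    @inn(T2) = @inn(T) + dt * (@d2_xi(T) * _dx2 + @d2_yi(T) * _dy2)
    return
end
```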

luraess commented 1 year ago

Hi @smartalecH, @parallel functions only allow macros from the FiniteDifferences{1|2|3}D submodules (or from another compatible computation submodule) by design, to ensure good performance. More advanced features and constructs currently need to be implemented in @parallel_indices functions, as you report.
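For illustration, a minimal sketch of such an @parallel_indices kernel (with made-up names; a 2-D diffusion update chosen just as an example):

```julia
@parallel_indices (ix, iy) function diffusion_step!(T2, T, dt, _dx2, _dy2)
    set_zero_subnormals(true)  # arbitrary statements are allowed here; the flag is per thread
    if 1 < ix < size(T, 1) && 1 < iy < size(T, 2)
        T2[ix, iy] = T[ix, iy] + dt * ((T[ix-1, iy] - 2T[ix, iy] + T[ix+1, iy]) * _dx2 +
                                       (T[ix, iy-1] - 2T[ix, iy] + T[ix, iy+1]) * _dy2)
    end
    return
end

# launched like any other kernel, e.g.:
# @parallel diffusion_step!(T2, T, dt, _dx2, _dy2)
```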

smartalecH commented 1 year ago

Good to know, thanks!

omlins commented 1 year ago

Hi @smartalecH, here are some additional comments to complement @luraess' answer.

First, the error message you obtain when you try to add the line set_zero_subnormals(true) to your kernel gives more information about why it does not work:

ERROR: LoadError: ArgumentError: unsupported kernel statements in @parallel kernel definition: @parallel is only applicable to kernels that contain exclusively array assignments using macros from FiniteDifferences{1|2|3}D or from another compatible computation submodule. @parallel_indices supports any kind of statements in the kernels.

Second, note that set_zero_subnormals(true) can only give a performance improvement for compute-bound code, whereas stencil codes are normally memory-bound. The heat flow code given in the performance tips of the Julia documentation only benefits from set_zero_subnormals(true) when the solved problem is so small that it fits into some fast cache (1000 Float32 values, i.e., ~4 KB in the example code; if you increase the size of a in the example, e.g., to 1000^2, you should see that set_zero_subnormals(true) no longer has any effect...).
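For illustration, here is a rough sketch along the lines of that example (the array size, step count, and initial condition are arbitrary choices, not the exact values from the docs):

```julia
function timestep!(b::Vector{T}, a::Vector{T}, fac::T) where T
    @inbounds for i in 2:length(b)-1
        b[i] = a[i] + fac * (a[i-1] - 2a[i] + a[i+1])   # explicit 1-D diffusion update
    end
end

function heatflow!(a::Vector{T}, nstep::Integer, fac::T) where T
    b = copy(a)
    for _ in 1:nstep ÷ 2
        timestep!(b, a, fac); timestep!(a, b, fac)
    end
end

n  = 1000                                                    # small enough to fit in cache; try 1000^2 to see the effect vanish
a0 = Float32[exp(-((i - n / 2) / (n / 20))^2) for i in 1:n]  # pulse whose tails lie in the Float32 subnormal range
for trial in 1:6
    a = copy(a0)
    set_zero_subnormals(iseven(trial))                       # even trials flush subnormals to zero
    @time heatflow!(a, 10_000, 0.4f0)
end
set_zero_subnormals(false)                                   # restore the default
```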

Third, note also that set_zero_subnormals does not work on the GPU.

Finally, for improving CPU performance for small problems, we have initiated a backend based on LoopVectorization. It might well be that within this effort we can add support for set_zero_subnormals, one way or another.
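To give an idea, this is the kind of plain loop that LoopVectorization's @turbo vectorizes well (a hypothetical sketch, not actual backend code, which does not exist yet):

```julia
using LoopVectorization

# Hypothetical sketch: a plain 1-D stencil loop vectorized with @turbo.
function diffusion_step!(T2, T, dt, _dx2)
    @turbo for i in 2:length(T)-1
        T2[i] = T[i] + dt * (T[i-1] - 2T[i] + T[i+1]) * _dx2
    end
    return nothing
end
```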

smartalecH commented 1 year ago

Awesome, thanks for the very thorough follow-up. I have a few responses inline (if you're interested).

Second, note that set_zero_subnormals(true) can only give a performance improvement for a compute-bound code,

I would disagree... I guess it depends on how you quantify a performance improvement. We have an FDTD code that benefited significantly from removing subnormal support. The issue was that, as the fields ramped up, each step would take an extremely long time to compute. This was problematic when trying to fine-tune simulation parameters on different hardware setups (and you only wanted to run for a few timesteps anyway).
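For context, here is a contrived microbenchmark of the kind of slowdown we were seeing (the names and numbers are made up, not taken from our code):

```julia
# Scaling values that sit in the Float32 subnormal range is much slower on many
# CPUs unless subnormals are flushed to zero.
function scale!(y::Vector{Float32}, a::Float32, nrep::Int)
    for _ in 1:nrep
        @inbounds @simd for i in eachindex(y)
            y[i] = a * y[i] + 1.0f-39   # stays in the subnormal range (unless flushed)
        end
    end
end

scale!(fill(1.0f-39, 10^5), 0.5f0, 1)                                       # force compilation
set_zero_subnormals(false); @time scale!(fill(1.0f-39, 10^5), 0.5f0, 100)   # slow path
set_zero_subnormals(true);  @time scale!(fill(1.0f-39, 10^5), 0.5f0, 100)   # flushed to zero, fast
set_zero_subnormals(false)
```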

Either way I agree that this is a rather niche improvement.

Finally, for improving CPU performance for small problems, we have initiated a backend with LoopVectorization. It might well be that within this effort, we can add support for set_zero_subnormals, one way or another.

Cool! This sounds intriguing. Do you have a PR yet that describes the proposed implementation? I'm eager to help if you're interested.

omlins commented 1 year ago

I have done some prototyping with LoopVectorization, but there is nothing concrete yet in terms of integrating it. Also, some refactoring to streamline backend addition is needed first, and the addition of an AMDGPU backend has higher priority at the moment. So it will take a while until we can make it happen...

smartalecH commented 1 year ago

first some refactoring for streamlining backend addition

Is this goal documented anywhere? I wanted to add Metal support, but I noticed that the backend processing is currently a bit ad hoc, and I also thought some refactoring would be good.

omlins commented 1 year ago

@smartalecH: sorry for the late reply, I have been on vacation. No, this is not documented, and the new backend implementations will follow different requirements and a different ideology than the original one. We will let you know as soon as it is done.

smartalecH commented 1 year ago

Thanks @omlins! Feel free to reach out if I can help in any way.