lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Make 1024 the default launch bounds for AMD GPUs #1462

Closed dmcdougall closed 2 months ago

dmcdougall commented 2 months ago

This matches the current default used by the rocm compiler, and it fixes an issue where gauge_alg_double was failing: the tuner picked an invalid launch configuration that satisfied the device maximum threads per block but violated the 256-thread launch bounds attribute.
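For readers unfamiliar with the attribute in question, a minimal sketch of what the PR conceptually changes; the constant name is illustrative, not QUDA's actual identifier:

```cpp
// Hypothetical sketch: the default launch bounds promise attached to
// kernels on AMD builds is raised from 256 to 1024 threads per block,
// matching the rocm compiler's own default.
constexpr unsigned max_block_size = 1024; // previously 256 on AMD builds

template <typename Arg>
__global__ void __launch_bounds__(max_block_size) generic_kernel(Arg arg)
{
  // kernel body elided
}
```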

dmcdougall commented 2 months ago

This makes the gauge_alg_double test pass for me.

bjoo commented 2 months ago

I think at one point we had a way to add launch bounds to specific kernels? I believe we originally dropped to the 256-threads-per-block launch bounds to avoid some register-spilling issues. I don't remember the exact details, but with the default of 1024 threads per block, didn't we have an issue of running out of registers, which was not just a performance issue?

Would it be possible to add these new bounds to specific kernels that need them?

dmcdougall commented 2 months ago

I think at one point we had a way to add launch bounds to specific kernels?

That's what the template tag appears to be for: to specialise on a particular kernel.

I believe we originally dropped to the 256-threads-per-block launch bounds to avoid some register-spilling issues.

Yep, I think that makes sense. If you promise the compiler that you won't use more than 256 threads per block, for example, it can use that information when optimising, because you're making a statement about worst-case register usage. If you don't specify launch bounds, the compiler assumes the worst case of 1024 threads per block, and for register-hungry kernels it may be forced to spill even if you never actually launch that many threads per block at runtime.
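For concreteness, a hedged illustration of the trade-off (standalone CUDA/HIP, not QUDA code):

```cpp
// With a 256-thread promise, the compiler can budget more registers per
// thread before it is forced to spill to local memory.
__global__ void __launch_bounds__(256) bounded_kernel(float *out, const float *in)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i] * 2.0f;
}

// With no launch bounds, the compiler must assume the hardware maximum
// (1024 threads per block), so a register-hungry kernel may spill even
// if it is never launched that wide at runtime.
__global__ void unbounded_kernel(float *out, const float *in)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i] * 2.0f;
}
```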

Hope that makes sense.

Would it be possible to add these new bounds to specific kernels that need them?

Yes, and ideally that's what I'd like to do.

The problem here isn't a performance one; it's a correctness one. None of the launch bounds infrastructure is actually used by QUDA to build launch configurations that respect the launch bounds constraints. The only thing that gets used is maxThreadsPerBlock, which calls into the device properties. At present the max-threads device property returns 1024, regardless of what launch bounds you used for the kernel. And since QUDA doesn't know about the launch bounds, it doesn't understand that a kernel built with launch bounds of 256 isn't allowed to be launched with a fatter thread block.
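A hedged sketch of that failure mode (identifiers illustrative; this is not QUDA's tuner code):

```cpp
#include <hip/hip_runtime.h>

__global__ void __launch_bounds__(256) bounded_kernel() { /* ... */ }

void launch_too_wide()
{
  hipDeviceProp_t prop;
  hipGetDeviceProperties(&prop, 0);

  // prop.maxThreadsPerBlock is 1024, so this block size satisfies the
  // device-wide limit...
  dim3 block(prop.maxThreadsPerBlock);

  // ...but violates the kernel's own 256-thread launch bounds, so the
  // launch fails at runtime with an invalid-configuration/resource error.
  bounded_kernel<<<dim3(1), block>>>();
}
```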

I have some larger criticisms of the launch bounds infrastructure that probably warrant a conversation over the phone. Right now this infrastructure is device-specific (it lives in the device namespace), and there are multiple launch bounds methods for 1-D, 2-D, and 3-D kernels, reduction kernels, and block-ortho kernels. That doesn't make sense to me at all: in terms of launch configurations for register allocation there is no difference between a 1-D kernel and a 3-D kernel, and the same applies to all the reduction kernels. There's also no real benefit to having this infrastructure when it is entirely compile-time information; I see a bigger benefit in allowing the autotuner to JIT this information as part of the kernel. That's a different conversation, though.
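To make the criticism concrete, a sketch of the kind of unification being suggested: one compile-time bound per kernel, independent of dimensionality or whether it reduces. Purely illustrative; this is not QUDA's actual interface:

```cpp
// Default bound shared by all kernels, whether 1-D, 3-D, reduction,
// or block-ortho.
template <typename Kernel> struct launch_bounds {
  static constexpr unsigned max_block = 1024;
};

// A register-hungry kernel tightens its bound with a single
// specialisation, and the launch-config machinery reads it back.
struct BlockOrthoKernel;
template <> struct launch_bounds<BlockOrthoKernel> {
  static constexpr unsigned max_block = 256;
};
```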

I'm happy to not merge this PR for now, but I'd like to leave it open until we can come to an agreement on what the most appropriate next step is.

maddyscientist commented 2 months ago

@dmcdougall one easy fix we could go with for the failing gauge_alg_test is to just set the default parameters such that they are guaranteed to work. I suspect the issue is due to some of the gauge-fixing kernels: unless specified otherwise, the default number of threads in the x dimension is given by the architecture's warp (wave) size; however, some variants of the gauge-fixing kernels require 8 threads in the y dimension. On AMD we therefore have 8 * 64 = 512 > 256 threads. We could set the default number of threads in the x dimension to a hard-coded 32 as opposed to the warp size. Obviously not a robust fix, but it would get you over this bump in the road.
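The arithmetic, spelled out (variable names hypothetical):

```cpp
// Default block-size choice described above: block.x defaults to the
// warp/wave size, and some gauge-fixing variants need block.y = 8.
unsigned warp_size = 64;   // AMD wavefront size; 32 on NVIDIA
dim3 block(warp_size, 8);  // 64 * 8 = 512 threads > 256 launch bounds -> invalid

// The proposed stopgap: hard-code block.x = 32 instead of the wave size,
// giving 32 * 8 = 256 threads, which exactly fits the 256 bound.
dim3 stopgap_block(32, 8);
```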

dmcdougall commented 2 months ago

one easy fix we could go with for the failing gauge_alg_test is to just set the default parameters such that they are guaranteed to work.

That's what this PR does, but it does it for all the other kernels too.

We could set the default number of threads in the x dimension to a hard-coded 32 as opposed to the warp size. Obviously not a robust fix, but it would get you over this bump in the road.

Just for the gauge fixing kernels, correct? Yes, that's another approach. I just wonder what other things might slip through the cracks.

Happy to do it either way.

maddyscientist commented 2 months ago

@dmcdougall before we merge, can you confirm this fixes the issue you were seeing?

dmcdougall commented 2 months ago

Working on it.

dmcdougall commented 2 months ago

Ok, this fixes the GaugeAlgTest.Landau_Overrelaxation/double failure I was seeing.

I'm currently running the full suite, which will take some time to complete. Feel free to merge and I'll deal with any further problems as they arise.

Or if you want to wait for the full suite to finish, that's fine too.

dmcdougall commented 2 months ago

Ok, the full test suite is done and I don't see any new (unexpected) failures. Feel free to merge at your convenience.