mumax / 3

GPU-accelerated micromagnetic simulator

panic: CURAND_STATUS_LENGTH_NOT_MULTIPLE issue when grid size doesn't have small prime factors EDIT: when grid size odd and temperature finite #314

Open alexanderwelbourne opened 1 year ago

alexanderwelbourne commented 1 year ago

The engine throws a hard-to-decipher error when the grid size doesn't have small prime factors. In our case, using SetGridsize(625, 625, 1) resulted in: panic: CURAND_STATUS_LENGTH_NOT_MULTIPLE

I believe this is related to 625 having 5 as its only prime factor. Changing to, for example, 650 fixed the panic. I'm aware that using small prime factors is always going to be better for performance anyway, but perhaps there is a better way to catch this error and tell the user what they should change.

jplauzie commented 1 year ago

Hello,

Are you by chance using a nonzero temperature in your script? That error usually pops up with temperature, because cuRAND requires an even number of values when generating the random numbers (it's tied to the PRNG method it uses). Not coincidentally, 650 is even. If you run it without temperature, a gridsize of 625 should work just fine.
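To make the temperature case concrete, here is a minimal pre-flight check you could run on your grid before launching a script. It is only a sketch: the function name `check_thermal_grid` is made up for illustration, and it assumes that what matters is the total number of cells being even, which is consistent with 625x625x1 failing and 650x650x1 working but is not something I have confirmed in the mumax source.

```python
def check_thermal_grid(nx, ny, nz, temperature):
    """Return a readable warning string, or None if the grid looks fine.

    Assumption (not confirmed against the mumax source): the random-number
    request scales with the total cell count, so an odd total is the problem.
    """
    n_cells = nx * ny * nz
    if temperature != 0 and n_cells % 2 != 0:
        return (
            f"grid {nx}x{ny}x{nz} has an odd number of cells ({n_cells}); "
            "with nonzero temperature this may trigger "
            "CURAND_STATUS_LENGTH_NOT_MULTIPLE"
        )
    return None


# The grid from this issue, with and without temperature.
print(check_thermal_grid(625, 625, 1, 300))  # warning string
print(check_thermal_grid(650, 650, 1, 300))  # None
print(check_thermal_grid(625, 625, 1, 0))    # None
```

Something along these lines inside the engine would give a much friendlier message than the raw cuRAND panic.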

(There are also some restrictions on mesh size even without temperature, but they are much rarer to hit than the cuRAND restrictions. I mention it because, funnily enough, 626 is one of the sizes that fails, with prime factors 2 and 313. Those cases give a different error (CUDA_ERROR_INVALID_VALUE), and you probably won't notice it until you use a size with a rather large prime factor. In my quick testing, it seems fine as long as you don't use prime factors larger than 127 or so (128 is 256/2), so it's rare to run into. Skimming the cuFFT documentation, this seems to be where cuFFT switches from the Cooley-Tukey FFT algorithm to Bluestein, so I would guess it is tied to that. For example, 624 works just fine, with prime factors 2, 3 and 13. So does 143, with prime factors 11 and 13. 262, with prime factors 2 and 131, does not, etc.)
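If you want to check a size against that rule of thumb, a quick factorization is enough. The sketch below uses plain trial division; the helper names are hypothetical, and the 127 cutoff is just what I saw in my quick testing, not a documented limit.

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)  # whatever is left is itself prime
    return factors


def largest_prime_factor(n):
    """Largest prime factor of n, or 1 if n has none (n == 1)."""
    return max(prime_factors(n), default=1)


# Empirical cutoff from the testing described above; an assumption, not a spec.
MAX_SAFE_PRIME = 127

for size in (624, 143, 262, 626):
    ok = largest_prime_factor(size) <= MAX_SAFE_PRIME
    print(size, prime_factors(size), "ok" if ok else "large prime factor")
```

This flags 262 and 626 and passes 624 and 143, matching what I saw.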

I think if you stick to 'nice' numbers that are both even and 7-smooth (no prime factors larger than 7), you should be ok.
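For picking such a size, a small search is enough. This is a sketch with made-up helper names (`is_7_smooth`, `next_nice_size`); it simply rounds a requested size up to the next even 7-smooth number.

```python
def is_7_smooth(n):
    """True if n (>= 1) has no prime factor larger than 7."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1


def next_nice_size(n):
    """Smallest even, 7-smooth integer that is >= n."""
    candidate = max(n, 2)
    if candidate % 2:
        candidate += 1  # start from an even number
    while not is_7_smooth(candidate):
        candidate += 2  # stay on even numbers
    return candidate


print(next_nice_size(625))  # 630 = 2 * 3^2 * 5 * 7
```

Note that this is stricter than what is needed just to avoid the two errors above (650 = 2 * 5^2 * 13 is not 7-smooth and still ran), but small prime factors are better for performance anyway.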

It would be good if this were communicated better to the user in the error message, though; currently it is a bit buried on the mumax boards. The requirement of an even gridsize with temperature in particular is common enough to run into.

Cheers, Josh L.

alexanderwelbourne commented 1 year ago

Thanks for the very thorough debugging! We were indeed using nonzero temperature. I agree that an error message would be helpful (I guess it could be implemented relatively straightforwardly for the case of temperature and odd gridsizes), although at least people running into this might now find this issue and your very informative answer. Shall I leave the issue open until an error message is added?

Many thanks! Alex