mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0
13 stars 12 forks source link

Fix thread_count computation. #30

Closed cantonios closed 3 months ago

cantonios commented 3 months ago

Down in multi_iter, we attempt to compute calcShare(nshares, myshare, 0, rem). If rem < nshares, then we have too many shares (i.e. threads), resulting in lo == hi, which then triggers the assertion

MR_assert(lo==0, "must not happen");

Copying the rem calculation to thread_count and limiting the number of threads to this value allows us to compute a valid number of shares.

mreineck commented 3 months ago

I have committed a tentative fix with some more sanity checks. Could you please check whether the current ducc0 HEAD fixes the issue?

cantonios commented 3 months ago

I have committed a tentative fix with some more sanity checks. Could you please check whether the current ducc0 HEAD fixes the issue?

Yes, thanks, that fixes the issue.

mreineck commented 3 months ago

Maybe I'm on the entirely wrong track, but just to have mentioned it: you stated at some point that the case triggering the problem was a "typical case" for you. If I analyzed the problem correctly, it should only have turned up for 1D transforms on non-contiguous arrays. Is it possible that you are submitting many 1D FFTs separately where a collective call (e.g. on a 2D array with axes=(0,)) would also have worked? If so, please consider submitting the FFTs in the largest possible batches; this should really help performance a lot!

cantonios commented 3 months ago

Maybe I'm on the entirely wrong track, but just to have mentioned it: you stated at some point that the case triggering the problem was a "typical case" for you. If I analyzed the problem correctly, it should only have turned up for 1D transforms on non-contiguous arrays. Is it possible that you are submitting many 1D FFTs separately where a collective call (e.g. on a 2D array with axes=(0,)) would also have worked? If so, please consider submitting the FFTs in the largest possible batches; this should really help performance a lot!

We saw the issue on tests of 1D data with in_stride = 1. For example, calling ducc0::r2c with

input shape: {32768}
input stride: {1}
output shape: {16385}
output stride: {1}
axes: {0}
forward: true
scale: 1
mreineck commented 3 months ago

Of course! Sorry, I was only looking at "symmetric" transforms (complex-to-complex and real-to-real), and there the "in-place" special treatment will always kick in when arrays are contiguous. With mixed data types the algorithm would still end up on the code path with the bug.

Thanks for the explanation!