mrirecon / bart

BART: Toolbox for Computational Magnetic Resonance Imaging
https://mrirecon.github.io/bart/
BSD 3-Clause "New" or "Revised" License

3D NIHT fails when num_auto_parallelize = true #246

Open hakkelt opened 3 years ago

hakkelt commented 3 years ago

I'm trying to run the following set of commands:

bart phantom -3 phan
bart fft $(bart bitmask 0 1 2) phan y
bart ones 3 128 128 128 smap
bart pics -R N:$(bart bitmask 0 1 2):0:10 y smap recon

The last command fails with the following assertion error:

    bart: /home/hakta/.bin/bart/src/num/vecops.c:547: klargest_complex_partsort: Assertion 'k <= N' failed.

Similarly, bart pics -R H:$(bart bitmask 0 1 2):0:10 y smap recon fails with the same error message.

After some debugging, I found the source of the error in /src/num/optimize.c (lines 705-735):

    int skip = min_blockdim(N, ND, tdims, nstr1, sizes);
    unsigned long flags = 0;

    debug_printf(DP_DEBUG4, "MD-Fun. Io: %d Input: ", io);
    debug_print_dims(DP_DEBUG4, D, dim);

#ifdef USE_CUDA
    if (num_auto_parallelize && !use_gpu(N, nptr1) && !one_on_gpu(N, nptr1)) {
#else
    if (num_auto_parallelize) {
#endif
        flags = dims_parallel(N, io, ND, tdims, nstr1, sizes);

        while ((0 != flags) && (ffs(flags) <= skip))
            skip--;

        flags = flags >> skip;
    }

    const long* nstr2[N];

    for (unsigned int i = 0; i < N; i++)
        nstr2[i] = *nstr1[i] + skip;

#ifdef USE_CUDA
    debug_printf(DP_DEBUG4, "This is a %s call\n.", use_gpu(N, nptr1) ? "gpu" : "cpu");

    __block struct nary_opt_data_s data = { md_calc_size(skip, tdims), use_gpu(N, nptr1) ? &gpu_ops : &cpu_ops };
#else
    __block struct nary_opt_data_s data = { md_calc_size(skip, tdims), &cpu_ops };
#endif

Because num_auto_parallelize = true, the value of skip is decreased from 3 to 2, and the size field of data computed by md_calc_size(skip, tdims) becomes incorrect, since 3D NIHT cannot be parallelized along the third dimension (or, at least, this feature is not implemented)... Setting num_auto_parallelize = false at line 574 in /src/num/optimize.c solved the problem, but I also saw a slight performance decrease in the FISTA-based reconstructions.

Would it be possible to switch off auto-parallelization only for NIHT reconstructions, or is there an easy way to run NIHT in parallel?

uecker commented 3 years ago

Thanks, I will fix this. The problem is that the higher-level function uses the optimization framework incorrectly (as NIHT cannot be parallelized easily in this automatic way).

sdimoudi commented 3 years ago

Thank you Martin, I will look into this too from my end and get back to you tomorrow. There is also some GPU support I want to finalise and submit within the next couple of weeks; if it is of interest to Hakkel, I will check the 3D case and get back to you.

uecker commented 3 years ago

I think you can't use optimized_nop for this. You need to reorder the dimensions so that all jointly thresholded dimensions are innermost and all others (which can be processed in parallel) are on the outside, by copying into a temporary array (this copy can be skipped if they are already in this order). You can then reshape so that the innermost block is collapsed into one dimension. Finally, use md_parallel_nary to apply the kernel while processing all blocks in parallel, with each block handled by a single kernel application.

hakkelt commented 3 years ago

Thank you both for your help. :) Sofia: Yes, it would be really nice to have this algorithm implemented on GPU even for the 3D case!