mrirecon / bart

BART: Toolbox for Computational Magnetic Resonance Imaging
https://mrirecon.github.io/bart/
BSD 3-Clause "New" or "Revised" License

md_copy2 improvement #151

Closed smyeung closed 5 years ago

smyeung commented 6 years ago

Hi Team,

We are using BART to reconstruct 3D MRI data with dimensions 256 x 256 x 90 and 16 channels (compressed to 7 virtual channels). Using nvvp to analyze the performance of the "bart pics" reconstruction on my system (Debian on an Intel 8700K CPU with 64 GB RAM and a GTX 1080 Ti), more than 90% of the time is spent in memory copies and the kernel occupancy is less than 10%.

E.g. the following l1-wavelet 3D recon: "bart pics -l1 -i 30 -S -r 0.1 -g -G 0 ./ksp_cc ./map_cc ./img_cc" takes ~26 s (5.8 s before the GPU recon, up to sense_init, and 20 s for the GPU recon). Tracing the source code, I found that the performance is bounded by the md_copy2 function, which performs a cudaMemcpy2D only when (ND - skip == 1).

After studying the FIXME comment about memory alignment in md_copy2, we rewrote the code for the use_gpu == 1 case. When ND - skip == 0, a single cudaMemcpy copies the data in one go. For ND - skip >= 1, instead of using the NESTED macro and the corresponding md_nary call, we use an explicit nested loop (with some modifications to the original code) to call cudaMemcpy2D. With this we can reconstruct successfully and cut the recon time from 26 s down to 9.5 s (i.e. 5.8 s before the GPU recon and 3.7 s for the recon); memcpy drops to less than 25% of the time and kernel occupancy goes up to 75%.
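To illustrate the idea (a minimal sketch only, not the actual patch; the helper name, the byte-stride convention, and the flat parameter layout are hypothetical): the two innermost dimensions go into one cudaMemcpy2D, and an explicit loop walks the remaining outer dimensions.

```c
#include <cuda_runtime.h>

// Hypothetical sketch: copy an N-D strided array (N >= 2) by handling the two
// innermost dimensions with one cudaMemcpy2D and looping explicitly over the
// outer dimensions. Strides are in bytes; the innermost dimension is assumed
// contiguous, so width_bytes = dims[0] * element size.
static void copy_nd_via_memcpy2d(int N, const long dims[],
		void* dst, const long ostr[],
		const void* src, const long istr[], long width_bytes)
{
	long outer = 1;		// number of 2D slices = product of dims[2..N-1]

	for (int d = 2; d < N; d++)
		outer *= dims[d];

	for (long o = 0; o < outer; o++) {

		// decompose the flat outer index into per-dimension byte offsets
		long rem = o;
		long ooff = 0;
		long ioff = 0;

		for (int d = 2; d < N; d++) {

			long idx = rem % dims[d];
			rem /= dims[d];
			ooff += idx * ostr[d];
			ioff += idx * istr[d];
		}

		// one strided 2D device-to-device copy per outer index
		cudaMemcpy2D((char*)dst + ooff, ostr[1],
			(const char*)src + ioff, istr[1],
			width_bytes, dims[1], cudaMemcpyDeviceToDevice);
	}
}
```

In md_copy2 itself the dimensions and strides of course come from the optimized tdims/nstr2 arrays rather than from flat parameters like these.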
[screenshot: nvvp timeline after the modification, 2018-09-03]

A closer look at one iteration looks like this (the kernels look much busier ;)): [screenshot: nvvp timeline of one iteration, 2018-09-03]

The modified version seems fine for the workshop example and some other demo code as well.

Let me touch up the code, see if it can be combined with the NESTED macro, and share it with you later. After this, I am thinking of making some further improvements.

Cheers, Simon

uecker commented 6 years ago

Thank you! Please note that you have to release your contribution under the BSD license for us to be able to use it in BART. Looking forward to your contribution! If you plan larger changes, it is a good idea to discuss this with me first...

smyeung commented 6 years ago

Hi,

Thanks for the reminder and yes we can contribute this under the BSD license.

Yes, the md_ function series is one of the core components and needs dedicated handling, and I really don't want to mess up the original, beautiful source code. At the moment, I am using BART to reconstruct previous datasets. I am planning to identify the bottlenecks I am facing and try to optimize them.

It will be nice to discuss this with you and the other contributors and make BART an even better tool!

BTW, progress update: I implemented a CUDA-kernel-level looped strided memcpy for higher-dimensional arrays (3D, 4D, 5D, and 6D). The corresponding kernels' occupancy is 99%. Compared to cudaMemcpy2D, the performance only improved by 3-4% in my case.
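For reference, a minimal sketch of what such a kernel-level strided copy could look like (hypothetical names and float elements; this is not the attached implementation): each thread derives its multi-index from a flat index via a grid-stride loop and copies one element between the strided source and destination.

```cuda
#include <cuda_runtime.h>

#define MAX_COPY_DIMS 8

// Hypothetical sketch of a kernel-level strided copy: dimensions and element
// strides are passed by value, so no extra device allocation is needed.
struct copy_desc {

	int N;
	long dims[MAX_COPY_DIMS];
	long ostr[MAX_COPY_DIMS];	// output strides in elements
	long istr[MAX_COPY_DIMS];	// input strides in elements
};

__global__ void strided_copy_kernel(float* dst, const float* src, struct copy_desc cd, long total)
{
	long step = (long)gridDim.x * blockDim.x;

	for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < total; i += step) {

		// decompose the flat index into per-dimension indices and
		// accumulate the strided input/output offsets
		long rem = i;
		long ooff = 0;
		long ioff = 0;

		for (int d = 0; d < cd.N; d++) {

			long idx = rem % cd.dims[d];
			rem /= cd.dims[d];
			ooff += idx * cd.ostr[d];
			ioff += idx * cd.istr[d];
		}

		dst[ooff] = src[ioff];
	}
}
```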

Best, Simon


smyeung commented 6 years ago

Hi Team,

Here's the implementation explanation: 1) In ./src/num/multind.c

Attached are the modified files mentioned above. Please let us know whether they work for you. Also let us know if you encounter any problems using them and we will try our best to improve them accordingly.

Best, Simon


smyeungx commented 6 years ago

Here's the attachment: github.bart.zip

smyeungx commented 6 years ago

Hi All,

Found out what the problem in the original code is. The (skip + 1) pointer offset that is applied for the 2D memcpy when entering the nested loop has to be applied to the strides too. Check this out: `const long* nstr[2] = { *nstr2[0] + (skip + 1), *nstr2[1] + (skip + 1) };`

You can refer to the following modified code (for easy comparison, the skip++ is commented out and (skip + 1) is applied in the md_nary call):

```c
	if (use_gpu && (ND - skip >= 1)) {

		// FIXME: the test was > 0 which would optimize transpose
		// but failes in the second cuda_memcpy_strided call
		// probably because of alignment restrictions

		const long* nstr[2] = { *nstr2[0] + (skip + 1), *nstr2[1] + (skip + 1) };
		void* nptr[2] = { optr, (void*)iptr };

		long sizes[2] = { md_calc_size(skip, tdims) * size, tdims[skip] };
		long ostr2 = (*nstr2[0])[skip];
		long istr2 = (*nstr2[1])[skip];

		// skip++;

		long* sizesp = sizes;	// because of clang

		NESTED(void, nary_strided_copy, (void* ptr[]))
		{
			// printf("CUDA 2D copy %ld %ld %ld %ld %ld %ld\n", data->sizes[0], data->sizes[1], data->ostr, data->istr, (long)ptr[0], (long)ptr[1]);

			cuda_memcpy_strided(sizesp, ostr2, ptr[0], istr2, ptr[1]);
		};

		md_nary(2, ND - (skip + 1), tdims + (skip + 1), nstr, nptr, nary_strided_copy);
		return;
	}
```
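For comparison, the same fix can also be written with the skip++ kept, as long as the stride pointers are then offset by the already-incremented skip (a sketch of the equivalent form, not necessarily the exact change that went into the repository):

```c
	if (use_gpu && (ND - skip >= 1)) {

		long sizes[2] = { md_calc_size(skip, tdims) * size, tdims[skip] };
		long ostr2 = (*nstr2[0])[skip];
		long istr2 = (*nstr2[1])[skip];

		skip++;

		// the same (skip + 1) offset as above, written with the incremented skip
		const long* nstr[2] = { *nstr2[0] + skip, *nstr2[1] + skip };
		void* nptr[2] = { optr, (void*)iptr };

		long* sizesp = sizes;	// because of clang

		NESTED(void, nary_strided_copy, (void* ptr[]))
		{
			cuda_memcpy_strided(sizesp, ostr2, ptr[0], istr2, ptr[1]);
		};

		md_nary(2, ND - skip, tdims + skip, nstr, nptr, nary_strided_copy);
		return;
	}
```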
uecker commented 6 years ago

Thanks Simon,

I will fix it.

But I assume your copy routine is more flexible? I will look at it as soon as I can.

Best, Martin


smyeungx commented 6 years ago

Hi Martin,

Thanks.

I am fine with polishing the device-to-device strided memcpy .cu code. It will be several simple kernel-level, multi-dimensional, explicitly nested, variable-size copy routines. In my application it only needs up to ND-size = 4, so I implemented it up to 8D just in case I encounter that later. The additional performance gain (on top of the 2D strided copy) for 3D and higher strided copies is only a few percent in these cases (vs. the ~18x improvement in the l1-wavelet recon from using the corrected original code).

Our application needs almost real-time parallel-imaging reconstruction and display, and since the recon process can go down from 26 s to 7 s, every second is valuable to us. I spotted another simple improvement in the l1-wavelet pipeline; let me open another ticket and discuss it with you.

Simon

uecker commented 5 years ago

I assume this can be closed. Thank you for your help.