try -fopenmp-cuda-mode flag

t-hishinuma commented 2 years ago

memo:

Clang supports two data-sharing models for CUDA devices: Generic mode and Cuda mode. The default is Generic mode. Cuda mode can give additional performance and is enabled with the -fopenmp-cuda-mode flag. In Generic mode, all local variables that can be shared in parallel regions are stored in global memory. In Cuda mode, local variables are not shared between threads, and it is the user's responsibility to share the required data between threads in parallel regions.

https://clang.llvm.org/docs/OpenMPSupport.html#basic-support-for-cuda-devices
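To make the difference concrete, here is a minimal sketch of the two loop shapes the description above distinguishes. The function names and the dense row-major matrix are hypothetical and only for illustration; this is not monolish code. In `matvec_nested`, `row_sum` is a local declared outside the inner parallel region and referenced from it, which is the kind of variable Generic mode would place in global memory. In `matvec_combined`, every local lives inside the single combined construct, so nothing needs to be shared between threads.

```cpp
#include <cassert>

// Hypothetical example: y = A * x for a dense n x n row-major matrix.
// Nested form: row_sum is declared in the teams region and referenced by the
// inner parallel region, so it is a local that has to be shared.
void matvec_nested(const double* A, const double* x, double* y, int n) {
  assert(n >= 0);
#pragma omp target teams distribute map(to: A[0:n*n], x[0:n]) map(from: y[0:n])
  for (int i = 0; i < n; ++i) {
    double row_sum = 0.0;  // local shared with the parallel region below
#pragma omp parallel for reduction(+: row_sum)
    for (int j = 0; j < n; ++j) {
      row_sum += A[i * n + j] * x[j];
    }
    y[i] = row_sum;
  }
}

// Combined form: one construct, each thread handles whole rows, and row_sum
// is private to that thread. No local is shared across threads.
void matvec_combined(const double* A, const double* x, double* y, int n) {
  assert(n >= 0);
#pragma omp target teams distribute parallel for map(to: A[0:n*n], x[0:n]) map(from: y[0:n])
  for (int i = 0; i < n; ++i) {
    double row_sum = 0.0;  // private to the thread that owns row i
    for (int j = 0; j < n; ++j) {
      row_sum += A[i * n + j] * x[j];
    }
    y[i] = row_sum;
  }
}
```

Without OpenMP enabled the pragmas are ignored and both functions run serially on the host, so the sketch can be checked anywhere. Whether the nested form actually misbehaves under -fopenmp-cuda-mode has not been verified here; it is only the pattern that the description says becomes the user's responsibility.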

t-hishinuma commented 2 years ago

with option (-fopenmp-cuda-mode)

func          prec    size  iter  time[sec]  time/iter[sec]
CG(none,CRS)  float   500   1000  4.70915    0.00470915
CG(none,CRS)  float   1000  1000  6.28332    0.00628332
CG(none,CRS)  float   1500  1000  6.45597    0.00645597
CG(none,CRS)  float   2000  1000  7.16501    0.00716501
CG(none,CRS)  double  500   1000  4.03438    0.00403438
CG(none,CRS)  double  1000  1000  6.47584    0.00647584
CG(none,CRS)  double  1500  1000  7.15447    0.00715447
CG(none,CRS)  double  2000  1000  7.84225    0.00784225

without option

func          prec    size  iter  time[sec]  time/iter[sec]
CG(none,CRS)  float   500   1000  4.38381    0.00438381
CG(none,CRS)  float   1000  1000  5.96086    0.00596086
CG(none,CRS)  float   1500  1000  6.16371    0.00616371
CG(none,CRS)  float   2000  1000  6.96616    0.00696616
CG(none,CRS)  double  500   1000  3.90027    0.00390027
CG(none,CRS)  double  1000  1000  6.31521    0.00631521
CG(none,CRS)  double  1500  1000  7.07398    0.00707398
CG(none,CRS)  double  2000  1000  7.96718    0.00796718

???

it does not look good..? With the option, every case except double at size 2000 is slower.
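A small sketch for putting numbers on the two tables above: it computes the relative change in time[sec] with the option versus without it for each row. The struct and function names are made up for illustration; the timings are copied from the tables as posted.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// One benchmark row: total time[sec] for 1000 CG iterations.
struct Row {
  const char* prec;
  int size;
  double with_opt;     // built with -fopenmp-cuda-mode
  double without_opt;  // built without it
};

const std::array<Row, 8> kRows = {{
    {"float", 500, 4.70915, 4.38381},
    {"float", 1000, 6.28332, 5.96086},
    {"float", 1500, 6.45597, 6.16371},
    {"float", 2000, 7.16501, 6.96616},
    {"double", 500, 4.03438, 3.90027},
    {"double", 1000, 6.47584, 6.31521},
    {"double", 1500, 7.15447, 7.07398},
    {"double", 2000, 7.84225, 7.96718},
}};

// Relative change in percent; positive means the option made the run slower.
double slowdown_percent(const Row& r) {
  assert(r.without_opt > 0.0);
  return (r.with_opt / r.without_opt - 1.0) * 100.0;
}

// Number of rows where the run with the option took longer.
int count_slower_with_option() {
  int count = 0;
  for (const Row& r : kRows) {
    if (r.with_opt > r.without_opt) ++count;
  }
  return count;
}

// Print one line per row with the signed relative change.
void print_report() {
  for (const Row& r : kRows) {
    std::printf("%-6s %5d  %+.2f%%\n", r.prec, r.size, slowdown_percent(r));
  }
}
```

Calling `print_report()` from a `main` lists the per-row change. These are single runs, so small differences may be within run-to-run noise.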

t-hishinuma commented 2 years ago

done

ricosjp / monolish

try -fopenmp-cuda-mode flag #95