paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Compiling Grid for AMD GPUs #343

Open philomat opened 3 years ago

philomat commented 3 years ago

I know the wiki says there is currently no support for AMD GPUs. But I saw commits concerning HIP. Is there a way one could try experimenting with Grid on AMD GPUs?

paboyle commented 3 years ago

Yes - HIP is believed to be working, but not yet efficient, on AMD GPUs. You might try Benchmark_dwf_fp32 with the --dslash-unroll flag; it is new as of a few days ago.

paboyle commented 3 years ago

Status of multi-GPU and the "nvlink" equivalent is untested. --enable-shm=none and MPI between GPUs is probably safer.

paboyle commented 3 years ago

BTW, I have benchmarked AMD MI50 and MI100, but want to revisit with the new explicit Nc=3 kernel.

I have also compiled under HIP on Summit for Nvidia, and got the same performance as the CUDA compile.

philomat commented 3 years ago

I was able to compile Grid, and also to run the benchmark you suggested. However, some of the tests are failing; e.g., Test_wilson_clover fails with:

Grid : Message : MemoryManager::Init() Using hipMalloc
Grid : Message : 0.335145 s : Grid is setup to use 1 threads
Grid : Message : 0.335153 s : Grid floating point word size is REALF4
Grid : Message : 0.335154 s : Grid floating point word size is REALD8
Grid : Message : 0.335155 s : Grid floating point word size is REAL8
Memory access fault by GPU node-2 (Agent handle: 0x1f04be0) on address 0x7f3e57b92000. Reason: Page not present or supervisor privilege.
[qcd20g01:2774949] *** Process received signal ***
[qcd20g01:2774949] Signal: Aborted (6)
[qcd20g01:2774949] Signal code:  (-6)
[qcd20g01:2774949] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f3d7c01bdd0]
[qcd20g01:2774949] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3d7a6c170f]
[qcd20g01:2774949] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3d7a6abb25]
[qcd20g01:2774949] [ 3] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1bd2b)[0x7f3d796d2d2b]
[qcd20g01:2774949] [ 4] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x61f4d)[0x7f3d79718f4d]
[qcd20g01:2774949] [ 5] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1fd97)[0x7f3d796d6d97]
[qcd20g01:2774949] [ 6] /lib64/libpthread.so.0(+0x82de)[0x7f3d7c0112de]
[qcd20g01:2774949] [ 7] /lib64/libc.so.6(clone+0x43)[0x7f3d7a785e83]
[qcd20g01:2774949] *** End of error message ***
Aborted (core dumped) 

And Test_nersc_io fails because the plaquette is not correctly reproduced:

Grid : Message : 2.770978 s : NERSC Configuration ./ckpoint_lat.4000 checksum 5c0ac22a header   5c0ac22a
Grid : Message : 2.770987 s : NERSC Configuration ./ckpoint_lat.4000 plaquette 9.83463e-05 header    0.0507244
Grid : Message : 2.771006 s : NERSC Configuration ./ckpoint_lat.4000 link_trace 0.240144 header    0.000115627
 Plaquette mismatch 
Test_nersc_io: /home/scior/Grid/Grid/parallelIO/NerscIO.h:201: static void Grid::NerscIO::readConfiguration(Grid::NerscIO::GaugeField &, Grid::FieldMetaData &, std::string, GaugeStats) [GaugeStats = Grid::GaugeStatistics<Grid::PeriodicGaugeImpl<Grid::GaugeImplTypes<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2>>>>, 3, 12>>>]: Assertion `fabs(clone.plaquette -header.plaquette ) < 1.0e-5' failed.
[qcd20g01:2775537] *** Process received signal ***
[qcd20g01:2775537] Signal: Aborted (6)
[qcd20g01:2775537] Signal code:  (-6)
[qcd20g01:2775537] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f68a70c8dd0]
[qcd20g01:2775537] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f68a576e70f]
[qcd20g01:2775537] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f68a5758b25]
[qcd20g01:2775537] [ 3] /lib64/libc.so.6(+0x219f9)[0x7f68a57589f9]
[qcd20g01:2775537] [ 4] /lib64/libc.so.6(+0x2fcc6)[0x7f68a5766cc6]
[qcd20g01:2775537] [ 5] ./Test_nersc_io[0x40ef28]
[qcd20g01:2775537] [ 6] ./Test_nersc_io[0x40736a]
[qcd20g01:2775537] [ 7] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f68a575a6a3]
[qcd20g01:2775537] [ 8] ./Test_nersc_io[0x40602e]
[qcd20g01:2775537] *** End of error message ***
Aborted (core dumped)

Other tests, e.g. Test_wilson_even_odd, seem to work fine.

The configure command I used is:

../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-3.9.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-3.9.0/ -I/home/scior/Thrust/" --with-lime=../depencencies/lime

paboyle commented 3 years ago

Thanks - I haven't tried WilsonClover on GPU to be honest, so I'm not absolutely sure it works on Nvidia either.

Re. the plaquette - this does work on CUDA, so something interesting to look at on HIP... Where are you running this?

paboyle commented 3 years ago

HIP is definitely in the "experimental" category for now, but getting everything to work would be good. Glad to see you are running on ROCm 3.9, which is recent/up to date.

philomat commented 3 years ago

I am running on a machine at JLab

paboyle commented 3 years ago

I should have asked specifically what hardware you are running on, rather than where it is physically located.

philomat commented 3 years ago

It's a machine equipped with 4 Vega 20 cards and an AMD Epyc CPU

paboyle commented 3 years ago

Can you tell me the performance you get with

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16

and

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll

Thanks

philomat commented 3 years ago

Here are the results for Benchmark_dwf_fp32 --grid 16.16.16.16:

Grid : Message : 3.846120 s : *****************************************************************
Grid : Message : 3.846121 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.846123 s : * Vectorising space-time by 8
Grid : Message : 3.846124 s : * VComplexF size is 64 B
Grid : Message : 3.846126 s : * SINGLE precision 
Grid : Message : 3.846127 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 3.846128 s : *****************************************************************
Grid : Message : 3.989939 s : Called warmup
Grid : Message : 8.192125 s : Called Dw 1000 times in 4.20216e+06 us
Grid : Message : 8.192172 s : mflop/s =   329383
Grid : Message : 8.192175 s : mflop/s per rank =  329383
Grid : Message : 8.192177 s : mflop/s per node =  329383
Grid : Message : 8.192179 s : RF  GiB/s (base 2) =   669.298
Grid : Message : 8.192181 s : mem GiB/s (base 2) =   418.311
Grid : Message : 8.192877 s : norm diff   1.08494e-16
Grid : Message : 8.222942 s : #### Dhop calls report 
Grid : Message : 8.222945 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 8.222948 s : WilsonFermion5D TotalTime   /Calls        : 2106.55 us
Grid : Message : 8.222950 s : WilsonFermion5D CommTime    /Calls        : 0.838162 us
Grid : Message : 8.222952 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 8.222954 s : WilsonFermion5D ComputeTime1/Calls        : 2105.66 us
Grid : Message : 8.222956 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 8.222959 s : Average mflops/s per call                : 334643
Grid : Message : 8.222961 s : Average mflops/s per call per rank       : 334643
Grid : Message : 8.222963 s : Average mflops/s per call per node       : 334643
Grid : Message : 8.222971 s : Average mflops/s per call (full)         : 334501
Grid : Message : 8.222973 s : Average mflops/s per call per rank (full): 334501
Grid : Message : 8.222975 s : Average mflops/s per call per node (full): 334501
Grid : Message : 8.222977 s : WilsonFermion5D Stencil
Grid : Message : 8.222980 s :  Stencil calls 1001
Grid : Message : 8.222989 s :  Stencil halogtime 0.0679321
Grid : Message : 8.222991 s :  Stencil gathertime 0
Grid : Message : 8.222993 s :  Stencil gathermtime 0
Grid : Message : 8.222997 s :  Stencil mergetime 0.0649351
Grid : Message : 8.223033 s :  Stencil decompresstime 0.0759241
Grid : Message : 8.223040 s : WilsonFermion5D StencilEven
Grid : Message : 8.223046 s : WilsonFermion5D StencilOdd
Grid : Message : 8.223049 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 8.223050 s :  timer0 (HaloGatherOpt) 0.525475
Grid : Message : 8.223053 s :  timer1 (Communicate)   0.0464535
Grid : Message : 8.223055 s :  timer2 (CommsMerge )   0.0774226
Grid : Message : 8.223059 s :  timer3 (commsMergeShm) 0.137363
Grid : Message : 8.223061 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 8.223062 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 8.579627 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 8.579647 s : Called DwDag
Grid : Message : 8.579648 s : norm dag result 0.0116481
Grid : Message : 8.589324 s : norm dag ref    0.0116857
Grid : Message : 8.599907 s : norm dag diff   7.42172e-17
Grid : Message : 8.620787 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 8.663312 s : src_e0.000481094
Grid : Message : 8.672821 s : src_o0.000478626
Grid : Message : 8.681868 s : *********************************************************
Grid : Message : 8.681870 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 8.681871 s : * Vectorising space-time by 8
Grid : Message : 8.681873 s : * SINGLE precision 
Grid : Message : 8.681874 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 8.681875 s : *********************************************************
Grid : Message : 10.828743 s : Deo mflop/s =   323085
Grid : Message : 10.828758 s : Deo mflop/s per rank   323085
Grid : Message : 10.828760 s : Deo mflop/s per node   323085
Grid : Message : 10.828762 s : #### Dhop calls report 
Grid : Message : 10.828763 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 10.828765 s : WilsonFermion5D TotalTime   /Calls        : 2144.66 us
Grid : Message : 10.828767 s : WilsonFermion5D CommTime    /Calls        : 1.67133 us
Grid : Message : 10.828769 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 10.828771 s : WilsonFermion5D ComputeTime1/Calls        : 2142.9 us
Grid : Message : 10.828773 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 10.828776 s : Average mflops/s per call                : 328827
Grid : Message : 10.828781 s : Average mflops/s per call per rank       : 328827
Grid : Message : 10.828783 s : Average mflops/s per call per node       : 328827
Grid : Message : 10.828787 s : Average mflops/s per call (full)         : 328556
Grid : Message : 10.828789 s : Average mflops/s per call per rank (full): 328556
Grid : Message : 10.828792 s : Average mflops/s per call per node (full): 328556
Grid : Message : 10.828794 s : WilsonFermion5D Stencil
Grid : Message : 10.828799 s : WilsonFermion5D StencilEven
Grid : Message : 10.828802 s : WilsonFermion5D StencilOdd
Grid : Message : 10.828805 s :  Stencil calls 1001
Grid : Message : 10.828808 s :  Stencil halogtime 0.0629371
Grid : Message : 10.828814 s :  Stencil gathertime 0
Grid : Message : 10.828817 s :  Stencil gathermtime 0
Grid : Message : 10.828819 s :  Stencil mergetime 0.0659341
Grid : Message : 10.828822 s :  Stencil decompresstime 0.0589411
Grid : Message : 10.828824 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 10.828827 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 10.828829 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.828830 s :  timer0 (HaloGatherOpt) 1.04895
Grid : Message : 10.828833 s :  timer1 (Communicate)   0.0889111
Grid : Message : 10.828835 s :  timer2 (CommsMerge )   0.150849
Grid : Message : 10.828838 s :  timer3 (commsMergeShm) 0.293706
Grid : Message : 10.840104 s : r_e0.00576327
Grid : Message : 10.845382 s : r_o12.0372
Grid : Message : 10.850773 s : res0.011619
Grid : Message : 10.911593 s : norm diff   0
Grid : Message : 10.967415 s : norm diff even  0
Grid : Message : 10.976614 s : norm diff odd   0

and here for Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll:

Grid : Message : 3.841540 s : *****************************************************************
Grid : Message : 3.841541 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.841542 s : * Vectorising space-time by 8
Grid : Message : 3.841543 s : * VComplexF size is 64 B
Grid : Message : 3.841546 s : * SINGLE precision 
Grid : Message : 3.841547 s : * Using Nc=3       WilsonKernels
Grid : Message : 3.841548 s : *****************************************************************
Grid : Message : 3.990716 s : Called warmup
Grid : Message : 9.631144 s : Called Dw 1000 times in 5.64041e+06 us
Grid : Message : 9.631178 s : mflop/s =   245394
Grid : Message : 9.631181 s : mflop/s per rank =  245394
Grid : Message : 9.631183 s : mflop/s per node =  245394
Grid : Message : 9.631185 s : RF  GiB/s (base 2) =   498.634
Grid : Message : 9.631187 s : mem GiB/s (base 2) =   311.646
Grid : Message : 9.631881 s : norm diff   9.95688e-14
Grid : Message : 9.661613 s : #### Dhop calls report 
Grid : Message : 9.661618 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 9.661621 s : WilsonFermion5D TotalTime   /Calls        : 2825.97 us
Grid : Message : 9.661624 s : WilsonFermion5D CommTime    /Calls        : 0.883117 us
Grid : Message : 9.661626 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 9.661628 s : WilsonFermion5D ComputeTime1/Calls        : 2825.03 us
Grid : Message : 9.661630 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 9.661635 s : Average mflops/s per call                : 249428
Grid : Message : 9.661637 s : Average mflops/s per call per rank       : 249428
Grid : Message : 9.661642 s : Average mflops/s per call per node       : 249428
Grid : Message : 9.661645 s : Average mflops/s per call (full)         : 249346
Grid : Message : 9.661714 s : Average mflops/s per call per rank (full): 249346
Grid : Message : 9.661718 s : Average mflops/s per call per node (full): 249346
Grid : Message : 9.661723 s : WilsonFermion5D Stencil
Grid : Message : 9.661727 s :  Stencil calls 1001
Grid : Message : 9.661730 s :  Stencil halogtime 0.0729271
Grid : Message : 9.661732 s :  Stencil gathertime 0
Grid : Message : 9.661736 s :  Stencil gathermtime 0
Grid : Message : 9.661738 s :  Stencil mergetime 0.0549451
Grid : Message : 9.661740 s :  Stencil decompresstime 0.0589411
Grid : Message : 9.661747 s : WilsonFermion5D StencilEven
Grid : Message : 9.661751 s : WilsonFermion5D StencilOdd
Grid : Message : 9.661756 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 9.661757 s :  timer0 (HaloGatherOpt) 0.566933
Grid : Message : 9.661761 s :  timer1 (Communicate)   0.045954
Grid : Message : 9.661764 s :  timer2 (CommsMerge )   0.0789211
Grid : Message : 9.661768 s :  timer3 (commsMergeShm) 0.147353
Grid : Message : 9.661770 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 9.661772 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.187450 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 10.187640 s : Called DwDag
Grid : Message : 10.187650 s : norm dag result 12546.2
Grid : Message : 10.284430 s : norm dag ref    12.1948
Grid : Message : 10.390220 s : norm dag diff   7.52814e-14
Grid : Message : 10.597580 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 10.102292 s : src_e0.502726
Grid : Message : 10.111715 s : src_o0.509216
Grid : Message : 10.120448 s : *********************************************************
Grid : Message : 10.120450 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 10.120451 s : * Vectorising space-time by 8
Grid : Message : 10.120453 s : * SINGLE precision 
Grid : Message : 10.120454 s : * Using Nc=3       WilsonKernels
Grid : Message : 10.120455 s : *********************************************************
Grid : Message : 12.975335 s : Deo mflop/s =   242917
Grid : Message : 12.975349 s : Deo mflop/s per rank   242917
Grid : Message : 12.975351 s : Deo mflop/s per node   242917
Grid : Message : 12.975353 s : #### Dhop calls report 
Grid : Message : 12.975354 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 12.975356 s : WilsonFermion5D TotalTime   /Calls        : 2851.97 us
Grid : Message : 12.975358 s : WilsonFermion5D CommTime    /Calls        : 1.74825 us
Grid : Message : 12.975360 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 12.975362 s : WilsonFermion5D ComputeTime1/Calls        : 2850.14 us
Grid : Message : 12.975364 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 12.975367 s : Average mflops/s per call                : 247231
Grid : Message : 12.975371 s : Average mflops/s per call per rank       : 247231
Grid : Message : 12.975374 s : Average mflops/s per call per node       : 247231
Grid : Message : 12.975377 s : Average mflops/s per call (full)         : 247072
Grid : Message : 12.975379 s : Average mflops/s per call per rank (full): 247072
Grid : Message : 12.975381 s : Average mflops/s per call per node (full): 247072
Grid : Message : 12.975384 s : WilsonFermion5D Stencil
Grid : Message : 12.975386 s : WilsonFermion5D StencilEven
Grid : Message : 12.975391 s : WilsonFermion5D StencilOdd
Grid : Message : 12.975394 s :  Stencil calls 1001
Grid : Message : 12.975401 s :  Stencil halogtime 0.0559441
Grid : Message : 12.975405 s :  Stencil gathertime 0
Grid : Message : 12.975408 s :  Stencil gathermtime 0
Grid : Message : 12.975410 s :  Stencil mergetime 0.0549451
Grid : Message : 12.975414 s :  Stencil decompresstime 0.0659341
Grid : Message : 12.975416 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 12.975417 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 12.975422 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 12.975425 s :  timer0 (HaloGatherOpt) 1.13487
Grid : Message : 12.975428 s :  timer1 (Communicate)   0.0879121
Grid : Message : 12.975432 s :  timer2 (CommsMerge )   0.157842
Grid : Message : 12.975435 s :  timer3 (commsMergeShm) 0.280719
Grid : Message : 12.989486 s : r_e12547.9
Grid : Message : 12.994729 s : r_o6.10374
Grid : Message : 13.188000 s : res1578.28
Grid : Message : 13.610200 s : norm diff   0
Grid : Message : 13.116742 s : norm diff even  0
Grid : Message : 13.126443 s : norm diff odd   0

philomat commented 3 years ago

I just ran Test_wilson_clover on Summit and the test ran without any errors.

paboyle commented 3 years ago

Thanks.

My hypothesis that the --dslash-unroll flag might fix the performance issues is not correct then.

Glad to hear it re. Clover - it's a HIP / CUDA difference, and not general breakage of Clover.

More joy to look forward to....

If you were able to track down which accelerator_for / line of code fails with Clover, that would help.

philomat commented 3 years ago

The error occurs in the constructor of WilsonCloverFermion. To be more precise, it is happening in ImportGauge in WilsonCloverFermion.h at line 109.

paboyle commented 3 years ago

Could you either

A) Run it under a debugger (gdb), trap the fault, and ask it for a backtrace with "bt".

OR

B) Go to:

Grid/qcd/action/fermion/implementation/WilsonCloverFermionImplementation.h

1) uncomment the cout << statements, and 2) stick in a print statement at lines 73, 81, 91, 127, and 140 (a minimal sketch of this bracketing follows below, after the note on option A).

A) is not guaranteed to work because I don't know how the GPU runtime is operating, but it is significantly less effort if you know how to use a debugger, so it is what I would try first. I think there is a rocm-gdb or rocm-lldb tool.
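
A minimal sketch of the bracketing in option B, assuming Grid's accelerator_for / accelerator_barrier idiom; the loop bounds and body here are placeholders, not the actual clover code:

// Host-side prints bracket the suspect device loop; accelerator_barrier()
// forces the kernel to complete, so an asynchronous GPU fault surfaces
// between the two prints rather than at some later call.
std::cout << GridLogMessage << "ImportGauge: before suspect loop" << std::endl;
accelerator_for(ss, GaugeGrid->oSites(), Simd::Nsimd(), {
  // ... suspect body from WilsonCloverFermionImplementation.h ...
});
accelerator_barrier();
std::cout << GridLogMessage << "ImportGauge: after suspect loop" << std::endl;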

paboyle commented 3 years ago

Though on the AMD node I had access to, the ROCm debugger didn't work for me.

philomat commented 3 years ago

I tried both option A and B.

The output of bt was not very enlightening.

Approach B) tells me the error is between lines 81 and 91 in Grid/qcd/action/fermion/implementation/WilsonCloverFermionImplementation.h.

paboyle commented 3 years ago

That was enough to go on for me to eyeball at least one error.

paboyle commented 3 years ago

More later - I'll try and patch develop.

paboyle commented 3 years ago

Sorry - reviewed again and the code looks right. Darn it...

jdmaia commented 3 years ago

Hi guys,

I just saw this. I have been working on Grid for some weeks now, and it seems like the Wilson clover implementation exceeds the maximum limit of local memory per thread (128 KiB for now). That could explain the runtime error; more recent ROCm releases have a compile-time assertion against this, which makes the code fail to compile instead. A contrived illustration of the failure mode follows below.
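
A contrived HIP sketch (not Grid code) of that failure mode, assuming only that per-thread locals the compiler cannot keep in registers spill to scratch ("private") memory:

// ~160 KB of per-thread locals exceeds a 128 KiB per-lane scratch budget,
// so recent ROCm emits "stack size limit exceeded" at compile time for
// this kernel instead of faulting at run time.
#include <hip/hip_runtime.h>

__global__ void big_locals_kernel(double *out)
{
  double tmp[20000];                      // 20000 * 8 B = 160000 B of locals
  for (int i = 0; i < 20000; ++i)
    tmp[i] = 0.5 * i;
  // Dynamic indexing forces tmp into scratch rather than registers.
  out[threadIdx.x] = tmp[(threadIdx.x * 7) % 20000];
}

The diagnosis above is that the clover construction does the analogous thing through its large per-site temporaries; the cure is to shrink them or stage them through memory that does not count as private.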

philomat commented 3 years ago

Hi,

I just tried to compile Grid on a new AMD GPU (MI100) machine at JLab. Unfortunately, I get errors during compilation:

error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.
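
(For what it's worth, the two mangled symbols appear to demangle to Grid::LambdaApply instantiations of the device lambdas inside Grid::adj and Grid::outerProduct over the GpuComplex / HIP_vector_type tensor types, so it is those expression-template kernels that blow the limit.)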

My configure command: ../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-4.3.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-4.3.0/ -std=c++14"

Any ideas how to solve this?

chulwoo1 commented 3 years ago

For whatever it's worth, I'm seeing the same error on OLCF Spock, with ROCm 4.2.0.

Chulwoo

paboyle commented 3 years ago

Can you give the complete call tree that is failing?

philomat commented 3 years ago

Hi Peter,

This is the complete output of make:

[scior@qcdi2001 Grid]$ make
cp version-cache Version.h
make all-am
make[1]: Entering directory '/u/home/scior/Grid/build/Grid'
cp version-cache Version.h
  CXX      util/version.o
  CXX      qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o
error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.
make[1]: *** [Makefile:3468: qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o] Error 1
make[1]: Leaving directory '/u/home/scior/Grid/build/Grid'
make: *** [Makefile:2490: all] Error 2

philomat commented 3 years ago

Any progress on this issue? I pulled Grid a couple of days ago and still get the same error.

jdmaia commented 2 years ago

@philomat For now I'm avoiding this problem by conditionally compiling the problematic operators, which is enough to build the main benchmark binary (Benchmark_ITT); a hypothetical sketch of the guard is below. I still need to take a look at the code and see if we can reduce the amount of local data allocated per thread, or place it somewhere else, to avoid hitting this issue.
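
A hypothetical sketch of that workaround; GRID_HIP_DISABLE_CLOVER is an invented macro name (passed in via CXXFLAGS), not an actual Grid configure option:

// Guard the offending instantiation unit, e.g.
// qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.cc,
// so the rest of the library and Benchmark_ITT still build.
#ifndef GRID_HIP_DISABLE_CLOVER
#include <Grid/qcd/action/fermion/implementation/WilsonCloverFermionImplementation.h>
namespace Grid {
  template class WilsonCloverFermion<WilsonAdjImplD>; // the instantiation that overflows scratch
}
#endif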

paboyle commented 2 years ago

I've run on Spock and it is doing well on Benchmark_ITT and Benchmark_dwf_fp32.

Added the systems/Spock directory with compile and run scripts. Getting 1.3 TF/s on MI100.

paboyle commented 2 years ago

Also getting 4 TF/s on a whole Spock node (4x MI100).

james-simone commented 2 years ago

I'm also hitting the "stack frame size exceeds limit" error.

commit: HEAD detached at 135808dc
Ubuntu 20.04 container
rocm-5.0.0/clang/14.0.0

CXX=hipcc CXXFLAGS=" -std=c++14 -I/opt/rocm/rocthrust/include -I/usr/local/openmpi/include -I/usr/local/fftw/include -I/usr/local/hdf5/include -I/usr/local/scidac/include " LDFLAGS=" -L/opt/rocm/rocthrust/lib -L/usr/local/openmpi/lib -L/usr/local/fftw/lib -L/usr/local/hdf5/lib -L/usr/local/scidac/lib " LIBS="-lmpi" MPICXX=mpicxx /var/tmp/Grid/configure --prefix=/usr/local/grid --enable-accelerator=hip --enable-comms=mpi3-auto --enable-gen-simd-width=64 --enable-numa --enable-openmp --enable-simd=GPU --enable-unified=no

atamazov commented 1 year ago

There are some hardware-related limitations on the stack frame size on AMD GPUs. You need to reduce the usage of private memory in the kernels.

Note that gfx10 GPUs can use twice as much private memory as gfx9 because of the narrower wavefront size (32 vs 64).
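
A back-of-envelope check, assuming the fixed hardware budget enforced by the LLVM commit linked below is 8 MiB of scratch per wavefront (consistent with the 131072-byte threshold implied by the errors above):

// Per-lane private memory = per-wavefront scratch / wavefront width.
constexpr long waveScratchBytes = 8L * 1024 * 1024;   // assumed 8 MiB budget
constexpr long perLaneGfx9  = waveScratchBytes / 64;  // wave64: 131072 B = 128 KiB
constexpr long perLaneGfx10 = waveScratchBytes / 32;  // wave32: 262144 B = 256 KiB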

Details can be found here: https://github.com/llvm/llvm-project/commit/1ed4caff1d5cd49233c1ae7b9f6483a946ed5eea