Closed mark-petersen closed 1 year ago
Here are more lengthy notes. I can compile and run on the login node on my simple_yakl_tests branch with
cd cpp/build
source cmake_summit_gnu.sh
make
./simple_yakl_tests
and can also run on a Summit compute node with
bsub -W 2:00 -nnodes 1 -P CLI115 -Is /bin/bash
source ${MODULESHOME}/init/bash
module purge
module load DefApps gcc/9.3.0 cuda parallel-netcdf cmake
jsrun -n 1 -a 1 -c 1 -g 1 ./simple_yakl_tests
The output for these two cases is identical:
```
$ pwd
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/build
$ source cmake_summit_gnu.sh
Lmod is automatically replacing "xl/16.1.1-10" with "gcc/9.3.0".
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpic++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /sw/summit/cuda/11.0.3/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using double precision
-- The C compiler identification is GNU 9.3.0
-- The Fortran compiler identification is GNU 9.3.0
/lib/../lib64/crt1.o:(.rodata+0x8): undefined reference to `main'
/usr/bin/ld: link errors found, deleting executable `a.out'
collect2: error: ld returned 1 exit status
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpicc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpif90 - skipped
-- ** Building YAKL for a CUDA backend **
-- ** YAKL is using the following C++ flags: -DYAKL_ARCH_CUDA --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -std=c++17 -DHAVE_MPI -O3 --use_fast_math -arch sm_70 -ccbin mpic++ -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/parallel-netcdf-1.12.2-wr65dxzaz6topsdmlgzw2xyzn7w6uvs7/include **
-- Configuring done
-- Generating done
-- Build files have been written to: /gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/build
$ make
Scanning dependencies of target yakl
[  4%] Building Fortran object YAKL/CMakeFiles/yakl.dir/src/YAKL_gator_mod.F90.o
[  9%] Building CUDA object YAKL/CMakeFiles/yakl.dir/src/YAKL.cpp.o
[ 13%] Linking CUDA device code CMakeFiles/yakl.dir/cmake_device_link.o
[ 18%] Linking CUDA static library libyakl.a
[ 18%] Built target yakl
[ 22%] Building CUDA object CMakeFiles/simple_yakl_tests.dir/simple_yakl_tests.cpp.o
[ 27%] Linking CXX executable simple_yakl_tests
[ 27%] Built target simple_yakl_tests
[ 31%] Building CUDA object CMakeFiles/serial.dir/miniWeather_serial.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(375): warning: variable "left_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(376): warning: variable "right_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(464): warning: variable "ierr" was set but never used
[ 36%] Linking CXX executable serial
[ 36%] Built target serial
[ 40%] Building CUDA object CMakeFiles/serial_test.dir/miniWeather_serial.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(375): warning: variable "left_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(376): warning: variable "right_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(464): warning: variable "ierr" was set but never used
[ 45%] Linking CXX executable serial_test
[ 45%] Built target serial_test
[ 50%] Building CUDA object CMakeFiles/mpi.dir/miniWeather_mpi.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(381): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(520): warning: variable "ierr" was set but never used
[ 54%] Linking CXX executable mpi
[ 54%] Built target mpi
[ 59%] Building CUDA object CMakeFiles/mpi_test.dir/miniWeather_mpi.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(381): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(520): warning: variable "ierr" was set but never used
[ 63%] Linking CXX executable mpi_test
[ 63%] Built target mpi_test
[ 68%] Building CUDA object CMakeFiles/parallelfor.dir/miniWeather_mpi_parallelfor.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(379): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(518): warning: variable "ierr" was set but never used
[ 72%] Linking CXX executable parallelfor
[ 72%] Built target parallelfor
[ 77%] Building CUDA object CMakeFiles/parallelfor_test.dir/miniWeather_mpi_parallelfor.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(379): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(518): warning: variable "ierr" was set but never used
[ 81%] Linking CXX executable parallelfor_test
[ 81%] Built target parallelfor_test
[ 86%] Building CUDA object CMakeFiles/parallelfor_simd_x.dir/miniWeather_mpi_parallelfor_simd_x.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(406): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(528): warning: variable "ierr" was set but never used
[ 90%] Linking CXX executable parallelfor_simd_x
[ 90%] Built target parallelfor_simd_x
[ 95%] Building CUDA object CMakeFiles/parallelfor_simd_x_test.dir/miniWeather_mpi_parallelfor_simd_x.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(406): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(528): warning: variable "ierr" was set but never used
[100%] Linking CXX executable parallelfor_simd_x_test
[100%] Built target parallelfor_simd_x_test
$ cd ../
$ ./simple_yakl_tests
-bash: ./simple_yakl_tests: No such file or directory
$ cd build
$ ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
3.12843e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```
```
$ jsrun -n 1 -a 1 -c 1 -g 1 ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
4.65907e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```
But when running on the compute node with the CPU only, the `parallel_for` either initializes the GPU array to zero or doesn't touch it at all:
```
$ jsrun -n 1 -a 1 -c 1 ./simple_yakl_tests
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 0 0 0 0 0 1.73835e-310 3.95253e-322 3.95253e-322 0 0 2.122e-314 0 0 3.50253e-315 3.50253e-315 3.47668e-310 -nan 3.21143e-322 3.50222e-315 3.50224e-315 2.76677e-322 3.39519e-313 4.94066e-324 0 1.10671e-321
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.73835e-310 3.50224e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.00392e-321 3.16202e-322 3.50216e-315 0 2.122e-314 8.48798e-314 -nan 3.50198e-315 1.10671e-321
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.73835e-310 3.50224e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.00392e-321 3.16202e-322 3.50216e-315 0 2.122e-314 8.48798e-314 -nan 3.50198e-315 1.10671e-321
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
su:build$ ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.74871e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```
This now works on Perlmutter. Starting on the simple_yakl_tests branch, here are my commands.
For CPU
cd cpp/build
source cmake_perlmutter_gnu_cpu.sh # note cpu at end
make
salloc --nodes 1 --qos interactive --time 01:00:00 --account=e3sm --constraint cpu # note cpu here
source ${MODULESHOME}/init/bash
module purge
module load PrgEnv-gnu gcc/11.2.0 cray-mpich cray-parallel-netcdf cmake
./simple_yakl_tests # or
srun -n 1 ./simple_yakl_tests
For GPU
cd cpp/build
source cmake_perlmutter_gnu_gpu.sh # note gpu at end
make
salloc --nodes 1 --qos interactive --time 01:00:00 --account=e3sm --constraint gpu # note gpu here
source ${MODULESHOME}/init/bash
module purge
module load PrgEnv-gnu gcc/11.2.0 cray-mpich cray-parallel-netcdf cmake
srun -n 1 -G4 ./simple_yakl_tests # -G4 runs on 4 gpus.
On Perlmutter, the output looks right for both CPU and GPU, but my question still remains: how do I know that computations are actually running on the GPU?
```
$ srun -n 1 -G4 ./simple_yakl_tests
NVIDIA A100-SXM4-40GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 2.70568e-317 4.24399e-314 1.2733e-313 5.43231e-312 5.43264e-312 1.10715e-310 1.10724e-310 1.10724e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10729e-310 1.10729e-310 1.10729e-310 1.10729e-310 1.10729e-310 6.95276e-310 6.95276e-310 6.95276e-310 6.95276e-310 -nan 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```
After speaking with @normanmr I understand this now. In all cases, I am running on the GPUs. I was confused because the front-end nodes have GPUs available, so I'm actually running on the GPUs there too. If I try to run a YAKL `parallel_for` function on a node without GPUs, it dies with an error.
To monitor the GPU computations, use the NVIDIA profiler `nvprof`. It can be run as a prefix command on the compute nodes:

jsrun -n 1 -a 1 -c 1 -g 1 nvprof ./simple_yakl_tests   # on Summit
srun -n 1 -G4 nsys nvprof ./simple_yakl_tests          # on Perlmutter

Note that Perlmutter required `nsys` to execute `nvprof`.
```
$ salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --account=e3sm
pm:build$ source ${MODULESHOME}/init/bash
pm:build$ module purge
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.02/restore_lmod_system_defaults.[csh|sh]'.
pm:build$ module load PrgEnv-nvidia cray-mpich cray-parallel-netcdf cmake cudatoolkit
pm:build$ srun -n 1 -G4 nsys nvprof ./simple_yakl_tests
WARNING: simple_yakl_tests and any of its children processes will be profiled.
NVIDIA A100-SXM4-40GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 2.78865e-317 4.24399e-314 1.2733e-313 5.43231e-312 5.43264e-312 1.39228e-311 1.39235e-311 1.1175e-310 1.11751e-310 1.11751e-310 1.11752e-310 1.11756e-310 1.11757e-310 1.11757e-310 1.11759e-310 1.11759e-310 1.11759e-310 1.11759e-310 1.11759e-310 6.95305e-310 6.95305e-310 6.95305e-310 6.95305e-310 -nan 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
Generating '/tmp/nsys-report-ffd4.qdstrm'
[1/7] [========================100%] report7.nsys-rep
[2/7] [========================100%] report7.sqlite
[3/7] Executing 'nvtxsum' stats report
NVTX Range Statistics:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- --------- --------- -------- -------- ----------- ------- ------------
100.0 475,969 1 475,969.0 475,969.0 475,969 475,969 0.0 PushPop init array 1
[4/7] Executing 'cudaapisum' stats report
CUDA API Statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ----------- ----------------------
99.9 290,001,864 1 290,001,864.0 290,001,864.0 290,001,864 290,001,864 0.0 cudaMalloc
0.0 66,289 4 16,572.3 12,349.0 4,740 36,851 14,011.5 cudaMemcpyAsync
0.0 49,456 1 49,456.0 49,456.0 49,456 49,456 0.0 cudaLaunchKernel
0.0 34,948 7 4,992.6 2,545.0 852 15,791 5,186.4 cudaDeviceSynchronize
0.0 942 1 942.0 942.0 942 942 0.0 cuModuleGetLoadingMode
0.0 341 1 341.0 341.0 341 341 0.0 cudaFree
[5/7] Executing 'gpukernsum' stats report
CUDA Kernel Statistics:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
100.0 4,288 1 4,288.0 4,288.0 4,288 4,288 0.0 void yakl::c::cudaKernelVal
```
Another handy command is `nvidia-smi` (the NVIDIA System Management Interface), which shows information about your hardware. Output:
```
pm:build$ nvidia-smi
Mon Mar 20 07:45:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:03:00.0 Off | 0 |
| N/A 28C P0 52W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 26C P0 49W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:82:00.0 Off | 0 |
| N/A 28C P0 53W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 27C P0 53W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
Thanks @normanmr for the help. That answered my question.
When running with YAKL arrays, how do I know that something is actually being computed on the GPU?
I wrote the simplest possible code in order to:
- `deep_copy` array from GPU to CPU (link to code)
- `deep_copy` array from CPU back to GPU (link to code)

I added print (`std::cout`) statements after each of these for both the CPU and GPU arrays. I added it to the `cpp/CMakeLists.txt` file so it is included with the make command.

After compiling, I can simply run on a Summit login node with `./simple_yakl_tests`, and the GPU array initializes correctly. That concerns me, because the `parallel_for` lines intended for the GPU must just be defaulting to the CPU in this case. When I run on a compute node I get the identical output. I assume this is actually running the `parallel_for` on the GPU, but I really have no idea, because the output is identical to the test on the login node.

BTW, when I run on the compute node without a GPU, the `parallel_for` initializes the GPU array to zeros. So there is at least some difference.

Overall, I'm trying to develop some experience of what exactly is happening on the GPU, and I don't think I have the proper tools. Are there some other functions or output to help with this? Thanks.