mrnorman / miniWeather

A parallel programming training mini app simulating weather-like flows

Question on simple test running on GPUs. #12

Closed. mark-petersen closed this issue 1 year ago.

mark-petersen commented 1 year ago

When running with YAKL arrays, how do I know that something is actually being computed on the GPU?

I wrote the simplest possible code in order to:

  1. initialize a YAKL array on the GPUs with parallel_for (link to code)
  2. deep_copy array from GPU to CPU (link to code)
  3. alter cpu version of array (link to code)
  4. deep_copy array from CPU back to GPU (link to code)

I added print (std::cout) statements after each of these steps for both the cpu and gpu arrays, and I added the test to the cpp/CMakeLists.txt file so it is built with the make command. A minimal sketch of the flow is shown below.
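To make the steps concrete, here is a rough sketch of that flow (this is not the actual simple_yakl_tests source; the array name, sizes, and values are illustrative, and it assumes YAKL's C-style parallel_for/Bounds API with the usual createHostCopy/deep_copy_to calls):

```
#include <iostream>
#include "YAKL.h"

int main() {
  yakl::init();
  {
    using yakl::c::parallel_for;
    using yakl::c::Bounds;

    // 1. Initialize a device (GPU) array inside a parallel_for
    yakl::Array<double,3,yakl::memDevice,yakl::styleC> workArray("workArray",3,3,3);
    parallel_for( Bounds<3>(3,3,3) , YAKL_LAMBDA (int k, int j, int i) {
      workArray(k,j,i) = 100*k + 10*j + i;
    });
    yakl::fence();

    // 2. deep_copy the device array to a host (CPU) copy
    auto workArray_cpu = workArray.createHostCopy();

    // 3. Alter the host copy on the CPU
    for (size_t idx = 0; idx < workArray_cpu.totElems(); idx++) {
      workArray_cpu.data()[idx] += 0.4;
    }

    // 4. deep_copy the host copy back to the device array
    workArray_cpu.deep_copy_to(workArray);
    yakl::fence();

    // Print the host copy (the actual test also prints the device array)
    std::cout << "workArray_cpu after copy back\n" << workArray_cpu << std::endl;
  }
  yakl::finalize();
  return 0;
}
```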

After compiling, I can simply run ./simple_yakl_tests on a summit login node, and the gpu array initializes correctly. That concerns me, because the parallel_for lines intended for the gpu must just be defaulting to the cpu in that case. When I run on a compute node with

jsrun -n 1 -a 1 -c 1 -g 1 ./simple_yakl_tests

I get identical output. I assume this is actually running the parallel_for on the gpu, but I have no way to tell, because the output is identical to the login-node test.

BTW, when I run on the compute node without a gpu:

jsrun -n 1 -a 1 -c 1 ./simple_yakl_tests

the parallel_for initializes the gpu array to zeros. So there is at least some difference.

Overall, I'm trying to develop some intuition for what exactly is happening on the GPU, and I don't think I have the proper tools. Are there other functions or outputs that would help with this? Thanks.

mark-petersen commented 1 year ago

Here are some lengthier notes. I can compile and run on the login node on my simple_yakl_tests branch with

cd cpp/build
source cmake_summit_gnu.sh
make
./simple_yakl_tests

and can also run on summit compute with

bsub -W 2:00 -nnodes 1 -P CLI115 -Is /bin/bash

source ${MODULESHOME}/init/bash
module purge
module load DefApps gcc/9.3.0 cuda parallel-netcdf cmake

jsrun -n 1 -a 1 -c 1 -g 1 ./simple_yakl_tests

The output for these two cases is identical:

Login node compile and run:

```
$ pwd
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/build
$ source cmake_summit_gnu.sh
Lmod is automatically replacing "xl/16.1.1-10" with "gcc/9.3.0".
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpic++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /sw/summit/cuda/11.0.3/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using double precision
-- The C compiler identification is GNU 9.3.0
-- The Fortran compiler identification is GNU 9.3.0
/lib/../lib64/crt1.o:(.rodata+0x8): undefined reference to `main'
/usr/bin/ld: link errors found, deleting executable `a.out'
collect2: error: ld returned 1 exit status
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpicc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/bin/mpif90 - skipped
-- ** Building YAKL for a CUDA backend **
-- ** YAKL is using the following C++ flags: -DYAKL_ARCH_CUDA --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -std=c++17 -DHAVE_MPI -O3 --use_fast_math -arch sm_70 -ccbin mpic++ -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/parallel-netcdf-1.12.2-wr65dxzaz6topsdmlgzw2xyzn7w6uvs7/include **
-- Configuring done
-- Generating done
-- Build files have been written to: /gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/build
$ make
Scanning dependencies of target yakl
[ 4%] Building Fortran object YAKL/CMakeFiles/yakl.dir/src/YAKL_gator_mod.F90.o
[ 9%] Building CUDA object YAKL/CMakeFiles/yakl.dir/src/YAKL.cpp.o
[ 13%] Linking CUDA device code CMakeFiles/yakl.dir/cmake_device_link.o
[ 18%] Linking CUDA static library libyakl.a
[ 18%] Built target yakl
[ 22%] Building CUDA object CMakeFiles/simple_yakl_tests.dir/simple_yakl_tests.cpp.o
[ 27%] Linking CXX executable simple_yakl_tests
[ 27%] Built target simple_yakl_tests
[ 31%] Building CUDA object CMakeFiles/serial.dir/miniWeather_serial.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(375): warning: variable "left_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(376): warning: variable "right_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(464): warning: variable "ierr" was set but never used
[ 36%] Linking CXX executable serial
[ 36%] Built target serial
[ 40%] Building CUDA object CMakeFiles/serial_test.dir/miniWeather_serial.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(375): warning: variable "left_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(376): warning: variable "right_rank" was declared but never referenced
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_serial.cpp(464): warning: variable "ierr" was set but never used
[ 45%] Linking CXX executable serial_test
[ 45%] Built target serial_test
[ 50%] Building CUDA object CMakeFiles/mpi.dir/miniWeather_mpi.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(381): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(520): warning: variable "ierr" was set but never used
[ 54%] Linking CXX executable mpi
[ 54%] Built target mpi
[ 59%] Building CUDA object CMakeFiles/mpi_test.dir/miniWeather_mpi.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(381): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi.cpp(520): warning: variable "ierr" was set but never used
[ 63%] Linking CXX executable mpi_test
[ 63%] Built target mpi_test
[ 68%] Building CUDA object CMakeFiles/parallelfor.dir/miniWeather_mpi_parallelfor.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(379): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(518): warning: variable "ierr" was set but never used
[ 72%] Linking CXX executable parallelfor
[ 72%] Built target parallelfor
[ 77%] Building CUDA object CMakeFiles/parallelfor_test.dir/miniWeather_mpi_parallelfor.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(379): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor.cpp(518): warning: variable "ierr" was set but never used
[ 81%] Linking CXX executable parallelfor_test
[ 81%] Built target parallelfor_test
[ 86%] Building CUDA object CMakeFiles/parallelfor_simd_x.dir/miniWeather_mpi_parallelfor_simd_x.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(406): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(528): warning: variable "ierr" was set but never used
[ 90%] Linking CXX executable parallelfor_simd_x
[ 90%] Built target parallelfor_simd_x
[ 95%] Building CUDA object CMakeFiles/parallelfor_simd_x_test.dir/miniWeather_mpi_parallelfor_simd_x.cpp.o
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(406): warning: variable "ierr" was set but never used
/gpfs/alpine/cli115/scratch/mpetersen/repos/miniWeather/simple_yakl_tests/cpp/miniWeather_mpi_parallelfor_simd_x.cpp(528): warning: variable "ierr" was set but never used
[100%] Linking CXX executable parallelfor_simd_x_test
[100%] Built target parallelfor_simd_x_test
$ cd ../
$ ./simple_yakl_tests
-bash: ./simple_yakl_tests: No such file or directory
$ cd build
$ ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
3.12843e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```

Compute node run on GPU (with -g 1):

```
$ jsrun -n 1 -a 1 -c 1 -g 1 ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
4.65907e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```

But when running on the compute node with the CPU only, the parallel_for initializes the gpu array to zeros (or just doesn't touch it at all):

Compute node run on CPU (without -g 1):

```
$ jsrun -n 1 -a 1 -c 1 ./simple_yakl_tests
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 0 0 0 0 0 1.73835e-310 3.95253e-322 3.95253e-322 0 0 2.122e-314 0 0 3.50253e-315 3.50253e-315 3.47668e-310 -nan 3.21143e-322 3.50222e-315 3.50224e-315 2.76677e-322 3.39519e-313 4.94066e-324 0 1.10671e-321
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.73835e-310 3.50224e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.00392e-321 3.16202e-322 3.50216e-315 0 2.122e-314 8.48798e-314 -nan 3.50198e-315 1.10671e-321
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.73835e-310 3.50224e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.00392e-321 3.16202e-322 3.50216e-315 0 2.122e-314 8.48798e-314 -nan 3.50198e-315 1.10671e-321
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
su:build$ ./simple_yakl_tests
Tesla V100-SXM2-16GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
1.74871e-315 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```

mark-petersen commented 1 year ago

This now works on perlmutter. Starting on the simple_yakl_tests branch, here are my commands.

For CPU

cd cpp/build
source cmake_perlmutter_gnu_cpu.sh # note cpu at end
make

salloc --nodes 1 --qos interactive --time 01:00:00 --account=e3sm --constraint cpu # note cpu here

source ${MODULESHOME}/init/bash
module purge
module load PrgEnv-gnu gcc/11.2.0 cray-mpich cray-parallel-netcdf cmake

./simple_yakl_tests # or
srun -n 1 ./simple_yakl_tests

For GPU

cd cpp/build
source cmake_perlmutter_gnu_gpu.sh # note gpu at end
make

salloc --nodes 1 --qos interactive --time 01:00:00 --account=e3sm --constraint gpu # note gpu here

source ${MODULESHOME}/init/bash
module purge
module load PrgEnv-gnu gcc/11.2.0 cray-mpich cray-parallel-netcdf cmake

srun -n 1 -G4 ./simple_yakl_tests # -G4 runs on 4 gpus.

mark-petersen commented 1 year ago

On perlmutter, the output looks right for both the cpu and gpu builds, but my question remains: how do I know that computations are actually running on the gpu?

Output with GPUs:

```
$ srun -n 1 -G4 ./simple_yakl_tests
NVIDIA A100-SXM4-40GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 2.70568e-317 4.24399e-314 1.2733e-313 5.43231e-312 5.43264e-312 1.10715e-310 1.10724e-310 1.10724e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10726e-310 1.10729e-310 1.10729e-310 1.10729e-310 1.10729e-310 1.10729e-310 6.95276e-310 6.95276e-310 6.95276e-310 6.95276e-310 -nan 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
```

mark-petersen commented 1 year ago

After speaking with @normanmr, I understand this now. In all of the cases above, I was running on the GPUs. I was confused because the front-end (login) nodes also have GPUs available, so I was actually running on a GPU there too. If I try to run a YAKL parallel_for on a node without GPUs, it dies with an error.
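A related in-code sanity check is to ask the CUDA runtime where a YAKL memDevice array's data actually lives. This is just a sketch using the plain CUDA runtime API (check_device_pointer is an illustrative helper, not part of simple_yakl_tests):

```
#include <cstdio>
#include <cuda_runtime.h>

// Report whether a pointer refers to GPU-accessible (device/managed) memory or plain host memory.
// Intended usage on a YAKL device array: check_device_pointer( workArray.data() );
void check_device_pointer(void const *ptr) {
  cudaPointerAttributes attr;
  cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
  if (err == cudaSuccess &&
      (attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged)) {
    printf("%p is device/managed memory on device %d\n", ptr, attr.device);
  } else {
    printf("%p looks like ordinary host memory (or the query failed)\n", ptr);
  }
}
```

With the CUDA backend, YAKL's pool should hand out device (or managed) memory, so this reports GPU memory for the work array; in a CPU-only build the same data would be plain host memory.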

To monitor the GPU computations, use the NVIDIA profiler nvprof. It can be run as a prefix command on the compute nodes as follows:

jsrun -n 1 -a 1 -c 1 -g 1 nvprof ./simple_yakl_tests  # on summit
srun -n 1 -G4 nsys nvprof ./simple_yakl_tests  # on perlmutter

Note that on perlmutter, nvprof has to be invoked through nsys.

The output from nvprof is as follows:

```
$ salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --account=e3sm
pm:build$ source ${MODULESHOME}/init/bash
pm:build$ module purge
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.02/restore_lmod_system_defaults.[csh|sh]'.
pm:build$ module load PrgEnv-nvidia cray-mpich cray-parallel-netcdf cmake cudatoolkit
pm:build$ srun -n 1 -G4 nsys nvprof ./simple_yakl_tests
WARNING: simple_yakl_tests and any of its children processes will be profiled.
NVIDIA A100-SXM4-40GB
workArray before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray_cpu before copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 0 2.78865e-317 4.24399e-314 1.2733e-313 5.43231e-312 5.43264e-312 1.39228e-311 1.39235e-311 1.1175e-310 1.11751e-310 1.11751e-310 1.11752e-310 1.11756e-310 1.11757e-310 1.11757e-310 1.11759e-310 1.11759e-310 1.11759e-310 1.11759e-310 1.11759e-310 6.95305e-310 6.95305e-310 6.95305e-310 6.95305e-310 -nan 0
workArray_cpu after copy
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0 1 2 10 11 12 20 21 22 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
workArray after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
workArray_cpu after copy back
For Array labeled: "Unlabeled: YAKL_DEBUG CPP macro not defined"
Number of Dimensions: 3
Total Number of Elements: 27
Dimension Sizes: 3, 3, 3,
0.4 1.4 2.4 10.4 11.4 12.4 20.4 21.4 22.4 100.4 101.4 102.4 110.4 111.4 112.4 120.4 121.4 122.4 200.4 201.4 202.4 210.4 211.4 212.4 220.4 221.4 222.4
Pool Memory High Water Mark: 256
Pool Memory High Water Efficiency: 2.38419e-07
Generating '/tmp/nsys-report-ffd4.qdstrm'
[1/7] [========================100%] report7.nsys-rep
[2/7] [========================100%] report7.sqlite
[3/7] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Style    Range
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------  ------------
    100.0          475,969          1  475,969.0  475,969.0   475,969   475,969          0.0  PushPop  init array 1

[4/7] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)       Med (ns)       Min (ns)     Max (ns)     StdDev (ns)  Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
     99.9      290,001,864          1  290,001,864.0  290,001,864.0  290,001,864  290,001,864          0.0  cudaMalloc
      0.0           66,289          4       16,572.3       12,349.0        4,740       36,851     14,011.5  cudaMemcpyAsync
      0.0           49,456          1       49,456.0       49,456.0       49,456       49,456          0.0  cudaLaunchKernel
      0.0           34,948          7        4,992.6        2,545.0          852       15,791      5,186.4  cudaDeviceSynchronize
      0.0              942          1          942.0          942.0          942          942          0.0  cuModuleGetLoadingMode
      0.0              341          1          341.0          341.0          341          341          0.0  cudaFree

[5/7] Executing 'gpukernsum' stats report

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------
    100.0            4,288          1   4,288.0   4,288.0     4,288     4,288          0.0  void yakl::c::cudaKernelVal
```
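The 'gpukernsum' section is the direct evidence that the parallel_for ran on the GPU: an actual CUDA kernel (void yakl::c::cudaKernelVal...) was launched, and the 'cudaapisum' section shows the matching cudaLaunchKernel, cudaMemcpyAsync, and cudaMalloc calls.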

Another handy command is nvidia-smi (the NVIDIA System Management Interface), which shows information about the GPU hardware.

output:

```
pm:build$ nvidia-smi
Mon Mar 20 07:45:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   26C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
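Note that this snapshot was taken while nothing was running, hence the 0% GPU-Util and "No running processes found". Running nvidia-smi again (e.g. from a second shell on the compute node) while simple_yakl_tests is executing should show the process and its memory usage in the Processes table.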

mark-petersen commented 1 year ago

Thanks @normanmr for the help. That answered my question.