nicejunjie / scilib-accel

automatic GPU offload for scientific libraries

scilib-accel on Cray #2

Open NachoXmex opened 3 weeks ago

NachoXmex commented 3 weeks ago

An example of how to use this library would help a lot. I want to test a non-GPU, BLAS-intensive application for which I have a job file (job) for the Slurm queue system: sbatch job

job file contains:

```
#!/bin/bash -l
#SBATCH ...

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1

export LD_PRELOAD=/scilib-accel/scilib-dbi.so
export SCILIB_DEBUG=1

ulimit -s unlimited

srun --cpu-bind=socket ./application

unset LD_PRELOAD
unset SCILIB_DEBUG
```


I get the error: srun: error while loading shared libraries: libacchost.so: cannot open shared object file: No such file or directory

What am I doing wrong?

NachoXmex commented 3 weeks ago

Apparently it is an error related to the environment. If I submit the job file from the same environment I used to build scilib-accel, that error goes away. But now I get this error:


```
CUBLAS error: the library was not initialized in blas/NVIDIA/dgemm.c:96
dgemm_ time: total= 0.003476, compute= 0.003371, other= 0.000104
dsyrk_ time: total= 0.000253, compute= 0.000227, other= 0.000026
srun: error: nid006631: task 0: Exited with exit code 1
srun: Terminating StepId=514051.0
slurmstepd: error: STEP 514051.0 ON nid006631 CANCELLED AT 2024-09-09T11:14:35
srun: error: nid006631: tasks 1-15,18-31: Terminated
srun: error: nid006631: task 16: Terminated
srun: error: nid006631: task 17: Terminated
srun: Force Terminated StepId=514051.0
```


nicejunjie commented 3 weeks ago

Hi Luis,
What is your test environment? What hardware do you have?

NachoXmex commented 3 weeks ago

Hardware: one node with four NVIDIA Grace-Hopper modules, each with a 72-core ARM Grace CPU, 128 GB LPDDR5X RAM, and an H100 GPU with 96 GB HBM3 memory. My environment is:


printenv LD_LIBRARY_PATH=/user-environment/env/icon/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/opt/cray/libfabric/1.15.2.0/lib:/user-environment/env/icon/lib HOSTTYPE=aarch64 LC_MEASUREMENT=de_CH.UTF-8 BOOST_ROOT=/user-environment/env/icon SSH_CONNECTION=XXX LESSCLOSE=lessclose.sh %s %s LC_PAPER=de_CH.UTF-8 LC_MONETARY=de_CH.UTF-8 XKEYSYMDB=/usr/X11R6/lib/X11/XKeysymDB LANG=en_US.UTF-8 LMOD_SYSTEM_NAME=XXX WINDOWMANAGER=/usr/bin/gnome LESS=-M -I -R JAVA_ROOT=/usr/lib64/jvm/java-11-openjdk-11 HOSTNAME=XXX UENV_MOUNT_FILE=/capstor/scratch/cscs/$USER/.uenv-images/images/37652229b57b5ac32cda84de1c25b7d71811874a233c80242a0b8e4cfb1b201b/store.squashfs APPS=/capstor/apps/cscs/XXX CSHEDIT=emacs GPG_TTY=/dev/pts/20 LESS_ADVANCED_PREPROCESSOR=no COLORTERM=1 UENV_VERSION=5.1.0 JAVA_HOME=/usr/lib64/jvm/java-11-openjdk-11 MACHTYPE=aarch64-suse-linux MINICOM=-c on OSTYPE=linux HDF5_PLUGIN_PATH=/user-environment/env/icon/plugins MPIF90=/user-environment/env/icon/bin/mpif90 LC_NAME=de_CH.UTF-8 XDG_SESSION_ID=47748 USER=XXX PAGER=less UENV_PREFIX=/opt/cscs/uenv MPICXX=/user-environment/env/icon/bin/mpic++ __LMOD_REF_COUNT_MODULEPATH=/etc/cscs-modules:1 MORE=-sl PWD=/users/XXX HOME=/users/XXX PELOCAL_PRGENV=1 CMAKE_PREFIX_PATH=/user-environment/env/icon:/opt/cray/xpmem/2.8.2-1.0_3.7g84a27a5.shasta:/opt/cray/libfabric/1.15.2.0 LMOD_COLORIZE=no HOST=todi-ln001 SSH_CLIENT=XXX LMOD_VERSION=8.7.31 CUDA_HOME=/user-environment/env/icon XNLSPATH=/usr/X11R6/lib/X11/nls LMOD_SETTARG_CMD=: XDG_SESSION_TYPE=tty SDK_HOME=/usr/lib64/jvm/java-11-openjdk-11 BASH_ENV=/opt/cray/pe/lmod/lmod/init/bash XDG_DATA_DIRS=/usr/share UENV_MOUNT_POINT=/user-environment PROJECT=/project/XXX NVHPC_CUDA_HOME=/user-environment/env/icon LIBGL_DEBUG=quiet JDK_HOME=/usr/lib64/jvm/java-11-openjdk-11 LC_ADDRESS=de_CH.UTF-8 PROFILEREAD=true LC_NUMERIC=de_CH.UTF-8 LMOD_sys=Linux ModuleTable001=XXX LUSTRE_JOB_ID=XXX UENV_IMG_CMD=env -u PYTHONPATH -u VIRTUAL_ENV /opt/cscs/uenv/libexec/uenv-image SCRATCH=/capstor/scratch/cscs/XXX MPICC=/user-environment/env/icon/bin/mpicc LMOD_ROOT=/opt/cray/pe/lmod SSH_TTY=/dev/pts/20 MAIL=/var/spool/mail/XXX UENV_VIEW=/user-environment:icon-wcp:icon LESSKEY=/etc/lesskey.bin TERM=xterm-256color SHELL=/usr/local/bin/bash XDG_SESSION_CLASS=user _ModuleTableSz=1 UENV_WRAPPER_CMD=/opt/cscs/uenv/libexec/uenv-wrapper SHLVL=2 G_FILENAME_ENCODING=@locale,UTF-8,ISO-8859-15,CP1252 MPIF77=/user-environment/env/icon/bin/mpif77 ACLOCAL_PATH=/user-environment/env/icon/share/aclocal:/usr/share/aclocal MANPATH=/user-environment/env/icon/share/man:/user-environment/env/icon/man:/usr/share/man:/usr/man:/opt/cray/libfabric/1.15.2.0/share/man:/opt/cray/pe/lmod/lmod/share/man:/usr/local/man:/usr/local/share/man:/usr/share/man:/usr/man LC_TELEPHONE=de_CH.UTF-8 LMOD_PREPEND_BLOCK=normal MODULEPATH=/etc/cscs-modules LOGNAME=XXX DBUS_SESSION_BUS_ADDRESS=XXX XDG_RUNTIME_DIR=/run/user/XXX MODULEPATH_ROOT=/opt/cray/pe/modulefiles JRE_HOME=/usr/lib64/jvm/java-11-openjdk-11 UENV_MOUNT_LIST=/capstor/scratch/cscs/XXX/.uenv-images/images/37652229b57b5ac32cda84de1c25b7d71811874a233c80242a0b8e4cfb1b201b/store.squashfs:/user-environment XDG_CONFIG_DIRS=/etc/xdg 
PATH=/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/gcc-12.3.0-yfdpfoi7qo4e7ub4l4isthtcfevf4zee/bin:/user-environment/linux-sles15-neoverse_v2/gcc-12.3.0/nvhpc-24.3-ti5vnjw2lq7oydromjw6bvnb7aliu6qa/Linux_aarch64/24.3/compilers/bin:/user-environment/env/icon/bin:/user-environment/env/icon/libexec/osu-micro-benchmarks/mpi/collective:/user-environment/env/icon/libexec/osu-micro-benchmarks/mpi/one-sided:/user-environment/env/icon/libexec/osu-micro-benchmarks/mpi/pt2pt:/user-environment/env/icon/libexec/osu-micro-benchmarks/mpi/startup:/opt/cray/libfabric/1.15.2.0/bin:/usr/bin:/bin:/users/$USER/bin:/usr/local/bin:/usr/bin:/bin:/usr/lpp/mmfs/bin:/usr/lib/mit/bin JAVA_BINDIR=/usr/lib64/jvm/java-11-openjdk-11/bin LC_IDENTIFICATION=de_CH.UTF-8 PS1=[[\e[31m]todi[\e[m]][\u@\h \W]$ MODULESHOME=/opt/cray/pe/lmod/lmod LMOD_SETTARG_FULL_SUPPORT=no PKG_CONFIG_PATH=/user-environment/env/icon/lib64/pkgconfig:/user-environment/env/icon/lib/pkgconfig:/usr/share/pkgconfig:/usr/lib64/pkgconfig:/usr/lib/pkgconfig:/opt/cray/xpmem/2.8.2-1.0_3.7g84a27a5.shasta/lib64/pkgconfig:/opt/cray/libfabric/1.15.2.0/lib64/pkgconfig:/user-environment/env/icon/share/pkgconfig G_BROKEN_FILENAMES=1 HISTSIZE=1000 LMOD_PKG=/opt/cray/pe/lmod/lmod CLUSTER_NAME=todi CPU=aarch64 SSH_SENDS_LOCALE=yes UENV_CMD=env -u PYTHONPATH -u VIRTUAL_ENV /opt/cscs/uenv/libexec/uenv-impl LMOD_CMD=/opt/cray/pe/lmod/lmod/libexec/lmod LESSOPEN=lessopen.sh %s LMOD_FULL_SETTARG_SUPPORT=no LMOD_DIR=/opt/cray/pe/lmod/lmod/libexec LC_TIME=de_CH.UTF-8


nicejunjie commented 3 weeks ago

Hi Luis,

Thanks for the details. I don't see anything wrong, and you are using the lib correctly. cuBLAS gets initialized in scilib_nvidia_init(), which is called from the Linux ELF init array, so something is mysteriously disturbing that init array. Let's try two things to better understand it:

1) scilib-accel has a dgemm test code; you should see a test_dgemm.x file in the scilib-accel directory. Try running that simple case with: echo 1000 1000 1000 10 | LD_PRELOAD=./scilib-dbi.so ./test_dgemm.x, and also try the DLSYM version: echo 1000 1000 1000 10 | LD_PRELOAD=./scilib-dl.so ./test_dgemm.x. "1000 1000 1000" is the matrix input for dgemm, and 10 means 10 iterations.

2) There is an error check for the cuBLAS init in nvidia.c; apparently it is not working. Can you add a print statement in the scilib_nvidia_init() routine to check whether that function gets executed at all? (A sketch of that kind of check follows below.)
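For illustration only, here is a minimal sketch of the kind of check meant in (2). It assumes the init routine is registered in the ELF .init_array via a constructor attribute and creates the cuBLAS handle with cublasCreate(); the handle name is hypothetical, and this is not the actual scilib-accel source:

```c
/* Illustrative sketch only, not the actual scilib-accel code.
 * A constructor-attributed function lands in the ELF .init_array and runs
 * when the preloaded library is loaded, before main(). */
#include <stdio.h>
#include <cublas_v2.h>

static cublasHandle_t scilib_cublas_handle;  /* hypothetical handle name */

__attribute__((constructor))
static void scilib_nvidia_init(void)
{
    fprintf(stderr, "scilib_nvidia_init is being executed\n");

    cublasStatus_t st = cublasCreate(&scilib_cublas_handle);
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasCreate failed with status %d\n", (int)st);
}
```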

It is great that you have a Cray! I have been hoping to test a GH node on a Cray but don't have access.

Best Junjie

NachoXmex commented 2 weeks ago

Here are the outputs for point 1:


```
Run with -f to if you'd like to enter the full input arguments.
Enter m, n, k niter:
dgemm runtime(s): 0.105323
dgemm runtime(s): 0.004281
dgemm runtime(s): 0.004260
dgemm runtime(s): 0.004267
dgemm runtime(s): 0.004340
dgemm runtime(s): 0.004295
dgemm runtime(s): 0.004335
dgemm runtime(s): 0.004273
dgemm runtime(s): 0.004272
dgemm runtime(s): 0.004250
Min dgemm runtime (s): 0.004250
Avg dgemm runtime (s): 0.004286 (excl. 1st run)
Result check 1000.00 1000
```


DLSYM version:


```
Run with -f to if you'd like to enter the full input arguments.
Enter m, n, k niter:
dgemm runtime(s): 0.020958
dgemm runtime(s): 0.004268
dgemm runtime(s): 0.004253
dgemm runtime(s): 0.004256
dgemm runtime(s): 0.004247
dgemm runtime(s): 0.004257
dgemm runtime(s): 0.004252
dgemm runtime(s): 0.004238
dgemm runtime(s): 0.004273
dgemm runtime(s): 0.004249
Min dgemm runtime (s): 0.004238
Avg dgemm runtime (s): 0.004255 (excl. 1st run)
Result check 1000.00 1000
```


It seems to be working just fine.

NachoXmex commented 2 weeks ago

Output of point 2:

Just to check the print statement:

```
echo 1000 1000 1000 10 | LD_PRELOAD=./scilib-dbi.so ./test_dgemm.x
scilib_nvidia_init is being executed
Run with -f to if you'd like to enter the full input arguments.
Enter m, n, k niter:
dgemm runtime(s): 0.021414
dgemm runtime(s): 0.004267
dgemm runtime(s): 0.004246
dgemm runtime(s): 0.004246
dgemm runtime(s): 0.004233
dgemm runtime(s): 0.004243
dgemm runtime(s): 0.004245
dgemm runtime(s): 0.004231
dgemm runtime(s): 0.004254
dgemm runtime(s): 0.004231
Min dgemm runtime (s): 0.004231
Avg dgemm runtime (s): 0.004244 (excl. 1st run)
Result check 1000.00 1000
```


Actual testing:

```
CUBLAS error: the library was not initialized in blas/NVIDIA/dgemm.c:96
scilib_nvidia_init is being executed
dgemm_ time: total= 0.005841, compute= 0.005728, other= 0.000113
dsyrk_ time: total= 0.000290, compute= 0.000254, other= 0.000036
srun: error: nid006679: task 0: Exited with exit code 1
srun: Terminating StepId=578937.0
slurmstepd: error: STEP 578937.0 ON nid006679 CANCELLED AT 2024-09-16T10:34:20
srun: error: nid006679: tasks 1-12,14,16,18-20,22-31: Terminated
srun: error: nid006679: tasks 13,15,17,21: Terminated
srun: Force Terminated StepId=578937.0
```


nicejunjie commented 2 weeks ago

Interesting. Although there is no error for the simple test run, the timing indicates the GPU wasn't used at all. On a Hopper, multiplying 1000x1000 matrices takes only 60 to 80 us:

```
c609-001.vista(1004)$ echo 1000 1000 1000 10 | LD_PRELOAD=./scilib-dbi.so ./test_dgemm.x
Run with -f to if you'd like to enter the full input arguments.
Enter m, n, k niter:
dgemm runtime(s): 0.020648
dgemm runtime(s): 0.000084
dgemm runtime(s): 0.000068
dgemm runtime(s): 0.000066
dgemm runtime(s): 0.000064
dgemm runtime(s): 0.000065
dgemm runtime(s): 0.000065
dgemm runtime(s): 0.000065
dgemm runtime(s): 0.000064
dgemm runtime(s): 0.000064
Min dgemm runtime (s): 0.000064
Avg dgemm runtime (s): 0.000067 (excl. 1st run)
Result check 1000.00 1000
```

Can I ask you to try another test? I just pushed a new test to scilib-accel/explore/page_move_study/; it is cuBLAS test code with no automatic offload, just a native cuBLAS call. Run it simply with ./run.sh. This is my output:

```
c609-002.vista(1040)$ ./run.sh
Matrix dimensions: M=32, N=2400, K=93536
Matrix size: A 23.95 MB, B 1795.89 MB, C 0.61 MB, Total 1820.45 MB

iteration 0, CPU dgemm time : 13.037 ms, numa A B C: 0 0 0
iteration 1, CPU dgemm time : 10.423 ms, numa A B C: 0 0 0
iteration 2, CPU dgemm time : 11.350 ms, numa A B C: 0 0 0
iteration 3, CPU dgemm time : 10.389 ms, numa A B C: 0 0 0
iteration 4, CPU dgemm time : 10.370 ms, numa A B C: 0 0 0
move_page time 0.005104 of 366 pages
move_page time 0.351370 of 27404 pages
move_page time 0.000397 of 10 pages

iteration 0, cublasdgemm time : 17.013 ms, numa A B C: 1 1 1
iteration 1, cublasdgemm time : 0.536 ms, numa A B C: 1 1 1
iteration 2, cublasdgemm time : 0.520 ms, numa A B C: 1 1 1
iteration 3, cublasdgemm time : 0.520 ms, numa A B C: 1 1 1
iteration 4, cublasdgemm time : 0.518 ms, numa A B C: 1 1 1

iteration 0, CPU dgemm time : 14.236 ms, numa A B C: 1 1 1
iteration 1, CPU dgemm time : 13.959 ms, numa A B C: 1 1 1
iteration 2, CPU dgemm time : 15.070 ms, numa A B C: 1 1 1
iteration 3, CPU dgemm time : 13.998 ms, numa A B C: 1 1 1
iteration 4, CPU dgemm time : 14.001 ms, numa A B C: 1 1 1
```


The first block shows the timing of CPU BLAS with data on LPDDR5; the second block shows cuBLAS operating on data migrated from LPDDR5 to HBM, which is about 20x faster than the CPU BLAS; the last block is CPU BLAS running with the data left on HBM, which is slower than the CPU using LPDDR5 because the CPU's memory bandwidth to HBM is lower.
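For readers unfamiliar with the mechanism, the move_page timings above come from migrating the pages of the A, B, and C buffers between NUMA node 0 (LPDDR5) and NUMA node 1 (HBM). Below is a self-contained sketch of that idea using move_pages(2); the helper is hypothetical and is not the code in explore/page_move_study/. Link with -lnuma:

```c
/* Hypothetical sketch of NUMA page migration with move_pages(2); not the
 * code in explore/page_move_study/.  Compile with: cc sketch.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>

/* Move every page of buf to target_node, then query where the pages are. */
static void move_buffer_to_node(void *buf, size_t bytes, int target_node)
{
    long   pagesize = sysconf(_SC_PAGESIZE);
    size_t npages   = (bytes + pagesize - 1) / pagesize;

    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));

    for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)buf + i * pagesize;
        nodes[i] = target_node;
    }

    /* pid 0 = calling process; MPOL_MF_MOVE migrates the pages. */
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");

    /* With nodes == NULL, move_pages only reports each page's current node. */
    move_pages(0, npages, pages, NULL, status, 0);
    printf("first page of buffer is now on NUMA node %d\n", status[0]);

    free(pages); free(nodes); free(status);
}

int main(void)
{
    size_t bytes = 64UL << 20;                 /* 64 MB test buffer */
    char  *buf   = malloc(bytes);
    for (size_t i = 0; i < bytes; i += 4096)   /* touch pages so they exist */
        buf[i] = 0;

    move_buffer_to_node(buf, bytes, 1);        /* 1 = HBM node on a GH200 */
    free(buf);
    return 0;
}
```

On Grace-Hopper the GPU's HBM is exposed as an ordinary NUMA node, so this works on plain malloc'd memory.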

Best Junjie

nicejunjie commented 2 weeks ago

Hi Luis,

I see this email but couldn't find the post on GitHub; let me try to reply here and see if the email reply works.

I see two issues in the simple cuBLAS test case. First, for the initial CPU run all data should be on LPDDR5, but yours shows it on HBM (the number at the end of each iteration line indicates where the data is: NUMA 0 is LPDDR5, NUMA 1 is HBM).
Second, your GPU time is over 17 ms while it should be only about 0.5 ms.
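For reference, the GPU time in that test is essentially the cost of a plain cublasDgemm call on system-allocated memory, which the GPU can access directly on Grace-Hopper. A hypothetical minimal version of such a timing loop (not the actual page_move_study code) looks roughly like this:

```c
/* Hypothetical timing sketch for a native cublasDgemm call on Grace-Hopper,
 * where system-allocated (malloc) memory is directly accessible to the GPU.
 * Not the actual page_move_study code.  Build e.g.: nvc file.c -cuda -lcublas */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int m = 32, n = 2400, k = 93536;           /* same shape as the test */
    double *A = malloc((size_t)m * k * sizeof(double));
    double *B = malloc((size_t)k * n * sizeof(double));
    double *C = malloc((size_t)m * n * sizeof(double));
    for (size_t i = 0; i < (size_t)m * k; i++) A[i] = 1.0;
    for (size_t i = 0; i < (size_t)k * n; i++) B[i] = 1.0;

    cublasHandle_t h;
    cublasCreate(&h);
    double alpha = 1.0, beta = 0.0;

    for (int it = 0; it < 5; it++) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, 0);
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);          /* count the full GPU time */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("iteration %d, cublasDgemm time : %.3f ms\n", it, ms);

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }

    cublasDestroy(h);
    free(A); free(B); free(C);
    return 0;
}
```

After the first iteration (which pays the warm-up and page-fault cost), a healthy node should report times in the sub-millisecond range quoted above.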

So something else on your system is not working right. Let me know what you think, and I’m interested to know what is happening.

Best Junjie


Junjie Li

On Sep 18, 2024, at 2:39 AM, Luis I Hernández Segura @.***> wrote:

```
./run.sh
Lmod has detected the following error: The following module(s) are unknown: "nvhpc-hpcx-cuda12/24.7"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "nvhpc-hpcx-cuda12/24.7"

Also make sure that all modulefiles written in TCL start with the string #%Module

Matrix dimensions: M=32, N=2400, K=93536
Matrix size: A 23.95 MB, B 1795.89 MB, C 0.61 MB, Total 1820.45 MB

iteration 0, CPU dgemm time : 47.019 ms, numa A B C: 1 1 1
iteration 1, CPU dgemm time : 22.705 ms, numa A B C: 1 1 1
iteration 2, CPU dgemm time : 18.560 ms, numa A B C: 1 1 1
iteration 3, CPU dgemm time : 17.344 ms, numa A B C: 1 1 1
iteration 4, CPU dgemm time : 17.570 ms, numa A B C: 1 1 1
move_page time 0.010955 of 366 pages
move_page time 0.500337 of 27404 pages
move_page time 0.002454 of 10 pages

iteration 0, cublasdgemm time : 149.980 ms, numa A B C: 1 1 1
iteration 1, cublasdgemm time : 32.462 ms, numa A B C: 1 1 1
iteration 2, cublasdgemm time : 32.475 ms, numa A B C: 1 1 1
iteration 3, cublasdgemm time : 32.426 ms, numa A B C: 1 1 1
iteration 4, cublasdgemm time : 32.496 ms, numa A B C: 1 1 1

iteration 0, CPU dgemm time : 59.940 ms, numa A B C: 1 1 1
iteration 1, CPU dgemm time : 52.387 ms, numa A B C: 1 1 1
iteration 2, CPU dgemm time : 37.634 ms, numa A B C: 1 1 1
iteration 3, CPU dgemm time : 37.887 ms, numa A B C: 1 1 1
iteration 4, CPU dgemm time : 38.406 ms, numa A B C: 1 1 1
```


nicejunjie commented 2 weeks ago

Meanwhile, if you like, you can send me your code and a workload. I can test it on our system. I have been looking for codes that can benefit from this tool.

NachoXmex commented 2 weeks ago

I deleted the post because I just ran the script without modifying the paths, for example your scratch directory. I will adapt it and test it later. How can I share the code?

NachoXmex commented 2 weeks ago

The modules loaded are the 23.9 versions:


```
#!/bin/bash

ml load cray
ml load nvhpc-hpcx-cuda12/23.9
ml load cudatoolkit/23.9_12.2

CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_aarch64/23.9
nvc -Mnvpl dgemm-cmp3.c -mp -cuda -I${CUDA_HOME}/cuda/include -I${CUDA_HOME}/math_libs/include/ -L${CUDA_HOME}/math_libs/lib64 -lcublas -lnuma

export OMP_NUM_THREADS=72

./a.out 32 2400 93536 5
```


```
./run.sh
nvc-Error-Unknown switch: -Mnvpl
Matrix dimensions: M=32, N=2400, K=93536
Matrix size: A 23.95 MB, B 1795.89 MB, C 0.61 MB, Total 1820.45 MB

iteration 0, CPU dgemm time : 14.965 ms, numa A B C: 0 0 0
iteration 1, CPU dgemm time : 11.309 ms, numa A B C: 0 0 0
iteration 2, CPU dgemm time : 12.111 ms, numa A B C: 0 0 0
iteration 3, CPU dgemm time : 11.457 ms, numa A B C: 0 0 0
iteration 4, CPU dgemm time : 11.472 ms, numa A B C: 0 0 0
move_page time 0.008147 of 366 pages
move_page time 0.435758 of 27404 pages
move_page time 0.206989 of 10 pages

iteration 0, cublasdgemm time : 53.770 ms, numa A B C: 1 1 1
iteration 1, cublasdgemm time : 32.328 ms, numa A B C: 1 1 1
iteration 2, cublasdgemm time : 32.311 ms, numa A B C: 1 1 1
iteration 3, cublasdgemm time : 32.305 ms, numa A B C: 1 1 1
iteration 4, cublasdgemm time : 32.309 ms, numa A B C: 1 1 1

iteration 0, CPU dgemm time : 37.979 ms, numa A B C: 1 1 1
iteration 1, CPU dgemm time : 38.666 ms, numa A B C: 1 1 1
iteration 2, CPU dgemm time : 39.328 ms, numa A B C: 1 1 1
iteration 3, CPU dgemm time : 38.745 ms, numa A B C: 1 1 1
iteration 4, CPU dgemm time : 38.552 ms, numa A B C: 1 1 1
```


nicejunjie commented 2 weeks ago

NVHPC 23.9 doesn't recognize the -Mnvpl flag, which links in the NVIDIA Performance Libraries (NVPL) for the Grace CPU; it surprises me that it compiles at all.

The cuBLAS time is still unreasonable, so your Grace-Hopper is definitely not set up properly. I suspect the memory bandwidth is bad. Is there anyone at your institution doing performance testing on these nodes? Running a STREAM test may help.
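If the official STREAM source is not at hand, even a rough OpenMP triad loop gives a first bandwidth number to compare against the node's expected LPDDR5X figure. A minimal sketch of my own (not the STREAM benchmark and not part of scilib-accel):

```c
/* Rough STREAM-triad-style bandwidth check (not the official STREAM code).
 * Build e.g.: gcc -Ofast -fopenmp triad.c   or   nvc -O3 -mp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 27)   /* 2^27 doubles per array = 1 GiB each */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialization so pages land on the threads' NUMA node. */
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double best = 1e30;
    for (int iter = 0; iter < 10; iter++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        t = omp_get_wtime() - t;
        if (t < best) best = t;
    }

    /* Triad streams three arrays per iteration (two reads, one write). */
    printf("best triad bandwidth: %.1f GB/s\n",
           3.0 * N * sizeof(double) / best / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```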

Also, using the latest NVHPC compiler and GPU driver will help; NVIDIA has done quite a bit of Grace-Hopper work in recent releases, while NVHPC 23.9 can be buggy.