sdsc / spack

A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
https://spack.io

XUP-165363: PKG/SPEC - expanse/0.17.3/cpu/b - julia/intel-mkl - Install intel-mkl with threads=openmp to serve as a dependency for Julia #64

Open mkandes opened 1 year ago

mkandes commented 1 year ago

Also attempt to update Julia package from spack/spack develop to include newer versions prior to deployment to production.
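A possible spec sketch for this task (hypothetical; the exact variants and whether the julia recipe accepts an external BLAS provider should be verified against the spack develop package before deployment):

```shell
# Build intel-mkl with OpenMP threading, then build julia against it.
# Hypothetical specs -- confirm variant names with `spack info intel-mkl`
# and `spack info julia` on the target spack version (0.17.3).
spack install intel-mkl threads=openmp
spack install julia ^intel-mkl threads=openmp
```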

mkandes commented 1 year ago
[Ticket created from XUP by jde225]
[From: Jonathan Demidio]
[System: expanse.sdsc.xsede.org]
[Category: Batch Queues/Jobs]
Hello,

I have been having an issue with performance when trying to use my multithreaded Julia program on the shared partition. I use a single Julia thread, but several MKL BLAS threads to speed up linear algebra computations. The expanse documentation mentions that performance can be compromised when multithreading on the shared partition. Is it possible to set the thread affinity when using the shared partition? To give some more details, I use the following sbatch command (for instance requesting 6 threads for each job in an array of 10) :

sbatch -p shared -J jobname -a 1-10 --cpus-per-task=6 -t 1:00:00 --mem-per-cpu=2G rundqmc_jobarray.sh 6

and the script rundqmc_jobarray.sh looks like:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
#SBATCH --output=slurm-%A.out # stdout file
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --nodes=1
#SBATCH -A ukl108

echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"

julia --compiled-modules=no /expanse/lustre/projects/ukl108/jde225/sourcefolder/Run_array.jl $SLURM_ARRAY_TASK_ID $1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Julia then sets the number of BLAS threads to the value passed in (6 in this case).
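For reference, an alternative sketch (hypothetical; assumes MKL honors the usual MKL/OpenMP environment variables) is to set the thread count in rundqmc_jobarray.sh before Julia starts, rather than inside the Julia program:

```shell
# Take the thread count from the first script argument, defaulting to 6.
NTHREADS=${1:-6}
export MKL_NUM_THREADS=$NTHREADS   # threads used by MKL BLAS calls
export OMP_NUM_THREADS=$NTHREADS   # fallback if MKL reads the OpenMP setting
export JULIA_NUM_THREADS=1         # single Julia thread, as described above
echo "BLAS/MKL threads: $MKL_NUM_THREADS"
```

This keeps the thread count in one place and guarantees it matches the --cpus-per-task request.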

I know this workflow might be a little messy, but maybe there is just some extra flag I need to add? If not, then perhaps I need to use the compute nodes; in that case I'm not sure how to change my submission script (which uses job arrays) so as to package things up properly into nodes (and benefit from memory locality). Thanks a lot for your assistance.

Best regards,

Jon Demidio (jde225)
mkandes commented 1 year ago
Jonathan,

Have you tried affinity flags such as KMP_AFFINITY? Maybe try:

export KMP_AFFINITY=compact

and see how it does. Note that to some extent the performance on the shared nodes is beyond your control, because it depends on what else is running on the node. It's possible that other jobs on the same socket are memory intensive, which would impact your job. So some variability in performance is to be expected on the "shared" partition.
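As an illustration (hedged: KMP_AFFINITY is read only by Intel's OpenMP runtime, which MKL uses by default), the affinity setting could be added to the batch script like this:

```shell
# Pin MKL's OpenMP threads to adjacent cores; 'verbose' makes the runtime
# log the resulting thread-to-core binding at startup so it can be checked.
export KMP_AFFINITY=compact,verbose
export MKL_NUM_THREADS=6
echo "KMP_AFFINITY=$KMP_AFFINITY"
```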

How much variation are you seeing in performance, and how far off is it from the expected performance?

Mahidhar
mkandes commented 1 year ago
Hello Mahidhar,

Thank you very much for your help. I tried this flag, and it doesn't seem to work. When I don't use multithreading, my runtimes are more or less consistent (though it's hard to tell given the natural runtime variability of my program).

I have now created a stripped-down version of my code, basically just a simple Julia program that does some matrix manipulations, and I am clearly seeing that I get no performance gain from multithreading. This occurs even when I submit to a compute node and request resources exclusively.

I can see that Julia 1.6 is using the MKL BLAS libraries; however, the MKL Julia package is not installed. I'm not sure whether this could cause an issue, but in any case it would be nice to have the latest version, Julia 1.7, installed, since it is the first version for which the MKL package is available. Would it be possible to install this software?

Thank you for your help.
Best,
Jon