rocker-org / rocker-versioned2

Run current & prior versions of R using docker. rocker/r-ver, rocker/rstudio, rocker/shiny, rocker/tidyverse, and so on.
https://rocker-project.org
GNU General Public License v2.0

OPENBLAS error in cuda_4.3.3.sif #820

Open xc308 opened 1 month ago

xc308 commented 1 month ago

Container image name

rocker/cuda:4.3.3

Container image digest

No response

What operating system are you seeing the problem on?

Linux

System information

  1. Linux bask-pg-login01.cluster.baskerville.ac.uk 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

  2. [fwzp1184@bask-pg-login01 XC_Work]$ lscpu
     Architecture:         x86_64
     CPU op-mode(s):       32-bit, 64-bit
     Byte Order:           Little Endian
     CPU(s):               144
     On-line CPU(s) list:  0-143
     Thread(s) per core:   2
     Core(s) per socket:   36
     Socket(s):            2
     NUMA node(s):         2
     Vendor ID:            GenuineIntel
     CPU family:           6
     Model:                106
     Model name:           Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
     Stepping:             6
     CPU MHz:              2400.000
     BogoMIPS:             4800.00
     Virtualization:       VT-x
     L1d cache:            48K
     L1i cache:            32K
     L2 cache:             1280K
     L3 cache:             55296K
     NUMA node0 CPU(s):    0-35,72-107
     NUMA node1 CPU(s):    36-71,108-143

  3. [fwzp1184@bask-pg-login01 XC_Work]$ cat /proc/meminfo
     MemTotal:       527954288 kB
     MemFree:        502250764 kB
     MemAvailable:   499490632 kB
     Buffers:        5284 kB
     Cached:         5043180 kB
     SwapCached:     21844 kB
     Active:         4296360 kB
     Inactive:       9620012 kB
     Active(anon):   3348860 kB
     Inactive(anon): 9000292 kB
     Active(file):   947500 kB
     Inactive(file): 619720 kB
     Unevictable:    4207544 kB
     Mlocked:        4207544 kB
     SwapTotal:      33554428 kB
     SwapFree:       32450556 kB
     Dirty:          188 kB
     Writeback:      0 kB
     AnonPages:      13041088 kB
     Mapped:         3214484 kB
     Shmem:          3476772 kB
     KReclaimable:   1094028 kB
     Slab:           2450656 kB
     SReclaimable:   1094028 kB
     SUnreclaim:     1356628 kB
     KernelStack:    62560 kB
     PageTables:     193820 kB
     NFS_Unstable:   0 kB
     Bounce:         0 kB
     WritebackTmp:   0 kB
     CommitLimit:    297531572 kB
     Committed_AS:   14419992 kB
     VmallocTotal:   13743895347199 kB
     VmallocUsed:    3079888 kB
     VmallocChunk:   0 kB
     Percpu:         372672 kB
     HardwareCorrupted: 0 kB
     AnonHugePages:  7432192 kB
     ShmemHugePages: 0 kB
     ShmemPmdMapped: 0 kB
     FileHugePages:  0 kB
     FilePmdMapped:  0 kB
     HugePages_Total: 0
     HugePages_Free:  0
     HugePages_Rsvd:  0
     HugePages_Surp:  0
     Hugepagesize:   2048 kB
     Hugetlb:        0 kB
     DirectMap4k:    4634816 kB
     DirectMap2M:    144955392 kB
     DirectMap1G:    389021696 kB

Bug description

I recently encountered a strange error when I submitted my job to the HPC, saying: "OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions. This library was built to support a maximum of 128 threads - either rebuild OpenBLAS with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to a sufficiently small number. This error typically occurs when the software that relies on OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more CPU cores than what OpenBLAS was configured to handle."

I have never encountered this error with simulated data of size 200*5, where the precision matrices are (200*5) by (200*5). But I do get the error with my actual data, whose precision matrices are roughly (3800*5) by (3800*5).

My code offloads the giant matrix multiplications to a single GPU node; the entire negative log-likelihood calculation happens on the GPU, and only the resulting scalar is returned to the CPU for the subsequent optimization.

After encountering this error, I followed its instructions and set the environment variable at the beginning of my R script. I tried both Sys.setenv(OPENBLAS_NUM_THREADS = "126") and Sys.setenv(OPENBLAS_NUM_THREADS = "1"), but both gave me exactly the same error as above. When I then ran Sys.getenv("OPENBLAS_NUM_THREADS"), I got an empty result, [1] "".
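For concreteness, what I tried looks roughly like this (a sketch only; it assumes OpenBLAS reads OPENBLAS_NUM_THREADS when the library is initialized, so setting it from inside an already-running R session may come too late):

    ## Attempted fix (sketch): set the limit at the very top of the script,
    ## before loading any package that links against OpenBLAS.
    Sys.setenv(OPENBLAS_NUM_THREADS = "1")
    Sys.getenv("OPENBLAS_NUM_THREADS")   # returns "1" within this R session

    library(Matrix)
    library(torch)
    library(GPUmatrix)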

So I'm wondering whether the OpenBLAS library bundled in cuda_4.3.3.sif honours the environment variable OPENBLAS_NUM_THREADS at all. It feels as though OpenBLAS keeps the same number of threads no matter how small I set the variable.

In the terminal, I typed echo $OPENBLAS_NUM_THREADS and got 120.

In my slurm job description, I set job allocation parameters as below:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36

And the Rscript run command is: apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R

How to reproduce this bug?

To reproduce the error, the code submitted to the HPC can be found here: https://github.com/xc308/XC_Work/blob/main/064a_Optm_GPU_Lon_Strip_1.R. The data used by the code, df_Lon_Strip_1_Sort.rds, is in the same repository.

The code with simulated data that runs successfully without this error is here: https://github.com/xc308/XC_Work/blob/main/060_2D_Inf_neg_logL_CAR_GPU.R. The simulated data, df_2D_TW_CAMS.rds, is in the repository as well.

xc308 commented 1 month ago

Full error output is below:

r: 2
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.

The following messages are then repeated and interleaved many times across threads:

OpenBLAS : Program is Terminated. Because you tried to allocate too many memory regions.
This library was built to support a maximum of 128 threads - either rebuild OpenBLAS with a larger NUM_THREADS value or set the environment variable OPENBLAS_NUM_THREADS to a sufficiently small number. This error typically occurs when the software that relies on OpenBLAS calls BLAS functions from many threads in parallel, or when your computer has more cpu cores than what OpenBLAS was configured to handle.

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: (function (self) { .Call(_torch_cpp_torch_namespace_linalg_eig_self_Tensor, self)})(self = <pointer: 0x561f81ed45c0>)
 2: do.call(fun, args)
 3: do_call(f, args)
 4: call_c_function(fun_name = "linalg_eig", args = args, expected_types = expected_types, nd_args = nd_args, return_types = return_types, fun_type = "namespace")
 5: torch_linalg_eig(A)
 6: torch::linalg_eig(x@gm)
 7: .local(x)
 8: eigen(cov_mat, symmetric = T, only.values = T)
 9: eigen(cov_mat, symmetric = T, only.values = T)
10: check_pd_gpu(SG_inv_gpu)
11: TST12_SG_SGInv_CAR_2D_GPU(p = p, data = data_str, A_mat = all_pars_lst[[1]], dsp_lon_mat = dsp_lon_mat, dsp_lat_mat = dsp_lat_mat, dlt_lon_mat = all_pars_lst[[2]], dlt_lat_mat = all_pars_lst[[3]], b = b, phi = phi, H_adj = H_adj, sig2_mat = all_pars_lst[[4]], reg_ini = 1e-09, thres_ini = 0.001)
12: fn(par, ...)
13: (function (par) fn(par, ...))(c(0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1))
14: optim(par = all_ini_Vals, fn = neg_logL_CAR_2D_GPU, p = p, data_str = hierarchy_data_CAMS, all_pars_lst = all_pars_lst_CAR_2D_CMS, dsp_lon_mat = DSP[, , 1], dsp_lat_mat = DSP[, , 2], b = "Tri-Wave", phi = phi, H_adj = H_adj, df = df_Lon_Strp_1_Srt, method = "L-BFGS-B", lower = lower_bound, control = list(maxit = 200, factr = 0.01/.Machine$double.eps))

An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job742090/slurm_script: line 14: 3946568 Segmentation fault (core dumped) apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R

benz0li commented 1 month ago

@xc308 Can you try limiting the number of threads by setting OMP_NUM_THREADS?

See also https://scikit-learn.org/stable/computing/parallelism.html#lower-level-parallelism-with-openmp

benz0li commented 1 month ago

and search for similar issues at https://github.com/OpenMathLib/OpenBLAS/issues.

benz0li commented 1 month ago

ℹ️ https://github.com/OpenMathLib/OpenBLAS?#setting-the-number-of-threads-using-environment-variables

Most likely, PyTorch[^1] is using an OpenMP-enabled OpenBLAS library [which is not the system's OpenBLAS library].

[^1]: _torch_cpp_torch_namespace_linalg_eig_self_Tensor points to PyTorch

xc308 commented 1 month ago

@benz0li Hi, I use R, not Python. I run my R script with Apptainer: apptainer exec --nv ../cuda_4.3.3.sif Rscript 064a_Optm_GPU_Lon_Strip_1.R. Do you know how to change the environment variable for Apptainer?

benz0li commented 1 month ago

> Do you know how to change the environment variable for Apptainer?

https://apptainer.org/docs/user/main/environment_and_metadata.html

benz0li commented 1 month ago

> I use R, not Python.

Some R packages use Python in the background, e.g. packages tensorflow, torch, etc.

What R packages are you using?

xc308 commented 1 month ago

> Do you know how to change the environment variable for Apptainer?
>
> https://apptainer.org/docs/user/main/environment_and_metadata.html

Yeah, this is what I was just reading, and I think I've managed to solve the problem. I set the environment variable for Apptainer by adding the --env flag:

apptainer exec --nv --env OPENBLAS_NUM_THREADS=1 ../cuda_4.3.3.sif Rscript hello.R

Now the code has been running for almost an hour with no error so far.

xc308 commented 1 month ago

> I use R, not Python.
>
> Some R packages use Python in the background, e.g. packages tensorflow, torch, etc.
>
> What R packages are you using?

I use cuda_4.3.3.sif; I'm not sure what the R version is, it should be the most up-to-date one.

xc308 commented 1 month ago

I also tried setting OPENBLAS_NUM_THREADS to 5 and 10, but both gave the same error. Do you know why only OPENBLAS_NUM_THREADS=1 works? And what is the impact of setting it to 1?

benz0li commented 1 month ago

> I use R, not Python.
>
> Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?
>
> I use cuda_4.3.3.sif; I'm not sure what the R version is, it should be the most up-to-date one.

You are using R v4.3.3, then.

But what packages are you loading with library in your R script?

xc308 commented 1 month ago

> I use R, not Python.
>
> Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?
>
> I use cuda_4.3.3.sif; I'm not sure what the R version is, it should be the most up-to-date one.
>
> You are using R v4.3.3, then.
>
> But what packages are you loading with library in your R script?

I load library(Matrix), library(torch), and library(GPUmatrix).

benz0li commented 1 month ago

> I use R, not Python.
>
> Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?
>
> I use cuda_4.3.3.sif; I'm not sure what the R version is, it should be the most up-to-date one.
>
> You are using R v4.3.3, then. But what packages are you loading with library in your R script?
>
> I load library(Matrix), library(torch), and library(GPUmatrix).

library(torch), i.e. package torch, uses the 'libtorch' library.

See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.
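For example, a minimal sketch of querying and setting those values at the top of an R script (the functions are documented at the link above; the inter-op setting generally has to be made before any parallel torch work starts):

    library(torch)
    torch_get_num_interop_threads()    # size of the inter-op thread pool
    torch_get_num_threads()            # size of the intra-op thread pool

    torch_set_num_interop_threads(1)   # call early, before any parallel work
    torch_set_num_threads(1)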

xc308 commented 1 month ago

> I use R, not Python.
>
> Some R packages use Python in the background, e.g. packages tensorflow, torch, etc. What R packages are you using?
>
> I use cuda_4.3.3.sif; I'm not sure what the R version is, it should be the most up-to-date one.
>
> You are using R v4.3.3, then. But what packages are you loading with library in your R script?
>
> I load library(Matrix), library(torch), and library(GPUmatrix).
>
> library(torch), i.e. package torch, uses the 'libtorch' library.
>
> See https://torch.mlverse.org/docs/reference/threads about setting/getting the number of threads in your R script.

Ah, thank you very much for this useful information!

I used torch_get_num_interop_threads() and torch_get_num_threads() and obtained 72 inter-op threads and 36 intra-op threads.

However, I don't entirely understand this given my Slurm parameter settings:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=36

When the inter-op pool has already grabbed all the threads I requested (72 threads), why does the intra-op pool still have 36 threads?

xc308 commented 1 month ago


Each CPU core has 2 threads on my HPC, by the way.

xc308 commented 1 month ago

In addition, I'm wondering whether the problem is the 72 inter-op threads.

Suppose my algorithm has 50 steps: the first 25 steps are large matrix multiplications done only on the CPU, while the remaining 25 steps are offloaded to the GPU.

I'm not sure whether the OpenBLAS error occurs because the inter-op pool has grabbed all the available threads (72) I requested, leaving no threads for OpenBLAS to parallelize the routine matrix multiplications over the CPUs.

If this understanding is correct, then instead of forcing OpenBLAS onto a single CPU by setting OPENBLAS_NUM_THREADS=1 (which would be very slow for the first 25 steps of large matrix multiplications on the CPU), I could reduce the inter-op threads, since there are not many tasks to parallelize (ntasks-per-node=1), and spare more CPUs for OpenBLAS, as sketched below.
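In code, the split I have in mind would look roughly like this (only a sketch; the thread counts are illustrative, and setting OPENBLAS_NUM_THREADS from inside R may be less reliable than passing it with apptainer --env):

    ## Illustrative thread budget: a small pool for torch's inter-/intra-op
    ## parallelism, leaving the remaining CPUs for OpenBLAS.
    Sys.setenv(OPENBLAS_NUM_THREADS = "32")   # illustrative value
    library(torch)
    torch_set_num_interop_threads(2)
    torch_set_num_threads(2)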

Please kindly advise. Thank you very much in advance!

xc308 commented 1 month ago


I decreased the number of inter-op threads to 2 (the default was 72) and the intra-op threads to 18 (the default was 36), and set the environment variable OPENBLAS_NUM_THREADS=2, but still got the same error.

xc308 commented 1 month ago


I also set OMP_NUM_THREADS=2 on top of OPENBLAS_NUM_THREADS=2 (with inter-op threads = 2 and intra-op threads = 18), but still got the same error.

xc308 commented 1 month ago


I also set inter-op threads = 2, intra-op threads = 2, OMP_NUM_THREADS=2, and OPENBLAS_NUM_THREADS=2, but got the same error. FYI.

eitsupi commented 1 month ago

@cboettig Could you take a look at this?

xc308 commented 1 month ago

I also tried this: library(RhpcBLASctl), then blas_get_num_procs() (which returned 36), then blas_set_num_threads(48). I also modified the Slurm parameter to #SBATCH --cpus-per-gpu=48, but still got the same error. FYI.
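In sketch form, that attempt was roughly the following (assuming it runs near the top of the script, before the heavy linear algebra):

    library(RhpcBLASctl)
    blas_get_num_procs()      # reported 36 on the GPU node
    blas_set_num_threads(48)  # request 48 BLAS threads, matching --cpus-per-gpu=48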

cboettig commented 1 month ago

@xc308 can you try this on rocker/rstudio or similar image from the versioned stack for comparison?

I'm unclear why you are using the cuda images here. The cuda images should indeed have support for NVBLAS (you have to opt into it, and it is not extensively tested) if you do want to leverage the GPU. But unless I'm missing something, it seems you are just using the CPU with OpenBLAS, which should work out of the box on the standard rocker/r-ver and rocker/rstudio series.

Can you show the output of sessionInfo() as well? Also, please test whether OpenBLAS is working for you on some standard linear algebra before we worry about the torch bindings.

I recommend these examples (which also show how to opt in to NVBLAS if you want GPU-accelerated linear algebra -- note that it is not always faster; it depends on both your hardware and the overhead of copying data onto the GPU): https://github.com/rocker-org/ml/blob/master/examples/test_blas.R
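For instance, a minimal CPU-side check along those lines (a sketch, not the contents of the linked test_blas.R) could be:

    ## Time a dense matrix product and a symmetric eigendecomposition; both
    ## dispatch to whatever BLAS/LAPACK the R session is linked against.
    n <- 2000
    A <- matrix(rnorm(n * n), n, n)
    B <- matrix(rnorm(n * n), n, n)
    system.time(A %*% B)
    system.time(eigen(crossprod(A), symmetric = TRUE, only.values = TRUE))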

xc308 commented 1 month ago

"But unless I'm missing something it seems you are just using CPU with openblas, "

No: if my algorithm has 50 steps, the first 25 steps are done on the CPU, but the remaining 25 steps are offloaded to the GPU, so I do need the cuda image here.

"Can you show the output of sessionInfo() as well?"

Check current BLAS library:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] RhpcBLASctl_0.23-42 torch_0.12.0 Matrix_1.6-5

loaded via a namespace (and not attached): [1] processx_3.8.4 bit_4.0.5 compiler_4.3.3 magrittr_2.0.3 cli_3.6.2
[6] Rcpp_1.0.12 bit64_4.0.5 coro_1.0.4 grid_4.3.3 callr_3.7.6
[11] ps_1.7.6 rlang_1.1.3 lattice_0.22-5

"please test if openblas is working for you on some standard linear algebra "

I did test OpenBLAS. When a GPU node is requested, the BLAS threads are automatically set to 36 (the same as the intra-op threads). In that case I have to set the environment variable OPENBLAS_NUM_THREADS to 1; any other number throws the same error as reported above.

"if you want GPU-accelerated linear algebra"

Since the first 25 steps of the algorithm involve a few loops, it's not ideal to offload them to the GPU; they are better left on the CPU. That's why I'm thinking of increasing the BLAS threads to speed up that part of the calculation.

cboettig commented 1 month ago

@xc308 Thanks. I understand you are running a complex algorithm with many steps and it is not working as expected. When trying to debug code, it is helpful to reproduce the problem with a minimal example rather than attempt to debug a complex algorithm with many steps and interleaved CPU & GPU dispatch.

Please see the simple matrix multiplication examples in the tests I linked above and check whether they work as expected. If they do not, we can try to debug. If they do work as expected on both the standard and the cuda images, then we will need to isolate the issue further, as it is not specifically an issue with the OpenBLAS configuration. In that case, please put together a minimal reproducible example that we can run to generate the behavior you are seeing. Hope this helps.