Closed: Muxas closed this issue 3 months ago
We have never seen such a thing. Are you sure it is not the kernel itself that takes that much time?
It could be useful to inspect with gdb or the like what state the threads driving the bogus GPUs usually are in, to check what is hanging.
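For example, attaching to the running process and dumping all thread backtraces usually shows where the workers are stuck (standard gdb commands, with <pid> being the application's process id):
gdb -p <pid>
(gdb) thread apply all bt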
> It seems to me that there is some race to acquire some mutex.
That seems very unlikely to me; we have not seen such a thing. Perhaps try the lws scheduler, which really minimizes interaction between workers.
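For reference, the scheduler can be selected through the STARPU_SCHED environment variable, e.g.
STARPU_SCHED=lws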
Are you using scheduling contexts?
We moved from StarPU version 1.3.11 to version 1.4.7 and the issue has gone away. The issue is not related to the CUDA kernel of the codelet, as it simply calls cuBLAS. I believe that time tracking of codelets was substantially reworked in the 1.4 series and the issue is no more.
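For context, here is a minimal sketch of what such a codelet looks like. This is not the actual code from this issue; the symbol name gemm_tf32 and the argument packing are illustrative placeholders for a StarPU CUDA codelet that just wraps a TF32 cuBLAS GEMM behind a history-based perfmodel:

```c
/* Minimal sketch (not the actual code from this issue): a StarPU CUDA codelet
 * that only wraps a TF32 cuBLAS GEMM. Assumes starpu_cublas_init() was called
 * at startup so that each worker has its own cuBLAS handle. */
#include <starpu.h>
#include <starpu_cublas_v2.h>
#include <cublas_v2.h>

static void gemm_cuda(void *buffers[], void *cl_arg)
{
    int m, n, k;
    starpu_codelet_unpack_args(cl_arg, &m, &n, &k);

    const float *A = (const float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const float *B = (const float *)STARPU_MATRIX_GET_PTR(buffers[1]);
    float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[2]);
    const float alpha = 1.0f, beta = 0.0f;

    /* Handle provided by StarPU, already bound to the worker's CUDA stream */
    cublasHandle_t handle = starpu_cublas_get_local_handle();
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_32F, m,
                 B, CUDA_R_32F, k,
                 &beta, C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F_FAST_TF32, CUBLAS_GEMM_DEFAULT);
}

static struct starpu_perfmodel gemm_model = {
    .type = STARPU_HISTORY_BASED,
    .symbol = "gemm_tf32",               /* placeholder symbol name */
};

static struct starpu_codelet gemm_cl = {
    .cuda_funcs = { gemm_cuda },
    .cuda_flags = { STARPU_CUDA_ASYNC }, /* kernel is submitted asynchronously */
    .nbuffers = 3,
    .modes = { STARPU_R, STARPU_R, STARPU_RW },
    .model = &gemm_model,
};
```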
Even apart from this example, I saw some trouble with certain kernels as soon as memory consumption got slightly larger than what a GPU was allowed to handle (set by an environment variable). In that case the measured time grew about 100x, as if the GPU were actually using CPU memory through a unified-memory mechanism.
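The per-device limit mentioned above is presumably the one set through a StarPU environment variable such as STARPU_LIMIT_CUDA_MEM (value in MB), e.g.
STARPU_LIMIT_CUDA_MEM=65536
though the exact variable used may differ.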
> I believe that time tracking of codelets was substantially reworked in the 1.4 series and the issue is no more
Well, no, the time tracking has not changed at all...
Hi! I have noticed that the performance of a cuBLAS matrix multiplication of a specific size (4096x2560 @ 2560x2560 -> 4096x2560, TF32 type) on a server with 8 H100 GPUs is measured incorrectly by StarPU 1.3.11 in certain circumstances. Judging by a perfmodel file from the .starpu directory, one or two GPUs perform at 280 Tflops/s, while the others perform at 0.5-5 Tflops/s. I have experienced something similar before, but not at such a scale. Due to this behavior, all matrix multiplications are scheduled on the 1-2 GPUs with correct timings and the others are simply idle. What can influence the measurements? I tried
STARPU_WORKERS_NOBIND=1
but it did not help.
Normal execution time: around 190 usec
Abnormal execution time: around 70-100 msec
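For reference, this GEMM amounts to roughly 2 * 4096 * 2560 * 2560 ≈ 53.7 Gflop, so ~190 usec corresponds to ~283 Tflops/s while 70-100 msec corresponds to only ~0.5-0.8 Tflops/s, consistent with the 280 vs. 0.5-5 Tflops/s entries in the perfmodel file.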
It seems to me that there is some race to acquire some mutex.
Here is an example codelet sampling file: