Running LU with StarPU and CUDA on multiple machines

starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!

https://starpu.gitlabpages.inria.fr/

GNU Lesser General Public License v2.1


Running LU with StarPU and CUDA on multiple machines #23

Closed. TommyUW closed this issue 1 year ago.

TommyUW commented 1 year ago

Hello, thank you for your previous help. Currently the example code for LU with MPI, StarPU and CUDA is able to run on two machines, each of which is equipped with two GPUs. I am trying to run this program on both machines simultaneously so that all four GPUs can be used. Here is my command:
STARPU_SCHED=dmda STARPU_NCPU=1 OPENBLAS_NUM_THREADS=1 STARPU_WORKERS_NOBIND=1 STARPU_NCUDA=4 STARPU_NOPENCL=0 mpirun -n 4 -f mpi_config_file ./plu_example_double 8 -size 4096 -nblocks 16 -p 2 -q 2
The mpi_config_file contains:
node1:2
node2:2
However, this command doesn't work and the terminal shows that only 2 GPUs are available. How should I change my command?

Besides, on my laptop, I have encountered another interesting thing: as shown in the picture, the program starts, but it seems to get stuck. I waited for five minutes and still got no results. My CUDA version is 9.1 and the driver version is 530. Is it because the driver version is too high?

Thank you very much

nfurmento commented 1 year ago

What exactly does "However, this command doesn't work and the terminal shows that only 2 GPUs are available. How should I change my command?" mean?

Based on your description, each of your MPI nodes should indeed have 2 GPUs.

As for your second question, I would advise you to first try without setting any variable, and with more than 1 process.

TommyUW commented 1 year ago

As shown in the picture, we want the four MPI processes to use the four GPUs (two on each machine). However, the terminal shows "Warning: 4 CUDA devices requested. Only 2 available." If we change to STARPU_NCUDA=2, the warning goes away, but the performance is significantly reduced.
[image: image0(1)]

nfurmento commented 1 year ago

When using MPI, you have 1 StarPU process running on each node, and each StarPU process only sees the GPU devices on the node it is running on. So what you are asking is not possible.
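For readers following along, the one-process-per-node layout described above can be sketched as a small dry-run script. It only writes the host file and prints the launch command; nothing is executed. The host names, the `-f` host-file option, and the `-size`, `-nblocks`, `-p`, `-q` options are copied from the command earlier in this thread. That `-p` times `-q` has to equal the number of MPI processes is an assumption about the example program, and the positional `8` from the original command is left out because its meaning is not stated here.

```shell
#!/bin/sh
# Dry-run sketch: build a one-process-per-node host file and print the
# launch command. Nothing is actually launched.
NODES="node1 node2"          # host names as used earlier in the thread
HOSTFILE="mpi_config_file"   # same file name as in the thread

: > "$HOSTFILE"              # start from an empty host file
NPROCS=0
for node in $NODES; do
    echo "$node:1" >> "$HOSTFILE"   # ":1" requests one MPI process on this host
    NPROCS=$((NPROCS + 1))
done

# Assumption: the process grid (-p x -q) must match the number of MPI
# processes, so two processes give a 2 x 1 grid.
LAUNCH="mpirun -n $NPROCS -f $HOSTFILE ./plu_example_double -size 4096 -nblocks 16 -p $NPROCS -q 1"
echo "$LAUNCH"
```

With no STARPU_NCUDA set, each of the two StarPU processes would then be free to pick up both GPUs of its own node, which is how all four GPUs end up in use.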

TommyUW commented 1 year ago

> When using MPI, you have 1 StarPU process running on each node, and each StarPU process only sees the GPU devices on the node it is running on. So what you are asking is not possible.

I have changed the content of mpi_config_file. It is now node1:1, node2:1. This means each machine uses 1 CPU core. Now I am able to use one GPU from each machine to run the LU program. However, the performance is low and not all of the GPUs are used. In short, how can I change my command to use all four GPUs on these two machines to run the program?
[image: image0(2)]

TommyUW commented 1 year ago

Also, I just checked the GPU usage. It is very low. Is it because of core oversubscription when running LU on multiple machines? Moreover, even if I set NCUDA to 0, the performance becomes very low again.

nfurmento commented 1 year ago

Do not set any environment variable; StarPU will use all the GPUs on each node.

And please, make sure your text message is consistent with the image you put below. You said you set NCUDA to 0 but in the image it says STARPU_NCUDA=2
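Clearing the variables before launching can be sketched as follows. The variable names are the ones set in the command earlier in this thread, and nothing else is assumed about the environment. The snippet only unsets them in the current shell and reports what is left.

```shell
#!/bin/sh
# Sketch: remove the StarPU-related settings used earlier in the thread,
# so that StarPU falls back to its own detection of CPUs and GPUs.
for var in STARPU_SCHED STARPU_NCPU STARPU_NCUDA STARPU_NOPENCL \
           STARPU_WORKERS_NOBIND OPENBLAS_NUM_THREADS; do
    unset "$var"
done

# Report any STARPU_* variables that are still set (for example, ones
# exported elsewhere that the loop above does not know about).
REMAINING=$(env | grep '^STARPU_' || true)
if [ -z "$REMAINING" ]; then
    echo "no STARPU_ variables set"
else
    echo "$REMAINING"
fi
```

Note that variables written as a prefix on the `mpirun` command line, as in the original command, only apply to that one invocation, so leaving them off the command line has the same effect.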

nfurmento commented 1 year ago

And you should also use all the CPUs on the nodes, not just one.

TommyUW commented 1 year ago

I am so sorry about the mistake. I did not add any variables this time. Yes, StarPU has used all the GPUs. However, the performance is still very low. I realize that my problem is not the GPUs but running MPI on multiple machines. As shown in the first picture, when I run the MPI StarPU LU on a single machine with the command that I posted, the performance increases. However, in the second picture, on multiple machines, the performance becomes significantly lower whether I add no variables or use the same command. The reason I add these variables is to keep StarPU from using all the CPU cores. I remember you said that a program with MPI will experience core oversubscription, as StarPU uses all the cores automatically. So, to increase the performance of the LU program on multiple machines, what command should I use exactly? Thank you very much.

[image: image0(4)] [image: image0(3)]

TommyUW commented 1 year ago

Also, please take a look at this picture. I set NCUDA=0. The performance of this program on a single machine with 2 processes is higher than on multiple machines.
[image: image0(5)]

nfurmento commented 1 year ago

As I said before, when running MPI, you will get the best performance when running 1 process on 1 node, assuming the nodes are connected through a high-bandwidth network. You should talk to the people managing your cluster and see how to get the best performance with MPI.

TommyUW commented 1 year ago

> As I said before, when running MPI, you will get the best performance when running 1 process on 1 node, assuming the nodes are connected through a high-bandwidth network. You should talk to the people managing your cluster and see how to get the best performance with MPI.

So in short, I can only use two MPI processes in order to use all four GPUs, correct? To get the best performance, each process runs on one machine and uses all of its CPU cores through StarPU. Besides, it is impossible for me to add more processes on the two machines to improve scalability, right?

nfurmento commented 1 year ago

The number of MPI processes has nothing to do with the GPUs. I just said the GPUs are only visible to the process running on the machine. I will close the issue as your problems with MPI are not related to StarPU.