starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Blocked LU with StarPU-MPI #7

Closed TommyUW closed 1 year ago

TommyUW commented 1 year ago

To whom it may concern: I am trying to run the MPI blocked LU example with StarPU. However, the running time kept increasing as the number of processes increased, and even when the time did decrease, it did not decrease significantly. Why did this happen?

[Image: StarPU mpi blocked LU]

sthibaul commented 1 year ago

It depends a lot on the details of the platform, your execution script, etc.

Your output also shows a very low GFlop/s result, so something really odd seems to be happening on your machines.
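As a rough way to sanity-check a GFlop/s figure like the one mentioned above, the rate can be recomputed by hand from the matrix size and the elapsed time. This is only a sketch: it assumes the usual leading-order operation count of 2n³/3 for a dense LU factorization and a time given in seconds, and the function name `lu_gflops` and the 2.0 second timing are hypothetical, not taken from the example's output.

```python
def lu_gflops(n: int, seconds: float) -> float:
    """Approximate GFlop/s for a dense LU factorization of an n x n matrix."""
    # Leading-order floating-point operation count of LU: 2 * n^3 / 3.
    flops = 2.0 * n**3 / 3.0
    return flops / seconds / 1e9


# Hypothetical timing: a 4096 x 4096 factorization that took 2.0 seconds.
print(f"{lu_gflops(4096, 2.0):.1f} GFlop/s")
```

If the value obtained this way is far below what a single core of the machine can deliver, the problem is more likely in the setup than in the algorithm.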

TommyUW commented 1 year ago

It depends a lot on the details of the platform, your execution script, etc.

Your output also shows a very low GFlop/s result, so something really odd seems to be happening on your machines.

Thank you for your reply. I ran this code on my laptop, my groupmate's laptop, and the cluster machine in the lab. On all of them, the running time got longer as the number of processes increased. Here are all the commands that I entered before running this example:

Install MPI

Setting up StarPU:

$ apt-cache search starpu
$ sudo apt-get install libstarpu-1.3 libstarpu-dev
$ ./autogen.sh
$ ./configure
$ mkdir build
$ cd build
$ ./configure
$ make
$ make install
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$STARPU_PATH/lib/pkgconfig

Then I went into the mpi directory in starpu-master and ran the mpi_lu code with the command:

mpirun -n 4 ./plu_example_double 8 -size 4096 -nblocks 8 -p 4 -q 1

If the value of q is more than 1, the running time is longer. I also changed the number of OpenMP threads with the command export OMP_NUM_THREADS=N, with N from 1 to 16. The running time still didn't get shorter. Would you please look over the procedure above? Are there any steps that I am missing? Thank you very much.

sthibaul commented 1 year ago

I run this code on my laptop, my groupmate's laptop and the cluster machine in the lab

But are you sure that they do get used? Does top show that they indeed get to use ample CPU time there, and no other program is running?

-size 4096

This size is quite small; it is better to use larger matrices.

I changed the thread of OpenMP with the command: export OMP_NUM_THREADS = from 1 to 16. The running time still didn't shorten.

The LU example does not support parallel tasks, so the number of OpenMP threads should be kept at 1.

You can also check with FxT traces whether the load balance is correct.

TommyUW commented 1 year ago

Thank you for your reply. We are sure that no other program is running on our machines. We tried to run a 40960x40960 matrix this time. However, the performance is still not good:

single process: 20230.978518
two processes: 837848.915379
four processes: 541028.325787

We have tested matrices of various sizes. But no matter what the size is, the run with two processes is always slower than the single-process run. The load balance is correct. The whole set of programs we used is included in the examples shipped with StarPU and was written by the StarPU developers. We basically followed the StarPU manual and installed everything it required. We were able to compile and run StarPU programs successfully. However, the performance of all the LU programs from the StarPU-MPI examples wasn't good.
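To put the timings above in perspective, the speedup T1/Tp and the parallel efficiency (speedup divided by the number of processes) can be computed directly. This is a small sketch using the three reported figures as they are; since both quantities are ratios, the unit of the timings does not matter. The helper name `speedup_and_efficiency` is made up for this illustration.

```python
def speedup_and_efficiency(t1: float, tp: float, p: int) -> tuple[float, float]:
    """Return (speedup, efficiency) of a run on p processes taking tp,
    relative to the single-process time t1."""
    speedup = t1 / tp
    return speedup, speedup / p


# Timings reported above, keyed by number of MPI processes.
timings = {1: 20230.978518, 2: 837848.915379, 4: 541028.325787}

t1 = timings[1]
for p, tp in timings.items():
    s, e = speedup_and_efficiency(t1, tp, p)
    print(f"{p} process(es): speedup {s:.4f}, efficiency {e:.4f}")
```

A speedup well below 1 means the parallel runs are many times slower than the single-process run, not merely failing to scale.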

nfurmento commented 1 year ago

On 22/02/2023 at 07:45, tianyzhao1 wrote:

Thank you for your reply. We are sure that no other program is running on our machines. We tried to run a 40960x40960 matrix this time. However, the performance is still not good:

single process: 20230.978518
two processes: 837848.915379
four processes: 541028.325787

We have tested matrices of various sizes. But no matter what the size is, the run with two processes is always slower than the single-process run. The load balance is correct. The whole set of programs we used is included in the examples shipped with StarPU and was written by the StarPU developers. We basically followed the StarPU manual and installed everything it required. We were able to compile and run StarPU programs successfully. However, the performance of all the LU programs from the StarPU-MPI examples wasn't good.

Hello,

Which version of StarPU are you using? In a previous message, you wrote

$ apt-cache search starpu
$ sudo apt-get install libstarpu-1.3 libstarpu-dev
$ ./autogen.sh
$ ./configure
$ mkdir build
$ cd build
$ ./configure
$ make
$ make install
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$STARPU_PATH/lib/pkgconfig

The first 2 lines install StarPU as a Debian package, whereas the lines below build and install StarPU directly from the sources.

Also, as Samuel said, generating an FxT trace should help to understand what is happening.

Cheers,

Nathalie

sthibaul commented 1 year ago

We are sure that no other program is running on our machines

Yes, but are you sure that you are really using the different CPU cores?

For a start, you can e.g. run /bin/hostname through mpirun to make sure that the different MPI ranks actually go to different machines.
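One way to read the result of such a check, for instance the output of something like `mpirun -n 4 /bin/hostname`, is to count how many ranks landed on each host. The sketch below does that on captured output; the host name `node1` and the helper `ranks_per_host` are hypothetical and only serve as an illustration.

```python
from collections import Counter


def ranks_per_host(hostname_output: str) -> Counter:
    """Count how many MPI ranks reported each host name (one name per line)."""
    return Counter(
        line.strip() for line in hostname_output.splitlines() if line.strip()
    )


# Hypothetical captured output of a 4-rank run: every rank printed the same host.
sample_output = "node1\nnode1\nnode1\nnode1\n"

counts = ranks_per_host(sample_output)
for host, n in counts.items():
    print(f"{host}: {n} rank(s)")
if len(counts) == 1:
    print("All ranks ran on a single machine.")
```

If all ranks report the same host, they share that machine's cores, which can explain why adding processes does not reduce the running time.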