starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Blocked LU starpu-mpi performance analysis #14

Closed TommyUW closed 1 year ago

TommyUW commented 1 year ago

Hello, I have run the example LU decomposition code with StarPU and MPI. I have installed the latest version of StarPU, 1.4.0. I set the matrix size to 4096x4096 and the number of blocks to 16. I have tested the program with 1, 4, 8, and 16 processes. As shown in the attached image, the total computation time of the program increases as the number of processes increases. I don't know whether the performance of the program is improving or not. Can someone explain it to me? Also, how do I analyze the performance of the program? Thank you very much. Regards, Tianyu

[attached image: Blocked LU timing results]

sthibaul commented 1 year ago

What is your actual platform?

Getting 8 GFlop/s on a single node is already very concerning (on my laptop I get 113 GFlop/s); it seems your BLAS etc. is behaving wrongly.

Also, to scale over e.g. 16 nodes you really want a large matrix and a larger number of blocks.

how do I analyze the performance of the program

See https://files.inria.fr/starpu/doc/html/OfflinePerformanceTools.html#Off-linePerformanceFeedback
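For reference, a minimal sketch of that trace workflow, assuming StarPU was configured with FxT support (--with-fxt); file names and paths are placeholders to adapt to your setup:

# enable trace recording, then run the program as usual
export STARPU_FXT_TRACE=1
mpirun -n 4 ./plu_example_double 8 -size 4096 -nblocks 16 -p 2 -q 2
# convert the raw per-process trace, written by default to /tmp/prof_file_<user>_<id>
starpu_fxt_tool -i /tmp/prof_file_<user>_<id>
# the generated paje.trace can then be opened with a trace visualizer such as ViTE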

TommyUW commented 1 year ago

Thank you so much for your reply! My platform is just my ThinkPad laptop. It has four CPUs and each has two cores. Also, CUDA is not enabled, just StarPU and MPI, so the GFlop/s is relatively low. But what I want is scalability: I want my program to run faster with 2 or 4 processes than with a single process. I have installed OpenBLAS. Is it possible that something is wrong with it? It works fine for my other programs. The largest matrix that I can run is 8192x8192; if it gets any larger, the running time becomes too long.

What I found through the top command is that when the program is running with 1 MPI process, the CPU usage is 300%. When running on 2 processes, the usage is 200%; with 4 processes, 100%; with 8 processes, 50%. The usage seems very weird. I used export OMP_NUM_THREADS=1 before running my program every time, but the running time didn't change and neither did the CPU usage. I am working on the FXT file now. The command that I input is

mpirun -n 4 ./plu_example_double 8 -size 4096 -nblocks 16 -p 2 -q 2

Are there any extra commands that I am missing?

By the way, is it possible that we can set up a Zoom meeting so that we can communicate more conveniently on this problem? Thank you very much.

TommyUW commented 1 year ago

In addition, the version of StarPU is 1.4.0 and the version of MPI is 3.3.2.

nfurmento commented 1 year ago

It is not clear from your messages, but are you running your MPI application on your laptop? There is no chance of getting good performance then.

sthibaul commented 1 year ago

My platform is just my ThinkPad laptop

If you are running StarPU only on your laptop, then using several MPI processes won't improve performance, on the contrary. StarPU already uses all the cores of a single machine. Additionally, running MPI will only introduce communication overhead and core oversubscription (you see it with the lower and lower CPU % usage: of course they have to share), which will make the whole thing very slow due to the spin locks.

It has four CPUs and each has two cores

I guess it's rather 4 cores and each has two threads. Check out what hwloc thinks of it.
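For example, assuming the hwloc command-line tools are installed, the following shows how many physical cores and hardware threads (PUs) the machine really has:

# text-mode topology dump: count the Core and PU entries
lstopo --of console
# hwloc-ls prints the same summary
hwloc-ls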

Also cuda is not enabled, just starpu and mpi, so the GFlops is relatively low.

Yes, but on my laptop alone (4 cores), I do get 100GFlop/s.

I have installed OpenBLAS. Is it possible that something can go wrong with it?

You can try to run with one MPI process and -nblocks 1; in that case StarPU will not perform any parallelism and will just let OpenBLAS try to achieve whatever it can. If that's still 8 GFlop/s then yes, there is a problem with it: on a not-too-old core I'd expect at least something like 20-30 GFlop/s.
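A possible way to run that check, reusing the binary and arguments from your earlier command (the sizes are only an example):

# one process, one block: the factorization is submitted as a single task,
# so the GFlop/s the program reports is essentially what OpenBLAS achieves on its own
mpirun -n 1 ./plu_example_double 8 -size 4096 -nblocks 1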

mpirun -n 4 ./plu_example_double 8 -size 4096 -nblocks 16 -p 2 -q 2

Are there any extra commands that I am missing?

No, that would be it if you really had 4 different machines.
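For illustration, a run across 4 real machines would look roughly like this, with a machine file listing one hostname per line (hosts.txt and the node names are placeholders, and the matrix/block sizes are only indicative):

# hosts.txt contains node01, node02, node03, node04, one per line
mpirun -n 4 -f hosts.txt ./plu_example_double 8 -size 40960 -nblocks 64 -p 2 -q 2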

By the way, is it possible that we can set up a ZOOM meeting so that we can communicate more conveniently on this problem?

I'm afraid I cannot spend time on this.

TommyUW commented 1 year ago

Hello, thank you very much for your reply. I used the machines at the lab today. There are two identical AMD machines; each socket has 64 cores and each core has 2 threads. I ran a 40960x40960 matrix with 200 blocks. However, the scalability was terrible. With one process, the running time of the program was 20230ms, but with two processes the running time was 837848ms, about 40 times slower. The command that I used was

mpiexec -n 2 -f /home/wtc/wy/mpi_config_file ./plu_example_double 8 -size 40960 -nblocks 200 -p 2 -q 1

Is there any problem? On each machine, the CPU usage was thousands of percent. If the scalability is good, what should the usage be? I remember you told me to use bigger matrices, since I used 4096 before. Is this one big enough? How big a matrix and how many blocks would you recommend?

nfurmento commented 1 year ago

Please make sure your MPI application has 1 process per node by running

mpiexec -n 2 -f /home/wtc/wy/mpi_config_file hostname

and also, as advised before, please read https://files.inria.fr/starpu/doc/html/OfflinePerformanceTools.html#Off-linePerformanceFeedback to find out how to analyze the performance of a program.
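If that command prints the same hostname twice, the processes are not spread over the two machines. With MPICH's hydra launcher (which the -f option suggests you are using), one process per node can be forced explicitly, for example:

# -ppn 1 asks hydra to place only one process on each host of the machine file
mpiexec -n 2 -ppn 1 -f /home/wtc/wy/mpi_config_file ./plu_example_double 8 -size 40960 -nblocks 200 -p 2 -q 1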

TommyUW commented 1 year ago

Really appreciate your reply. Currently I am working on running the program on the two machines. I also used the link and analyzed the performance of my program. Here is what I found: on only one machine, the running time of the program is divided into executing, callback, waiting, sleeping, and scheduling. As I increased the number of processes, the executing time didn't shorten; it remained basically the same. However, the sleeping time kept increasing. With four processes, the sleeping time takes 60% of my total running time. So this means that it is impossible to get scalability from the program by running on a single machine, as the other professor mentioned before, right? Also, besides MPICH 3.3.2 and OpenBLAS 0.3.23, are there any additional packages required to get scalability from the program?
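A lightweight complement to the trace-based breakdown, assuming a standard StarPU build, is to have StarPU print per-worker statistics at termination (variable names are from the environment-variables documentation; the command reuses the earlier one):

# enable profiling and print, for each worker, how much time was spent executing vs. idle
mpiexec -genv STARPU_PROFILING 1 -genv STARPU_WORKER_STATS 1 -n 2 -f /home/wtc/wy/mpi_config_file ./plu_example_double 8 -size 40960 -nblocks 200 -p 2 -q 1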

TommyUW commented 1 year ago

[attached image: execution and sleeping time per number of processes]

Dear Professors: Sorry to interrupt you again. We have run the mpi_lu program on two AMD machines, each with 128 cores. As shown in the picture, the execution time got shorter and shorter as the number of processes increased. However, the sleeping time kept increasing dramatically. We also tested a matrix of size 40960 and the result was still similar. What can we do to reduce the sleeping time? Thank you very much.

nfurmento commented 1 year ago

I will rephrase what was said before. If you run an MPI application with 64 processes on 2 machines, performance will not improve. For each MPI process, StarPU uses all the cores of a single machine. Additionally, running MPI will only introduce communication overhead and core oversubscription (you see it with the lower and lower CPU % usage: of course they have to share), which will make the whole thing very slow due to the spin locks. You could try to lower the number of CPU cores used by each MPI process, and make sure the binding is done properly. Look at https://files.inria.fr/starpu/doc/html/MPISupport.html and https://files.inria.fr/starpu/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html#Basic
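A sketch of what that could look like on the two 128-core machines; the core count, process count, and the hydra binding option are purely illustrative, so adjust them to your topology and keep -p times -q equal to the number of processes:

# 2 processes per machine, each StarPU instance limited to 64 CPU workers and bound to its own socket
mpiexec -n 4 -ppn 2 -bind-to socket -genv STARPU_NCPU 64 -f /home/wtc/wy/mpi_config_file ./plu_example_double 8 -size 40960 -nblocks 200 -p 2 -q 2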