jrs65 / scalapy

A python wrapper around ScaLAPACK

Strange parallel performance of scalapack routines #25

Open xhzheng opened 8 years ago

xhzheng commented 8 years ago

I am testing some ScaLAPACK routines with scalapy, and I find that the parallel runs with scalapy take much, much longer than simply using scipy. Take pdsyev as an example: I have done a scaling test with different numbers of processes, and part of the results are below. The first column is the number of processes, the second the matrix size, the third the time (in seconds) taken by the scalapy.lowlevel.pdsyev routine, and the last the time taken by scipy.linalg.eigh on a single process:

4 3000 415.0334515571594 72.1984703540802

8 3000 774.7427942752838 91.13431119918823

16 3000 2001.8645451068878 131.86768698692322

32 3000 9216.71579861641 173.42458987236023

64 3000 9288.961811304092 173.8198528289795

There are three strange points:

  1. the parallel scalapy runs take much longer than scipy on a single process.
  2. the time increases sharply with the number of processes.
  3. even the single-process scipy.linalg.eigh time increases greatly with the number of processes in the job. By the way, these single-process calculations are performed inside the same parallel run; we simply let rank 0 do the computation, so the calculation time seems to be strongly affected by the total number of processes in the job. However, if I call scipy.linalg.eigh in a separate script, with no mpi4py and no scalapy, it takes only 11.6355 seconds, much less than the values in the last column above.

I have run this test on two different supercomputers and reached the same conclusion. Has anyone else seen this, and does anyone know why? Please find the code of my test in the attachment.

pdsyev.txt
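For reference, a minimal sketch of the kind of comparison described above (this is not the attached test code; it uses scalapy's high-level routines.eigh wrapper rather than the low-level pdsyev call, and the exact scalapy call signatures here are assumptions based on the library's documented interface):

```python
# Sketch of the timing comparison only -- not the attached pdsyev.txt.
# The scalapy calls (core.initmpi, DistributedMatrix.from_global_array,
# routines.eigh) are assumed from the library's documented interface.
import time

import numpy as np
import scipy.linalg as la
from mpi4py import MPI
from scalapy import core
import scalapy.routines as rt

comm = MPI.COMM_WORLD
n = 3000

# Process grid as in the original test: [2, nproc/2] (needs an even nproc >= 2).
core.initmpi([2, comm.size // 2], block_shape=[64, 64])

# Every rank builds the same symmetric test matrix (seeded so they all agree).
np.random.seed(0)
A = np.random.standard_normal((n, n))
A = np.asfortranarray(A + A.T)

# Distributed eigensolve, timed across all ranks.
dA = core.DistributedMatrix.from_global_array(A)
comm.Barrier()
t0 = time.time()
evals, dZ = rt.eigh(dA)
comm.Barrier()
t_scalapy = time.time() - t0

# Single-process scipy baseline on rank 0 only, as in point 3 above.
if comm.rank == 0:
    t0 = time.time()
    evals_serial, evecs_serial = la.eigh(A)
    t_scipy = time.time() - t0
    print(comm.size, n, t_scalapy, t_scipy)
```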

jrs65 commented 8 years ago

Thanks for reporting this. I've done scaling tests before, and they've always seemed fine (scaled up to > 300 nodes), so I guess we need to figure out where the difference is coming from.

How exactly are you running the test? i.e. how many processes? how many nodes? how many threads?

This kind of behaviour often shows up when MPI places the processes incorrectly, for example putting them all on one node rather than spreading them out. It would also be helpful if you could send the MPI command line you used to run the test; then I can see if I can reproduce it.

Another point I noticed in your test code: you've set your process grid to [2, nproc/2]. ScaLAPACK is more efficient with nearly square process grids, so you could try changing that.
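For example, with 16 processes a 4 x 4 grid is square; something like the following sketch would pick the closest-to-square factorisation for any process count (the initmpi signature here is assumed from scalapy's README, so adapt as needed):

```python
# Sketch: choose a nearly square process grid instead of [2, nproc/2].
# Assumes scalapy.core.initmpi(gridshape, block_shape=...) as in the README.
import math

from mpi4py import MPI
from scalapy import core

nproc = MPI.COMM_WORLD.size

# Largest factor of nproc not exceeding sqrt(nproc), e.g. 16 -> (4, 4), 32 -> (4, 8).
rows = int(math.sqrt(nproc))
while nproc % rows != 0:
    rows -= 1
cols = nproc // rows

core.initmpi([rows, cols], block_shape=[64, 64])
```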

xhzheng commented 8 years ago

Thank you very much for your reply.

Since I only tested with 1-64 processes, only 1-6 nodes were used. The environment for the tests above is as follows:

- python: anaconda3 + mpi4py + scalapy
- MPI: MPI/Intel/IMPI/5.0.2.044
- scalapack: compiled by myself based on openblas from anaconda3.

It was tested on a supercomputer. At the beginning I always got the following error messages:

cn9394:UCM:7376:4f820460: 1870 us(17 us): open_hca: ibv_get_device_list() failed

DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf

DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf

...

librdmacm: Fatal: unable to get RDMA device list

librdmacm: Warning: couldn't read ABI version.

...

cn10374:CMA:516a:7efc2aa0: 337 us(337 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?

...

I believe this is related to RDMA, so I added the parameter `-env I_MPI_DEVICE sock` to the mpi command to avoid using RDMA. The command is `mpirun -n 16 -env I_MPI_DEVICE sock python p.py > log.txt`. This successfully suppresses the error messages above, but the timings are then extreme, as reported.

By the way, for different numbers of processes I used different input files, so that the process grids are as follows:

- 4 and 8 processes: 2 x size/2
- 16 processes: 4 x 4
- 32 and 64 processes: 8 x size/8

xhzheng commented 8 years ago

Adding `-env I_MPI_DEVICE sock` to avoid the error messages may itself cause problems, so I have also done tests on my own cluster with: Anaconda3-2.4.1 + openmpi-1.10.1 + mpi4py-2.0.0 + scalapack-2.0.2 + scalapy. libscalapack.so is compiled against the libopenblas.so in anaconda. I used 1-4 nodes, each of which has 16 CPU cores. The following are my mpi commands:

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 1 python p0.py > log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 2 python p.py >> log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 4 python p.py >> log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 8 python p.py >> log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 16 python p2.py >> log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 32 python p3.py >> log.txt

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 64 python p3.py >> log.txt

The host file contains 4 nodes: gu01, gu02, gu03, gu04
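A quick way to confirm that `--map-by node` really spread the ranks across gu01-gu04 is to print the host name seen by each rank with a small mpi4py-only script (a sketch; check_placement.py is just a hypothetical helper, not part of the test code):

```python
# check_placement.py -- hypothetical helper: report which node each rank is on.
from mpi4py import MPI

comm = MPI.COMM_WORLD
name = MPI.Get_processor_name()

# Take turns printing so the output comes out in rank order.
for r in range(comm.size):
    if comm.rank == r:
        print("rank", comm.rank, "on", name, flush=True)
    comm.Barrier()
```

Running it with the same mpirun options as above should show each of the four nodes roughly the same number of times.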

The results are as follows:

1 3000 83.80727100372314 6.165726900100708

2 3000 49.89164924621582 6.072669506072998

4 3000 18.85034155845642 6.205003976821899

8 3000 125.45469236373901 5.966212272644043

16 3000 443.9763329029083 6.1583099365234375

32 3000 ......

From this test we find that, for the single-process scipy calculations, the time is almost constant regardless of the number of processes, which is fine. However, the parallel calculations still always take much, much longer than scipy. For the 32-process case I have waited two and a half hours, about 9000 seconds, and the calculation still has not finished. When I use "top" to check the status on one of the nodes, I get:

KiB Mem: 65747808 total, 10933660 used, 54814148 free, 279452 buffers

KiB Swap: 65534972 total, 0 used, 65534972 free. 8637384 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18599 xhzheng 20 0 1347228 172160 14116 R 100.4 0.3 156:36.62 python
18600 xhzheng 20 0 1319892 155928 14032 R 100.1 0.2 155:50.15 python
18602 xhzheng 20 0 1319764 146100 14376 R 100.1 0.2 155:54.08 python
18603 xhzheng 20 0 1319756 141784 14136 R 100.1 0.2 156:41.10 python
18605 xhzheng 20 0 1319428 142196 14392 R 100.1 0.2 155:47.19 python
18606 xhzheng 20 0 1319420 148404 14328 R 100.1 0.2 156:01.17 python
18601 xhzheng 20 0 1319836 140328 14068 R 99.8 0.2 156:12.17 python
18604 xhzheng 20 0 1319784 152180 14384 R 99.8 0.2 156:14.56 python

This means the processes are still running at full CPU but consuming very little memory, which is a bit abnormal. It looks as if the job has entered a deadlock.
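One way to check where the hung ranks are stuck would be to register a traceback dump near the top of the test script (a sketch using only the standard library; the choice of SIGUSR1 is arbitrary):

```python
# Sketch: dump a Python traceback for every thread when the process receives
# SIGUSR1, so a hung run can be inspected with "kill -USR1 <pid>" while it keeps
# running.  This only shows Python-level frames; a hang inside ScaLAPACK/MPI
# appears as the scalapy call that never returns.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)
```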

So, do you have any idea what is going on here? Please focus on this test first.

xychen-ION commented 6 years ago

Hi! I am also using this package for parallel computing and I am running into the same situation: the parallel eigh is much more time consuming than just using a single process with numpy. Did you find a solution or a final answer?