openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Loss in performance as core count increases #4102

Closed: afernandezody closed this issue 3 years ago

afernandezody commented 5 years ago

Hello, I'm combining UCX 1.6.0 with OpenMPI (4.0.1) on clusters built from AWS instances. I have no issues compiling and building everything, and the jobs run fine, but the performance just isn't there. Small clusters (72 cores) show the expected performance, but increasing the core count to 288 leads to a deterioration of around 20% versus standard clusters, which use the BTL components. The loss in performance is fairly consistent across different benchmarks, and I have tried the usual flags (e.g. --mca btl ^vader,tcp,openib,uct), which make no difference. I was wondering whether anyone has seen similar behavior in a cloud environment, or what the root cause could be. Thanks.
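
For reference, a rough sketch of the two launch setups being compared, assuming the "standard" clusters use OpenMPI's default ob1 PML over the BTL components; the executable name and rank count are placeholders:

```sh
# "Standard" cluster: default ob1 PML with the BTL transports
mpirun -np 256 --mca pml ob1 --mca btl vader,tcp,self ./executable

# UCX cluster: force the UCX PML and exclude the BTL components
mpirun -np 256 --mca pml ucx --mca btl ^vader,tcp,openib,uct ./executable
```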

yosefe commented 5 years ago

@afernandezody I guess the transports in use are TCP and shared memory? Can you provide one example of a benchmark and core count where you see the largest difference vs the BTL components (command line, number of nodes, and number of cores per node)? Thank you!
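
(One way to check is sketched below; ucx_info and UCX_LOG_LEVEL are standard UCX tools/variables, but the exact log output varies by UCX version, and the executable name is a placeholder.)

```sh
# List the transports and devices UCX detects on a node
ucx_info -d

# Show UCX's effective configuration (UCX_TLS, UCX_NET_DEVICES, ...)
ucx_info -c

# Run with verbose UCX logging so the selected transports show up in the output
mpirun -np 256 --mca pml ucx -x UCX_LOG_LEVEL=info ./executable
```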

afernandezody commented 5 years ago

The standard configuration uses tcp, openib, vader, and self as transports. There is no significant loss in performance at 72 cores, but the same can't be said for 288 cores (this configuration corresponds to 6 AWS c5.18xlarge instances, each a dual-socket Intel Xeon Platinum 8124). I have measured performance with several tools, including the NAS benchmarks. For class D, some results (time to complete the computations) are:

BT (standard cluster), 256 MPI ranks (BT needs a square number of ranks): 98.25 s
BT (UCX cluster), 256 MPI ranks: 120.51 s
SP (standard cluster), 256 MPI ranks: 131.51 s
SP (UCX cluster), 256 MPI ranks: 148.40 s

The jobs are submitted as either

mpirun -np 256 -mca pml ucx ./executable

or

mpirun -np 256 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=all ./executable
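
For completeness, a variant that pins UCX to explicit transports and a single network interface could look like the sketch below; the interface name and binary name are just examples:

```sh
# Restrict UCX to TCP + shared memory + self, and pin TCP to one interface
# (the interface name may differ on AWS c5 instances)
mpirun -np 256 --mca pml ucx \
    -x UCX_TLS=tcp,sm,self \
    -x UCX_NET_DEVICES=eth0 \
    ./bt.D.x
```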

yosefe commented 5 years ago

Thanks, it seems mostly related to TCP performance then. Adding @dmitrygx.

dmitrygx commented 5 years ago

@afernandezody are those benchmarks freely available? [upd] Sorry for the inattentive reading; I see now that this is the NAS benchmark.

yosefe commented 5 years ago

i guess it's https://www.nas.nasa.gov/publications/npb.html

shamisp commented 5 years ago

This is NPB https://www.nas.nasa.gov/publications/npb.html

dmitrygx commented 5 years ago

> i guess it's https://www.nas.nasa.gov/publications/npb.html

> This is NPB https://www.nas.nasa.gov/publications/npb.html

@yosefe @shamisp thank you!

afernandezody commented 5 years ago

Sorry, I should have been a bit more specific (my background is CFD and I sometimes forget that many people in the GitHub community have never used this type of software). The previous results were measured with NPB v3.4. NASA has several codes that are restricted to US citizens/institutions, but the NAS Parallel Benchmarks are available to almost everyone (I'm not sure whether there are restrictions for a few specific countries).
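
For anyone trying to reproduce the numbers, a rough sketch of building and running the NPB MPI suite; the make syntax shown is from the 3.4 series (older releases also require NPROCS= at compile time and name the binaries accordingly), so treat it as an approximation:

```sh
# Inside the NPB3.4-MPI directory, after setting up config/make.def
# to point at the MPI wrappers (mpif90/mpicc)
make bt CLASS=D
make sp CLASS=D

# BT and SP need a square number of MPI ranks (e.g. 256 = 16^2)
mpirun -np 256 --mca pml ucx ./bin/bt.D.x
```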