Closed afernandezody closed 3 years ago
@afernandezody I guess the transports in use are TCP and shared memory? Can you provide one example of a benchmark and core count where you see the largest difference vs. the BTL components (command line, number of nodes, and number of cores per node)? Thank you!
The standard configuration has tcp, openib, vader, and self as transports. There's no significant loss in performance at 72 cores, but the same can't be said for 288 cores (this configuration corresponds to 6 AWS c5.18xlarge instances; each instance is a dual-socket Intel Xeon Platinum 8124). I have measured performance with several tools, including the NAS benchmarks. For class D, some results (time to complete the computation) are:
BT (standard cluster) with 256 MPI ranks (BT requires a square number of ranks) - 98.25 s
BT (ucx cluster) with 256 MPI ranks - 120.51 s
SP (standard cluster) with 256 MPI ranks - 131.51 s
SP (ucx cluster) with 256 MPI ranks - 148.40 s
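For context, the relative slowdown implied by the times above can be worked out with a quick awk one-liner (the `slowdown` helper is a throwaway defined here for illustration, not part of any tool in this thread):

```shell
# Percent slowdown of the UCX runs vs. the standard (BTL) runs quoted above.
# "slowdown" is an illustrative helper, not an existing command.
slowdown() { awk -v std="$1" -v ucx="$2" 'BEGIN { printf "%.1f%%\n", (ucx - std) / std * 100 }'; }
slowdown 98.25 120.51    # BT class D, 256 ranks  → 22.7%
slowdown 131.51 148.40   # SP class D, 256 ranks  → 12.8%
```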
The jobs can be submitted as

```shell
mpirun -np 256 -mca pml ucx ./executable
```

or

```shell
mpirun -np 256 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=all ./executable
```
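Since BT and SP only run on a square number of ranks, a small sanity check before picking `-np` could look like this (the check itself is a sketch, not part of NPB):

```shell
# Sketch: verify the chosen rank count is a perfect square, as NPB BT/SP require.
np=256
root=$(awk -v n="$np" 'BEGIN { printf "%d", sqrt(n) }')
if [ $((root * root)) -eq "$np" ]; then
  echo "ok: $np = ${root}^2"          # prints "ok: 256 = 16^2"
else
  echo "error: $np is not a perfect square" >&2
fi
```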
Thanks, it seems mostly related to TCP performance then. Adding @dmitrygx.
@afernandezody do you have those benchmarks available for free usage? [upd] Sorry, I misread; I see now that this is the NAS benchmark.
i guess it's https://www.nas.nasa.gov/publications/npb.html
This is NPB https://www.nas.nasa.gov/publications/npb.html
@yosefe @shamisp thank you!
Sorry, I should have been a bit more specific (my background is CFD and I sometimes forget that many people in the GitHub community have never used this type of software). The previous results were measured with NPB v3.4. NASA has several codes that are restricted to US citizens/institutions, but NPB is available to almost everyone (I'm not sure if there is any restriction for some very specific countries).
Hello, I'm combining UCX 1.6.0 with Open MPI (4.0.1) in clusters built from AWS instances. I have no issue compiling and building everything, and the jobs run fine, but the performance is just not there. Small clusters (72 cores) exhibit the expected performance, but increasing the core count to 288 leads to a performance deterioration of around 20% versus standard clusters, which use BTL components. The loss in performance is consistent across different benchmarks, and the usual flags (e.g. --mca btl ^vader,tcp,openib,uct) make no difference. I was wondering if anyone has noticed similar behavior in a cloud environment, or what its root cause could be. Thanks.
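One way to narrow down which transports UCX actually selects in a setup like this is to pin them explicitly and raise the log level. `UCX_TLS` and `UCX_LOG_LEVEL` are standard UCX environment variables, but the specific values below are only illustrative for a TCP/shared-memory environment, not a recommendation from this thread:

```shell
# Illustrative UCX knobs for debugging transport selection.
# UCX_TLS restricts which transports UCX may use (tcp, sm, self here);
# UCX_LOG_LEVEL=info makes UCX report what it actually picked at startup.
export UCX_TLS=tcp,sm,self
export UCX_LOG_LEVEL=info
env | grep '^UCX_' | sort
```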