pangeo-data / benchmarking

Benchmarking & Scaling Studies of the Pangeo Platform
https://binder.pangeo.io/v2/gh/pangeo-data/benchmarking/master
Apache License 2.0

Would a Pangeo application on an InfiniBand-based cluster speed up with an RDMA-optimised communication library? #43

Open tinaok opened 3 years ago

tinaok commented 3 years ago

A basic installation of Pangeo on an InfiniBand cluster uses TCP/IP communication, so it does not benefit from the interconnect's 'real' high-speed, high-bandwidth communication. Using RDMA connections between Dask clients running on an InfiniBand-based cluster should speed up their communication. There are benchmarks on InfiniBand clusters with GPUs using UCX-Py or MPI4Dask (https://blog.dask.org/2019/06/09/ucx-dgx, https://www.hpcadvisorycouncil.com/events/2020/australia-conference/pdf/HighPerfDeepMachineLearnonHPCSyst_010920_DKPanda.pdf, slides 46-47, http://hibd.cse.ohio-state.edu/features/#mpi4dask). Our Pangeo benchmark is CPU-based, and the results in our repo come from InfiniBand-based HPC clusters. Pangeo benchmarks of communication-bound operations (such as rechunking) may see a speedup.
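To make the comparison concrete, here is a minimal sketch of how a TCP run and a UCX run of a communication-bound rechunk could be timed. It is not taken from this repository's benchmark code. The helper names (`scheduler_address`, `timed`, `rechunk_seconds`), the host name `node001`, and the array sizes are all hypothetical. It also assumes that UCX-Py is installed and that the scheduler and workers were started with the matching protocol (for example `--protocol ucx`).

```python
import time


def scheduler_address(host, port=8786, protocol="tcp"):
    """Build a Dask scheduler address such as 'ucx://node001:8786'."""
    if protocol not in ("tcp", "ucx"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return f"{protocol}://{host}:{port}"


def timed(fn):
    """Run fn() once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def rechunk_seconds(address, shape=(20000, 20000), block=1000):
    """Time a column-to-row rechunk on an already running cluster.

    Assumes dask and distributed are installed, and that the scheduler and
    workers behind `address` were started with the matching protocol.
    """
    # Imported lazily so the helpers above work without Dask installed.
    import dask.array as da
    from distributed import Client, wait

    with Client(address):
        # Column-shaped chunks, held in worker memory before timing starts.
        x = da.random.random(shape, chunks=(shape[0], block)).persist()
        wait(x)
        # Row-shaped target chunks: every output chunk needs a piece of
        # every input chunk, so chunk pieces move between workers.
        return timed(lambda: wait(x.rechunk((block, shape[1])).persist()))
```

One would call `rechunk_seconds(scheduler_address("node001", protocol="tcp"))` against a TCP cluster and the same with `protocol="ucx"` against a UCX cluster, then compare the two timings. Whether UCX actually uses the InfiniBand transport depends on how UCX is built and configured on the system.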

kmpaul commented 3 years ago

This is great, @tinaok! Thanks for the ping.

By the way, my colleague (@halehawk) is working on some stuff in a fork of this repository (https://github.com/NCAR/benchmarking) and is planning on doing a merge at some point in the future. One of the things @halehawk is working on is a platform service to hold all benchmarking results/plots submitted by other people using the same benchmarking utility. She's done some things to address other issues in this repo, too.

Anyway, I am hopeful that after the merge, we can collaborate on this and maybe get some benchmarking measurements with Dask+Infiniband!

halehawk commented 3 years ago

@tinaok @kmpaul, it is a good idea. I am just wondering whether Dask works with an RDMA-optimised communication library or not. If not, how much effort would it take to make it available?


kmpaul commented 3 years ago

@halehawk: Yes. It sounds like (@tinaok, correct me if I'm wrong) the new Dask+Infiniband work will use RDMA optimization, which could be a huge benefit!