tensorflow / mesh

Mesh TensorFlow: Model Parallelism Made Easier
Apache License 2.0

Mesh tensorflow support for multi-node #201

Open assij opened 3 years ago

assij commented 3 years ago

Hi, does Mesh TensorFlow support multi-node training (i.e., each node has some number of GPUs attached to it)? I'm using 2 nodes, each with 8 GPUs, and would like to train on all 2 × 8 = 16 GPUs. How do I configure Mesh TensorFlow to train in a multi-node setup?

Thanks

1106944911 commented 3 years ago

I have the same question.

dkajtoch commented 3 years ago

The same question! What is the status of that feature? In the paper https://arxiv.org/pdf/1811.02084.pdf you mention "Implementation of SPMD programming on CPU/GPU clusters" (Future Work). Is the project dead? @adarob @dustinvtran ?

adarob commented 3 years ago

@nshazeer to comment.

While this project is not dead, I would not expect to see significant new features added. There are some TF2- and JAX-based libraries inspired by Mesh TensorFlow under development that will have this functionality. They will also be more "production-ready", i.e., better supported and documented :)

https://github.com/tensorflow/lingvo may also support this now.

zaccharieramzi commented 3 years ago

@adarob do you have a ballpark release date for these libraries?

adarob commented 3 years ago

Early 2021 for JAX.