horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
http://horovod.ai

Can I launch a Horovod training process with proc = subprocess.Popen(command, shell=True, cwd=cwd)? #4017

Open bit-pku-zdf opened 8 months ago

bit-pku-zdf commented 8 months ago

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.15
  3. Horovod version: 0.23.0
  4. MPI version:
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.8.10
  8. Spark / PySpark version: NO
  9. Ray version: NO
  10. OS and version: NO
  11. GCC version: NO
  12. CMake version: NO

For historical reasons, we need to launch our Horovod + TensorFlow training process through subprocess.Popen, for example:

subprocess.Popen("horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py", shell=True, cwd=cwd)

Can I do this? Are there any known problems with launching horovodrun this way? Thank you.
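
For concreteness, here is a minimal sketch of the launch pattern in question. It assumes horovodrun is on the PATH of the parent process and only reuses the -np and -H flags from the command above. The helper names build_horovodrun_command and launch are hypothetical, and passing an argument list instead of shell=True is a swapped-in choice, not something Horovod requires.

```python
import subprocess
import sys


def build_horovodrun_command(num_proc, hosts, script):
    """Build the horovodrun argument list (hypothetical helper).

    hosts maps host name to slot count, e.g. {"server1": 4},
    and is rendered as "server1:4,server2:4,..." for the -H flag.
    """
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["horovodrun", "-np", str(num_proc), "-H", host_spec, "python", script]


def launch(command, cwd=None):
    """Start the command, forward its output, and return its exit code.

    Hypothetical helper. An argument list is used instead of shell=True,
    so no shell quoting is involved; stderr is merged into stdout.
    """
    proc = subprocess.Popen(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    # Forward the child's output line by line so training logs stay visible.
    for line in proc.stdout:
        sys.stdout.write(line)
    # wait() returns the child's exit code; non-zero indicates a failure.
    return proc.wait()


# Example usage (assumes horovodrun and the four hosts from the question exist):
#   hosts = {"server1": 4, "server2": 4, "server3": 4, "server4": 4}
#   exit_code = launch(build_horovodrun_command(16, hosts, "train.py"), cwd=cwd)
```

Checking the value returned by launch makes it possible for the calling process to tell whether the training job failed, which a bare Popen call without wait() would not report.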