ArnaudFickinger opened 1 week ago

I noticed that coordinator.block_all(), torch.set_num_threads(1), and dist.barrier() were added to the training script. Were they added for debugging purposes only, or are they useful for training?

---

Actually, they are useful for training when you train the model on a large-scale distributed system. We place them at the appropriate points to make distributed training more stable. If you are training at a small scale, or pre-training on a very robust distributed system, you can try removing them. But these calls introduce negligible overhead.
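For context, here is a minimal sketch of where such calls typically sit in a PyTorch DDP training loop. This is not the repo's actual script: the model, optimizer, and checkpoint path are placeholder assumptions, and plain dist.barrier() stands in for coordinator.block_all() (which, if the coordinator is ColossalAI's DistCoordinator, is essentially a wrapped barrier).

```python
# Minimal sketch, not the repo's actual training script. Model, optimizer,
# and checkpoint path are placeholders; dist.barrier() stands in for
# coordinator.block_all().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Limit intra-op CPU threads so many ranks on one node don't
    # oversubscribe cores; a common stability measure at scale.
    torch.set_num_threads(1)

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 128).cuda())
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Barrier before training: every rank finishes setup (data prep,
    # checkpoint load, ...) before the first step.
    dist.barrier()

    for step in range(10):
        x = torch.randn(32, 128, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Barriers around checkpointing: all ranks reach the step, then
        # rank 0 writes while the others wait instead of racing ahead.
        if step % 5 == 0:
            dist.barrier()
            if rank == 0:
                torch.save(model.state_dict(), "ckpt.pt")
            dist.barrier()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched the usual way, e.g. `torchrun --nproc_per_node=8 train_sketch.py`. On a single GPU the barriers are no-ops in effect, which is why removing them at small scale is safe but keeping them costs almost nothing.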