mlfoundations / open_lm

A repository for research on medium-sized language models.

Multi-node training #305

LeoXinhaoLee commented 2 weeks ago

Hi, thank you so much for releasing this great code base!

I noticed that your LAION blog post says the pre-training of OpenLM 1B/7B was done on 128 or 256 A100s, so I'm wondering whether the current code supports multi-node training. The current training command seems to use only 4 GPUs on a single node.

Thank you very much!

sedrick-keh-tri commented 2 weeks ago

Yes, OpenLM supports multi-node training. The standard torchrun multi-node setup should work fine. If you are using something like AWS SageMaker, we also have sample code here.
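For reference, a minimal sketch of a two-node torchrun launch, assuming the `-m open_lm.main` entry point from the single-node command; the node count, GPU count per node, rendezvous address, and training flags below are placeholders you would replace with the arguments from your existing 4-GPU command:

```bash
# Run this on every node, setting NODE_RANK to 0 on the first node and 1 on the second.
# MASTER_ADDR / MASTER_PORT must point at a port reachable on the rank-0 node.
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --node_rank "$NODE_RANK" \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
  -m open_lm.main \
  --your-usual-training-args   # placeholder: reuse the flags from your single-node run
```

The same command is executed on each node; torchrun handles rendezvous and assigns global ranks across nodes.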