Hi, thank you so much for releasing this great code base!
I noticed that your LAION blog post says the pre-training of OpenLM 1B/7B took place on 128 or 256 A100s, so I'm wondering whether the current code supports multi-node training. The current training command seems to use only 4 GPUs on a single node.
Yes, OpenLM supports multi-node training. The standard torchrun multi-node setup should work fine. If you are using something like AWS SageMaker, we also have sample code here.
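For reference, a minimal multi-node launch with torchrun could look like the sketch below. The 2-node x 8-GPU layout, the `MASTER_ADDR` placeholder, and the flags after `open_lm.main` are illustrative assumptions, not the exact recipe used for the 1B/7B runs; substitute your own node count and the training arguments from the single-node command.

```bash
# Run this on every node; MASTER_ADDR is the hostname or IP of the rank-0 node.
# 2 nodes x 8 GPUs is just an example layout.
export MASTER_ADDR=node-0.example.internal   # placeholder hostname

torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:29500" \
  -m open_lm.main \
  --model open_lm_1b \
  --train-data "path/to/shards/{00000..00999}.tar"
  # ...plus the rest of the training args from the single-node command
```

Each node launches 8 local workers that rendezvous through the c10d backend at the master address; this is the same elastic-launch mechanism as the single-node command, just with `--nnodes` greater than 1.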
Thank you very much!