Hi @ArrowLuo ,
Thanks for the great work. I read in your paper that the second pre-training stage takes around 12 days on 8 GPUs. I am wondering whether you tried multi-machine distributed training to accelerate it. Is your code base compatible with that? Thanks in advance.