Closed: nisshimura closed this issue 1 year ago.
Hello, I have a couple of questions about the project:

Training code: Would it be possible to release the training code? It would greatly help in understanding the implementation details and in reproducing the results.

Parallelization and batch size: During training, does parallelizing episodes across multiple GPUs amount to setting the batch size? I would appreciate some clarification on how parallelization and batch size are related in this project.

Thank you for your time and consideration. Looking forward to your response.

Hi there, thank you for your interest in our project. Regarding training, this code snippet illustrates the logic of one training iteration. Regarding parallelization and batch size: since the gradients computed on each GPU are synchronized, the effective batch size is the batch size on a single GPU multiplied by the number of parallel GPUs. Specifically, to train the largest model we use an effective batch size of 128, which corresponds to a local batch size of 16 across 8 GPUs.
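The training-iteration snippet referenced above is not reproduced here. Purely as a hedged illustration of the parallelization point, the sketch below assumes a standard PyTorch DistributedDataParallel (DDP) setup; the model, data, and script name are placeholders, not the project's actual code. With DDP, gradients are averaged across ranks at each optimizer step, so the effective batch size is the per-GPU batch size times the number of GPUs (e.g., 16 × 8 = 128).

```python
# Hedged sketch of a generic DDP training loop, NOT the project's actual code.
# Model, dataset, and hyperparameters are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process (one per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; the real project builds its own here.
    model = torch.nn.Linear(32, 4).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 4, (1024,)))

    # Local (per-GPU) batch size of 16; with 8 GPUs the effective batch size
    # is 16 * world_size = 128, because DDP averages gradients across ranks.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(1):
        sampler.set_epoch(epoch)  # reshuffle shards consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are all-reduced (averaged) across GPUs here
            optimizer.step()  # every rank applies the same synchronized update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would be launched with, for example, `torchrun --nproc_per_node=8 train.py` (the filename is hypothetical), which spawns one process per GPU and gives the 16 × 8 = 128 effective batch size described above.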