Open SmudgedWings opened 2 weeks ago
Following the training tutorial in the README without modifying any code or parameters, I trained the model on a machine with 8×H800 80 GB GPUs. Each epoch takes about 55 minutes, and total_epochs defaults to 1000 in the code, which works out to roughly 917 hours (55 min × 1000 epochs) rather than the 10-hour training time mentioned in the paper. I'm not sure what went wrong. Could you please provide some guidance?

Hi there. Total training time depends heavily on your infrastructure. We trained with a batch size of 16 on 4×V100 32 GB GPUs, pulling data directly from a fast-access storage system, and many other details can affect the wall-clock time. For reproducibility, I believe it is more useful to match the total number of steps, even if that takes longer. All my checkpoints are at around 100k steps (102,400 to be more specific). Consider reaching that number of steps.
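As a rough sketch of what "match the total number of steps" means in practice (the function name and the 50,000-sample dataset size below are illustrative assumptions, not values from this repo), you can convert the step target into an epoch count like this:

```python
import math

def epochs_for_target_steps(target_steps: int, dataset_size: int, batch_size: int) -> int:
    """Estimate how many epochs are needed to accumulate `target_steps` optimizer steps."""
    # Assumes one optimizer step per batch and that partial final batches are kept.
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return math.ceil(target_steps / steps_per_epoch)

# Illustrative numbers only: with a 50,000-sample dataset and batch size 16
# (the batch size mentioned above), steps_per_epoch = 3,125, so reaching the
# ~102,400-step target takes about 33 epochs, not 1000.
print(epochs_for_target_steps(102_400, 50_000, 16))  # -> 33
```

The point is that total_epochs is only a proxy: the same step budget corresponds to very different epoch counts depending on your dataset size and effective batch size, so stopping at the step target rather than the default epoch count should reproduce the reported checkpoints.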