Can you estimate the training time on your device?
By the way, if you have A100 GPUs with 80GB memory, training will be a little faster if you set max_tokens=8192 and update_freq=1.
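For concreteness, a minimal sketch of where those two options would go in a fairseq-train call (this is not the actual wmt14_ende.sh; the data path, task, architecture, and optimizer settings below are placeholders, and only --max-tokens and --update-freq are the point):

```bash
#!/usr/bin/env bash
# Sketch only: where --max-tokens / --update-freq fit on 80GB A100s.
# DATA_BIN, the task, and the architecture are stand-ins for whatever
# examples/DA-Transformer/wmt14_ende.sh actually passes to fairseq-train.
DATA_BIN=data-bin/wmt14_ende   # hypothetical binarized-data directory

fairseq-train "${DATA_BIN}" \
    --task translation \
    --arch transformer \
    --optimizer adam --lr 5e-4 --fp16 \
    --max-tokens 8192 \
    --update-freq 1 \
    --max-update 300000
```

With 8192 tokens per GPU, update_freq=1 means no gradient accumulation, so each optimizer step is a single forward/backward pass and the wall-clock time per update drops accordingly.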
It takes around 1s to finish 1 update, so the full 300,000 updates would take around 83 hours in total.
According to our paper, the training process takes around 32 hours using 16xV100 GPUs. I have tried running the code on my 8xA100 server, and it typically completes 2 to 3 updates per second. This means that 300k updates would take approximately 30 to 40 hours.
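As a rough sanity check of these estimates (assuming the 300,000-update schedule discussed in this thread):

```bash
# Back-of-the-envelope training-time estimates for 300,000 updates
# (integer hours; 3600 seconds per hour).
echo "at 1 update/s:  $((300000 / 1 / 3600)) h"   # ~83 h
echo "at 2 updates/s: $((300000 / 2 / 3600)) h"   # ~41 h
echo "at 3 updates/s: $((300000 / 3 / 3600)) h"   # ~27 h
```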
I'm not sure why your server is running slowly. To investigate the issue, you can check the GPU utilization (GPU-Util) to determine whether the slow performance is due to GPU computation or other factors such as CPU or I/O.
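For example (assuming nvidia-smi is available on the server), you can poll GPU utilization once per second while training is running:

```bash
# Log per-GPU utilization and memory use every second during training.
# Consistently low GPU-Util usually points to a CPU/dataloading or I/O
# bottleneck rather than slow GPU computation.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```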
Please ensure the following:
I wasn't using lightseq before. Training is much faster after I switched to lightseq.
To replicate and build upon your results, I need a clear understanding of the training configuration used in your experiments. Is
examples/DA-Transformer/wmt14_ende.sh
the config used to get the results in your paper? I found it impossible to finish 300,000 updates within 16 hours on 8xA100 using that config.