The training is based on the transformers training routine (Trainer).
It runs on an A100-80G machine, but the per-GPU batch size can be set to at most 2, and memory usage is extremely unbalanced across the cards, e.g., 60 GB+ on card 0 versus 30 GB+ on the other cards.
In addition, are there recommended training parameters? With the current training setup, the loss value is very large and decreases only slowly.
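For reference, here is a minimal sketch of the kind of Trainer configuration described above; the output directory and hyperparameter values are placeholders, not the exact settings used:

```python
# Sketch of the assumed Trainer setup (values are placeholders, not the actual config).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",             # placeholder output path
    per_device_train_batch_size=2,      # the largest size that currently fits per A100-80G
    gradient_accumulation_steps=8,      # raises the effective batch size without extra memory
    gradient_checkpointing=True,        # trades compute for activation memory
    bf16=True,                          # mixed precision on A100
    learning_rate=2e-5,                 # placeholder value
    logging_steps=10,
)
# These arguments would then be passed to transformers.Trainer together with
# the model and training dataset before calling trainer.train().
```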