I want to retrain the model on my own datasets. I see you trained with four A100 GPUs. Could you tell me the minimum hardware required for training (e.g., would a single A100 GPU work)?
IIRC, two A100s would also work, but not one A100. The main training stage uses significantly more memory, which would be the bottleneck.
What would be the most effective way to reduce the memory requirement for training: a lower batch size, or perhaps a reduced seq_length? Do you think it's feasible to train on a single A100, or would the result be significantly worse, to the point that it's not worth it?
I cannot attest to which one is the "best". You might also try gradient accumulation.
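For reference, here is a minimal sketch of gradient accumulation in a generic PyTorch loop. The model, data, and hyperparameters are placeholders, not this repo's actual training code: the idea is to run backward on several small micro-batches before each optimizer step, so the effective batch size stays the same while peak activation memory drops.

```python
# Hypothetical gradient-accumulation sketch (generic PyTorch, placeholder
# model/data). Four micro-batches are accumulated per optimizer step, so
# effective batch size = micro_batch * accumulation_steps.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                       # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accumulation_steps = 4                           # micro-batches per optimizer step
micro_batch = 8                                  # e.g., batch_size=32 split into 4

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, 128)            # dummy data stands in for a real loader
    y = torch.randint(0, 10, (micro_batch,))
    loss = criterion(model(x), y) / accumulation_steps  # scale so grads average correctly
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Combined with a smaller micro-batch, this is usually the first thing to try before cutting seq_length, since it trades compute time for memory without changing the effective batch size.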
pakcheera closed this issue 3 months ago.