Closed XYaoooo closed 1 month ago
Hi, you should use gradient accumulation to reduce the per-device batch size while keeping the same effective training batch size. Please refer to this script for more details: https://github.com/princeton-nlp/SimPO/blob/main/training_configs/gemma-2-9b-it-simpo.yaml
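To make the arithmetic concrete, here is a minimal sketch of how gradient accumulation preserves the effective batch size. The 2 × 8 split below is just one illustrative choice that fits 8 GPUs, not necessarily the repo's actual config:

```python
# Effective batch size = per-device batch * number of GPUs * accumulation steps.
def effective_batch_size(per_device: int, num_gpus: int, accum_steps: int) -> int:
    return per_device * num_gpus * accum_steps

# On 8 A100s, dropping the per-device batch from 16 to 2 and accumulating
# gradients over 8 micro-steps keeps the paper's effective batch size of 128,
# while each forward/backward pass only holds 2 samples per GPU in memory.
assert effective_batch_size(per_device=2, num_gpus=8, accum_steps=8) == 128
```

In a trainer config this typically corresponds to lowering `per_device_train_batch_size` and raising `gradient_accumulation_steps` by the same factor.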
Got it. Thanks for your advice.
Hi, could I know your hyperparameters when training with DPO: batch size, beta, and learning rate? I train on 8 A100s with a per-device batch size of 16 (to match the batch size of 128 stated in the paper), but it runs out of memory. Best