huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Global batch size question #44

Open liutianlin0121 opened 8 months ago

liutianlin0121 commented 8 months ago

Hi!

Thanks again for the awesome repo. I have a small question regarding the global batch size of DPO training reported in the paper vs used in the code base.

In the paper, it mentions that, for DPO, "We train all models with a global batch size of 32". This is consistent with the hyperparameters of HuggingFaceH4/zephyr-7b-beta.

In the codebase, we are suggested to use 8 GPUs to reproduce zephyr-7b-beta here.

You will require 8 GPUs (80GB of VRAM) to train the full model.

Since per_device_train_batch_size=8 in recipes/zephyr-7b-beta/dpo/config_full.yaml, the global batch size is 64, not 32, when using 8 GPUs. While this differs from the paper, the global batch size of 64 is consistent with the hyperparameters of alignment-handbook/zephyr-7b-dpo-full.
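For concreteness, the arithmetic above can be sketched as follows (a minimal sketch; the gradient accumulation value of 1 is an assumption here, so please check the actual recipe config):

```python
# Effective (global) batch size in HF Trainer-style training:
#   global = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
# per_device=8 comes from recipes/zephyr-7b-beta/dpo/config_full.yaml and
# num_gpus=8 from the README's suggested setup; grad_accum=1 is an assumption.

def global_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    return per_device * num_gpus * grad_accum

print(global_batch_size(per_device=8, num_gpus=8))  # 64, the handbook setting
print(global_batch_size(per_device=4, num_gpus=8))  # 32, matching the paper
```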

My guess is that a global batch size of 32 or 64 would give similar performance, say, on MT-Bench? Could you confirm? Many thanks! I am about to launch some experiments, and I want to get the details right so as to reproduce the results from the paper as closely as possible 🙏.

timothylimyl commented 8 months ago

Yeah, I was thinking the same: it should be per_device_train_batch_size: 4 instead of 8, since the assumption is 8 GPUs here.
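In the repo's own YAML recipe format, the proposed change would look something like this (a sketch of only the relevant field, assuming gradient accumulation stays at 1):

```yaml
# recipes/zephyr-7b-beta/dpo/config_full.yaml (relevant field only)
# With 8 GPUs, a per-device batch size of 4 yields the paper's
# global batch size of 32, rather than 64.
per_device_train_batch_size: 4
```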

However, I think the mistake has kind of propagated into their own replication: https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full

For the officially released model, as you mentioned, the DPO batch parameters seem to contradict their replication based on this repo.

Also, the official model page does not give any details on the SFT part, so I have no idea yet whether it was LoRA or full fine-tuning that the HF team used for the official release.

timothylimyl commented 8 months ago

Anyway, I am running a DPO experiment on 4 GPUs, leaving the per-device batch size at 8. If the loss is the same, then I can confirm that the official release uses the ....-sft-full model and that the batch size is correct.

liutianlin0121 commented 8 months ago

Hi!

I re-ran MT-bench to compare the two public DPO-trained zephyr-7b checkpoints:

  1. HuggingFaceH4/zephyr-7b-beta, and
  2. alignment-handbook/zephyr-7b-dpo-full

[Figure: MT-Bench radar chart comparing the two checkpoints]

The MT-Bench score of HuggingFaceH4/zephyr-7b-beta (blue curves above) closely reproduces the number reported in the paper: 7.34 in the paper (Table 1) versus 7.37 from my re-run.

But the MT-Bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

There could be multiple reasons for the gap, such as the batch-size discrepancy discussed above.

I am wondering if you have any insights, @lewtun. It would be great if we could use the recipe to re-train the stronger HuggingFaceH4/zephyr-7b-beta with an MT-Bench score of 7.37. 🙏

timothylimyl commented 8 months ago

@liutianlin0121 I do not think it is an issue with the replication of the model (as we retrain following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.

liutianlin0121 commented 8 months ago

> @liutianlin0121 I do not think it is an issue with the replication of the model (as we retrain following the provided recipes). It seems that even the officially released Hugging Face model's score has degraded.

Yeah my objective is to reproduce the original model HuggingFaceH4/zephyr-7b-beta. Using the existing code base, I suppose I can reproduce the handbook model alignment-handbook/zephyr-7b-dpo-full, but the latter is somehow weaker in MT-bench compared to the former.

timothylimyl commented 8 months ago

> But the MT-Bench score of alignment-handbook/zephyr-7b-dpo-full (yellow curves above) was worse overall. The score is 7.09.

@liutianlin0121,

It seems that I misunderstood your post.

Just to confirm: you were able to replicate (closely enough) the official HF model's MT-Bench score of 7.37?

liutianlin0121 commented 8 months ago

@timothylimyl Yes, I was able to reproduce the MT-Bench score for the official model, though I ran the MT-Bench evaluation a few weeks ago. To debug, perhaps it would be useful to take a look at the GPT-4-generated judgments at data/mt_bench/model_judgment/gpt-4_single.jsonl. Do they appear reasonable?

In one of my early MT-Bench runs, I used too many concurrent API calls with python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel A_LARGE_NUMBER_LIKE_8_or_16. This caused errors in the GPT-4 model judgments at data/mt_bench/model_judgment/gpt-4_single.jsonl. Specifically, some score fields were populated with $error, and those entries were silently omitted when computing the mean scores. After that, I used only a single concurrent API call, and the evaluation was not much slower. Not sure if this is the case for your evaluation, but it may be helpful to manually inspect several model judgments.
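A quick way to spot such failures is to scan the judgment file for non-numeric scores. Here is a minimal sketch; the function name and the assumption that each JSONL line carries a "score" field are mine, so adjust to the actual file layout:

```python
import json

def find_bad_judgments(path: str) -> list[dict]:
    """Return judgment records whose 'score' field is not a valid
    non-negative number (e.g. the $error placeholder left behind by
    failed GPT-4 API calls)."""
    bad = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            score = record.get("score")
            if not isinstance(score, (int, float)) or score < 0:
                bad.append(record)
    return bad

# Example: find_bad_judgments("data/mt_bench/model_judgment/gpt-4_single.jsonl")
```

If this returns a non-empty list, the mean MT-Bench score was computed over fewer judgments than expected, which could bias the comparison between checkpoints.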