tlc4418 / llm_optimization

A repo for RLHF training and best-of-n (BoN) sampling over LLMs, with support for reward model ensembles.
https://arxiv.org/abs/2310.02743

Best-of-n Pipeline takes ages - how to accelerate? #9

Closed RylanSchaeffer closed 1 month ago

RylanSchaeffer commented 1 month ago

I've been running the Best-of-n pipeline with 5 seeds, and after 18 hours I'm only on seed 4; the reward model ensembling evaluation has not yet begun. Is there some way to accelerate this? Could the best-of-n evaluations be parallelized, perhaps?

I'm using the default command:

```
python -u src/bon/run_bon_pipeline.py <path to reward models with 5 seeds> --ensembles
```
tlc4418 commented 1 month ago

Hmm, it shouldn't be this slow, to be honest. It should take maybe ~20 min to 1 hr per seed (I don't remember the exact speed). Do you know which part of the pipeline is slowing you down? And what model size are you using? That said, it's not the fastest, because you need to run inference over roughly 12,600 prompts, and then the unbiased sampler needs to compute the unbiased estimates several times for each n you choose to evaluate on (though I've already parallelized this through Python multiprocessing, and it's now decently quick). If it's any reassurance, the ensemble evaluation itself, after computing the individual seeds, is rather quick.
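For intuition, here is a minimal sketch of the kind of computation the unbiased sampler does: the standard order-statistics estimator of expected best-of-n reward, fanned out over several values of n with Python multiprocessing. This is a generic illustration rather than the repo's actual code; the function name, the toy rewards, and the list of n values are all made up for the example.

```python
from math import comb
from multiprocessing import Pool

def unbiased_bon_estimate(rewards, n):
    """Unbiased order-statistics estimate of E[max reward over n draws],
    computed from N >= n i.i.d. sampled rewards."""
    rewards = sorted(rewards)  # ascending order statistics r_(1) <= ... <= r_(N)
    N = len(rewards)
    # The i-th order statistic is the max of an n-subset in C(i-1, n-1) of the
    # C(N, n) possible subsets, which gives the unbiased estimator below.
    # (math.comb returns 0 when i - 1 < n - 1, so small-i terms vanish.)
    return sum(
        comb(i - 1, n - 1) * r for i, r in enumerate(rewards, start=1)
    ) / comb(N, n)

if __name__ == "__main__":
    rewards = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]  # toy RM scores
    ns = [1, 2, 4, 8]  # values of n to evaluate (illustrative)
    with Pool() as pool:
        estimates = pool.starmap(
            unbiased_bon_estimate, [(rewards, n) for n in ns]
        )
    for n, est in zip(ns, estimates):
        print(f"n={n}: estimated best-of-n reward = {est:.3f}")
```

This also shows why cost scales with how many values of n you evaluate, and why parallelizing over them helps.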

To parallelize, I ran the individual seeds as separate jobs concurrently, and then called the ensemble scoring (https://github.com/tlc4418/llm_optimization/blob/main/src/bon/run_bon_ensembles.py#L14) on the stored individual-seed results. If you have the compute, you can do the same: call the single-RM pipeline several times as separate jobs, each with only a single seed, and then simply call the ensemble scoring function linked above, pointing it at the directory where you stored the individual seed outputs. A sketch of this fan-out follows below.
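For concreteness, a rough template of that fan-out might look like the sketch below. It is a guess at the shape of the workflow, not exact commands: the single-seed model paths, the results directory, and especially the command-line interface of run_bon_ensembles.py are all assumptions (check the linked source for the real entry point and arguments).

```python
import subprocess

# Hypothetical fan-out: one single-seed BoN pipeline run per reward-model
# directory, launched concurrently, followed by ensemble scoring.
# All paths below are placeholders, and the run_bon_ensembles.py CLI is assumed.
seed_dirs = [f"models/rm_seed_{i}" for i in range(5)]

procs = [
    subprocess.Popen(["python", "-u", "src/bon/run_bon_pipeline.py", seed_dir])
    for seed_dir in seed_dirs
]
for p in procs:
    p.wait()  # block until every per-seed job has finished

# Score the ensembles over the stored per-seed outputs.
subprocess.run(
    ["python", "-u", "src/bon/run_bon_ensembles.py", "results/bon_runs/"],
    check=True,
)
```

On a cluster you would more likely submit each per-seed run as its own scheduler job (e.g. one sbatch script per seed) rather than as background processes on one machine; the structure is the same.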