huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0
9.83k stars 1.24k forks

Meaning of `steps` parameter in `trl.PPOConfig` class #1322

Closed medric49 closed 7 months ago

medric49 commented 8 months ago

My concern is about this example trl/examples/scripts/ppo.py.

In the documentation, it is mentioned that the steps parameter in trl.PPOConfig represents the number of training steps, but when I change it, the number of training batches produced by trainer.dataloader here does not change (apparently it depends only on the dataset size). What does this parameter really represent, and in my case, how can I increase the number of training batches?

medric49 commented 8 months ago

According to this, it seems the trl.PPOConfig.steps attribute is never used to build the dataloader.

A workaround that makes steps the actual number of training steps is to wrap the dataloader in a generator that cycles over it:

    def dataloader():
        # Cycle through the PPO dataloader until `steps` batches have been
        # yielded, restarting the iterator each time it is exhausted.
        current_step = 0
        dataloader_iter = iter(ppo_trainer.dataloader)
        while current_step < ppo_config.steps:
            try:
                yield next(dataloader_iter)
            except StopIteration:
                # Dataset exhausted: start a new pass over the dataloader.
                dataloader_iter = iter(ppo_trainer.dataloader)
                yield next(dataloader_iter)
            current_step += 1

    for step, batch in tqdm(enumerate(dataloader()), total=ppo_config.steps):
        queries = batch["input_ids"]

karl-hajjar commented 8 months ago

Hi @medric49, I think the number of steps is not related to the number of batches in the dataloader, but to how many optimization (gradient descent) steps you take. Imagine you have 10 batches in your training dataloader: if learning_config.steps is 20, you will go over the whole training dataset twice (each batch will be used twice for a gradient descent step)!

medric49 commented 8 months ago

Hi @karl-hajjar, I think what you have just described is the role of the ppo_epochs parameter. If ppo_epochs=4, a gradient descent step is applied 4 times on each batch. Also, I looked in the source code, and the steps attribute of PPOConfig never seems to be used in PPOTrainer.
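For reference, the ppo_epochs semantics described above can be sketched as a plain loop. This is illustrative only: `batches` and the recorded `updates` are hypothetical stand-ins, not TRL APIs.

```python
# Sketch of how ppo_epochs is applied: for every batch produced by the
# dataloader, the inner PPO optimization runs ppo_epochs times.
ppo_epochs = 4
batches = [["sample-a"], ["sample-b"], ["sample-c"]]  # pretend dataloader output

updates = []  # record one entry per gradient-descent pass
for batch_idx, batch in enumerate(batches):
    for epoch in range(ppo_epochs):
        updates.append((batch_idx, epoch))  # one optimization pass on this batch

# Each of the 3 batches is optimized on ppo_epochs times: 12 passes total.
assert len(updates) == len(batches) * ppo_epochs
```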

karl-hajjar commented 8 months ago

@medric49 yes, I'm not sure what the steps attribute does, but in any case the number of batches is always fixed and roughly equal to len(dataset) // batch_size. However, as you noted, increasing the epochs parameter (or maybe the steps parameter) increases the number of times each batch is used. The dataloader does not gain more batches, but each batch is used more times, which is what you want in the end: more gradient step iterations. So you cannot really increase the number of batches, but that does not mean you cannot do more training iterations!
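The arithmetic above can be checked with a couple of lines (numbers here are made up for illustration; this assumes a drop-last dataloader):

```python
# The number of batches depends only on dataset size and batch size;
# with drop-last behavior the final partial batch is discarded.
dataset_size = 1000
batch_size = 64
n_batches = dataset_size // batch_size  # 15 full batches

# Doing more passes (epochs) multiplies gradient steps without adding batches.
epochs = 4
total_gradient_steps = n_batches * epochs  # 60 gradient steps overall
```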

medric49 commented 8 months ago

Yes, with the current version, we have a fixed number of batches on which we can perform optimizations.

However, it seemed to me that the most popular method consists of sampling random batches steps times from a replay buffer (our dataset here), with the ppo_epochs parameter controlling the number of optimization passes done per batch during each training step.

So, in the current implementation, the behavior of the dataloader is different from what I was expecting, but it also works very well. It is just that the steps parameter seems to be unused, and the concept of a "number of training steps" no longer exists.
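The replay-buffer-style loop described above might look like the following sketch. It is not TRL code: `dataset` stands in for the buffer, and the update counter is only there to show how steps and ppo_epochs combine.

```python
import random

# Draw `steps` random batches from the dataset (acting as a replay buffer),
# then run `ppo_epochs` optimization passes on each one.
random.seed(0)
dataset = list(range(100))  # stand-in for a replay buffer of samples
steps = 20
batch_size = 8
ppo_epochs = 4

n_updates = 0
for step in range(steps):
    batch = random.sample(dataset, batch_size)  # fresh random batch each step
    for _ in range(ppo_epochs):
        n_updates += 1  # one optimization pass on this batch

assert n_updates == steps * ppo_epochs  # 80 gradient steps in total
```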

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.