huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

[DOCS] `PPOTrainer` references are missing in the API docs #623

Closed · davidberenstein1957 closed this issue 1 year ago

davidberenstein1957 commented 1 year ago

Hi @younesbelkada,

The API reference does not contain a concise example of how to use the `PPOTrainer`.

Perhaps something like this would already suffice, but perhaps you prefer the usage displayed in the attached image?
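
For concreteness, here is a rough sketch of the kind of concise example I mean, adapted from the README quickstart of the time (exact argument names may differ across versions):

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import respond_to_batch

# load the model to optimize, a frozen reference copy, and the tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# initialize the trainer
ppo_config = PPOConfig(batch_size=1)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer)

# encode a query and sample a response from the model
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)

# score the response (dummy reward here) and run a single PPO optimization step
reward = [torch.tensor(1.0)]
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)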

davidberenstein1957 commented 1 year ago

If you would allow me, I would love to pick this up.

lvwerra commented 1 year ago

Hi @davidberenstein1957, the examples are just meant to give a concise taste of what the trainers look like in TRL, so each example is not executable code (e.g. dataset loading is missing in every one). What do you feel is missing exactly?

davidberenstein1957 commented 1 year ago

@lvwerra thanks for the response and the great package.

Something like this, but for the `PPOTrainer`, which I think would mean creating a more complete overview as `docs/source/ppo_trainer.mdx`.

lvwerra commented 1 year ago

I see, yes, that could be nice and a bit more systematic. We already have a lot of info in `customization.mdx`, but we could indeed move it to a `ppo_trainer.mdx` and add it to the same subsection in the ToC. If you want to take a stab at it, I'll be happy to review it.

davidberenstein1957 commented 1 year ago

@lvwerra I have started some efforts and will be able to create a draft PR this coming weekend.

davidberenstein1957 commented 1 year ago

Also, might it be more comprehensive to move some of the PPO logic into the `PPOConfig`?

Having something like what is shown below might unify the API usage a bit more.

from transformers import pipeline
from trl import PPOConfig, PPOTrainer

config = PPOConfig(
    # ...the usual PPOConfig arguments...
    reward_model=...,      # a transformers `pipeline` acting as the reward model
    generation_args=...,   # dict of generation arguments
)
trainer = PPOTrainer(
    config=config,
    model=model,
    tokenizer=tokenizer,
)
trainer.train()

Alternatively, we might pass a Callable reward function to the config?

from trl import PPOConfig, PPOTrainer

def reward_func(examples):
    # compute and return one reward per example
    rewards = ...
    return rewards

config = PPOConfig(
    # ...the usual PPOConfig arguments...
    reward_func=reward_func,  # a Callable mapping examples to rewards
)
trainer = PPOTrainer(
    config=config,
    model=model,
    tokenizer=tokenizer,
)
trainer.train()

lvwerra commented 1 year ago

Indeed, we have been contemplating this, but the evolving logic inside the generation/reward/optimization loop is the main reason we haven't settled on an API yet. Would you mind opening an issue for that, so we can see if there's some community traction for it?
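
Concretely, the explicit loop that a config-driven `trainer.train()` would need to absorb looks roughly like this today (a sketch based on the sentiment example; it assumes `ppo_trainer` was built with a dataset and `tokenizer`, and the reward pipeline is a placeholder):

import torch
from transformers import pipeline

# placeholder reward model: any text-classification pipeline
reward_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")
generation_kwargs = {"do_sample": True, "pad_token_id": tokenizer.eos_token_id, "max_new_tokens": 32}

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # 1. generation: sample a response from the current policy for each query
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[len(query):])  # keep only the generated tokens
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # 2. reward: score query/response pairs with the reward pipeline
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = [torch.tensor(out["score"]) for out in reward_pipe(texts)]

    # 3. optimization: one PPO step on the batch, then log
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)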

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.