Closed: davidberenstein1957 closed this issue 1 year ago.
If you would allow me, I would love to pick this up.
Hi @davidberenstein1957, the examples should just give a concise taste of what the trainers look like in TRL, so each example is not executable code (e.g. dataset loading is missing in every one). What do you feel is missing exactly?
@lvwerra thanks for the response and great package.
Something like this, but for the PPOTrainer, which I think would mean creating a more complete overview as a docs/source/ppo_trainer.mdx.
I see, yes, that could be nice and a bit more systematic. We already have a lot of info in customization.mdx, but we could indeed move it to a ppo_trainer.mdx and add it to the same subsection in the ToC. If you want to take a stab at it, I'll be happy to review it.
@lvwerra I have started some efforts and will be able to create a draft PR this coming weekend.
Also, might it be more comprehensive to move some of the PPO logic into the PPOConfig? Having something like what is shown underneath might unify the API usage a bit more.
```python
from transformers import pipeline
from trl import PPOConfig, PPOTrainer

reward_model = pipeline("sentiment-analysis")  # any reward pipeline

# proposed API: the reward model and generation args live on the config
config = PPOConfig(
    reward_model=reward_model,
    generation_args={"max_new_tokens": 32},
    # ...other PPOConfig arguments
)

trainer = PPOTrainer(
    config=config,
    model=model,          # the policy model
    tokenizer=tokenizer,
)
trainer.train()
```
Alternatively, we might pass a Callable reward_function to the config?
```python
from trl import PPOConfig, PPOTrainer

def reward_func(examples):
    rewards = ...  # compute a reward for each example
    return rewards

# proposed API: a Callable reward function lives on the config
config = PPOConfig(
    reward_func=reward_func,
    # ...other PPOConfig arguments
)

trainer = PPOTrainer(
    config=config,
    model=model,
    tokenizer=tokenizer,
)
trainer.train()
```
Indeed, we have been contemplating this, but the evolving logic inside the generation/reward/optimization loop was the main reason we haven't settled yet. Would you mind opening an issue for that, so we can see if there's some community traction for it?
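For reference, a rough sketch of what that loop looks like with the current API, where generation, reward computation, and the optimization step all live in user code rather than in the config (names such as `dataloader`, `generation_kwargs`, and `reward_pipe` are placeholders):

```python
import torch
from transformers import pipeline

# a rough sketch, assuming `ppo_trainer`, `tokenizer`, `dataloader`,
# and `generation_kwargs` have been set up as in the existing examples
reward_pipe = pipeline("sentiment-analysis")

for batch in dataloader:
    query_tensors = batch["input_ids"]

    # 1. generation (user code)
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[len(query):])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # 2. reward computation (user code, e.g. a sentiment pipeline)
    pipe_outputs = reward_pipe(batch["response"])
    rewards = [torch.tensor(output["score"]) for output in pipe_outputs]

    # 3. PPO optimization step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Because each of these three stages tends to change from use case to use case, folding them into PPOConfig would constrain that flexibility.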
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi @younesbelkada,
The API reference does not contain a concise example of using the PPOTrainer. Perhaps something like this would already suffice, but perhaps you prefer the usage displayed in this image?
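For illustration (the snippet and image referenced above are not reproduced here), a minimal initialization sketch, assuming the API at the time and using placeholder model names, might look like this:

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# minimal sketch; "gpt2" is only an example checkpoint, and the dataset
# plus the generation/reward/step loop are omitted for brevity
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=16)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```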