Patch Fixes:
Raise an error when using parameter reallocation with the DS backend.
Set use_cuda_graph=True in all example scripts and remove the error message in PPO/GRPO/Reinforce experiments when using CUDAGraph.
Backend API change:
Add a post_hook argument to the forward (aka inference) API: a function that post-processes the output logits, e.g., collecting log probabilities from them. This is useful for mini-batched inference, since it avoids saving all model outputs and can save a large amount of GPU memory.
Unify the forward and eval_batch APIs. eval_batch is now a special case of forward with a post_hook that collects losses and statistics.
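The post_hook pattern above can be sketched in plain Python. This is an illustrative mock, not the actual realhf signatures: forward, post_hook, log_probs_of_first_token, and mean_loss_hook are hypothetical names, and the "model call" is a stand-in.

```python
# Hypothetical sketch of the unified forward/post_hook pattern.
# Only the (small) post-processed result of each mini-batch is kept,
# instead of the full logits tensor.
import math

def forward(mini_batches, post_hook=None):
    results = []
    for batch in mini_batches:
        # Stand-in for the real model call that would produce logits.
        logits = [[float(x) for x in row] for row in batch]
        if post_hook is not None:
            # Keep only the reduced output; the full logits can be freed.
            results.append(post_hook(logits))
        else:
            results.append(logits)
    return results

def log_probs_of_first_token(logits):
    # Collect the log-softmax of the first class only, discarding the rest.
    out = []
    for row in logits:
        denom = math.log(sum(math.exp(x) for x in row))
        out.append(row[0] - denom)
    return out

def mean_loss_hook(logits):
    # eval_batch as a special case: a post_hook reducing to a scalar "loss".
    flat = [x for row in logits for x in row]
    return sum(flat) / len(flat)
```

With this structure, eval_batch simply calls forward with a loss-collecting hook instead of being a separate code path.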
For inference and generate calls, mini-batches are now split in the outer loop. We call engine.generate or engine.inference multiple times, which has similar end-to-end latency but saves GPU memory. For example, the KV cache and intermediate activations are not kept for all mini-batches at once.
In the load_hf_tokenizer function, set trust_remote_code=True by default.
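The outer-loop splitting can be sketched as follows. This is illustrative only: split_into_minibatches and generate_all are hypothetical helpers, and engine.generate stands in for the real engine call, whose actual signature may differ.

```python
# Illustrative sketch: split the full batch in the outer loop and call the
# engine once per mini-batch, so the KV cache and activations for one
# mini-batch can be freed before the next one starts.
def split_into_minibatches(samples, n_mbs):
    # Split `samples` into `n_mbs` contiguous chunks of (near-)equal size.
    size = (len(samples) + n_mbs - 1) // n_mbs
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def generate_all(engine, samples, n_mbs):
    outputs = []
    for mb in split_into_minibatches(samples, n_mbs):
        outputs.extend(engine.generate(mb))  # one engine call per mini-batch
    return outputs
```

Because each call processes only one mini-batch, peak GPU memory scales with the mini-batch size rather than the full batch size, while total latency stays roughly the same.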
Automatically amend IDs for datasets.
Bug Fixes:
The overlap_param_gather option in the Megatron backend now defaults to False rather than True. PPO with parameter reallocation can be algorithmically incorrect when this option is enabled. See the explanation in realhf/impl/model/backend/megatron.py.
Fix the padding value when gathering mini-batched generation outputs. If the padding value is not pad_token_id, the generation length computed by the PPO interface will be incorrect.
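The padding bug can be illustrated with a toy example. The token IDs, pad value, and gen_length helper below are all made up for illustration; counting "non-pad tokens" only works when the gathered outputs are padded with pad_token_id itself.

```python
# Hypothetical illustration of the padding-value bug.
pad_token_id = 0

def gen_length(seq, pad_id):
    # Generation length as the number of non-pad tokens in the sequence.
    return sum(1 for t in seq if t != pad_id)

good = [5, 7, 9, pad_token_id, pad_token_id]  # padded with pad_token_id
bad = [5, 7, 9, -1, -1]                       # padded with a different value

assert gen_length(good, pad_token_id) == 3  # correct length
assert gen_length(bad, pad_token_id) == 5   # inflated length: the bug
```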
Restrict the model-saving handlers to trainable models; otherwise, the request can be sent to models that have not been instantiated yet (e.g., the actor used for generation with parameter reallocation).
Fix all examples to make them runnable.
The minimum batch size per DP rank should be n_mbs instead of 1 in the master worker.
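The constraint above amounts to a simple product. The function name and variables below are illustrative, not the master worker's actual code; the point is that with n_mbs mini-batches per pass, each DP rank needs at least n_mbs samples so that no mini-batch is empty.

```python
# Sketch of the batch-size constraint in the master worker.
def min_total_batch_size(dp_world_size, n_mbs):
    # Each DP rank must receive at least n_mbs samples (one per mini-batch);
    # assuming a minimum of 1 per rank would leave some mini-batches empty.
    return dp_world_size * n_mbs
```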
Request evaluation and model saving outside coroutines when the experiment is about to complete. This fixes a bug where the model from the last epoch would not be saved.
Resolve the generation interface issues mentioned in #59.
Changes after review
Add a mini-batched PPO script example.
Fix the bug in pipelined mini-batched inference/generation.
Distinguish the names used for mini-batches in pipelining from those used in interfaces.