Patch Fixes:
Raise an error when using parameter reallocation with the DS backend.
Set use_cuda_graph=True in all example scripts and remove the error message in PPO/GRPO/Reinforce experiments when using CUDAGraph.
Backend API change:
Add a post_hook argument to the forward (aka inference) API: a function that post-processes the output logits, e.g., collecting log probabilities from them. This is useful for mini-batched inference, since it avoids saving all model outputs and can save a large amount of GPU memory.
Unify the forward and eval_batch APIs. eval_batch is now a special case of forward with a post_hook that collects losses and statistics.
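The post_hook pattern above can be sketched in plain Python. This is an illustrative mock, not the actual realhf signatures: forward, post_hook, log_probs_of_first_token, and mean_loss_hook are hypothetical names, and the "model call" is a stand-in.

```python
# Hypothetical sketch of the unified forward/post_hook pattern.
# Only the (small) post-processed result of each mini-batch is kept,
# instead of the full logits tensor.
import math

def forward(mini_batches, post_hook=None):
    results = []
    for batch in mini_batches:
        # Stand-in for the real model call that would produce logits.
        logits = [[float(x) for x in row] for row in batch]
        if post_hook is not None:
            # Keep only the reduced output; the full logits can be freed.
            results.append(post_hook(logits))
        else:
            results.append(logits)
    return results

def log_probs_of_first_token(logits):
    # Collect the log-softmax of the first class only, discarding the rest.
    out = []
    for row in logits:
        denom = math.log(sum(math.exp(x) for x in row))
        out.append(row[0] - denom)
    return out

def mean_loss_hook(logits):
    # eval_batch as a special case: a post_hook reducing to a scalar "loss".
    flat = [x for row in logits for x in row]
    return sum(flat) / len(flat)
```

With this structure, eval_batch simply calls forward with a loss-collecting hook instead of being a separate code path.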
For inference and generate calls, mini-batches are now split in the outer loop. We call engine.generate or engine.inference multiple times, which has similar end-to-end latency but saves GPU memory. For example, the KV cache and intermediate activations are not kept for all mini-batches at once.
In the load_hf_tokenizer function, set trust_remote_code=True by default.
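The outer-loop splitting can be sketched as follows. This is illustrative only: split_into_minibatches and generate_all are hypothetical helpers, and engine.generate stands in for the real engine call, whose actual signature may differ.

```python
# Illustrative sketch: split the full batch in the outer loop and call the
# engine once per mini-batch, so the KV cache and activations for one
# mini-batch can be freed before the next one starts.
def split_into_minibatches(samples, n_mbs):
    # Split `samples` into `n_mbs` contiguous chunks of (near-)equal size.
    size = (len(samples) + n_mbs - 1) // n_mbs
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def generate_all(engine, samples, n_mbs):
    outputs = []
    for mb in split_into_minibatches(samples, n_mbs):
        outputs.extend(engine.generate(mb))  # one engine call per mini-batch
    return outputs
```

Because each call processes only one mini-batch, peak GPU memory scales with the mini-batch size rather than the full batch size, while total latency stays roughly the same.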
Automatically amend IDs for datasets.
Bug Fixes:
The overlap_param_gather option in the Megatron backend now defaults to False rather than True. PPO with parameter reallocation can be algorithmically incorrect when this option is enabled. See the explanation in realhf/impl/model/backend/megatron.py.
Fix the padding value when gathering mini-batched generation outputs. If the padding value is not pad_token_id, the generation length computed by the PPO interface will be incorrect.
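The padding bug can be illustrated with a toy example. The token IDs, pad value, and gen_length helper below are all made up for illustration; counting "non-pad tokens" only works when the gathered outputs are padded with pad_token_id itself.

```python
# Hypothetical illustration of the padding-value bug.
pad_token_id = 0

def gen_length(seq, pad_id):
    # Generation length as the number of non-pad tokens in the sequence.
    return sum(1 for t in seq if t != pad_id)

good = [5, 7, 9, pad_token_id, pad_token_id]  # padded with pad_token_id
bad = [5, 7, 9, -1, -1]                       # padded with a different value

assert gen_length(good, pad_token_id) == 3  # correct length
assert gen_length(bad, pad_token_id) == 5   # inflated length: the bug
```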
Restrict the model-saving handlers to trainable models; otherwise, the request can be sent to models that have not been instantiated yet (e.g., the actor used for generation with parameter reallocation).
Fix all examples to make them runnable.
The minimum batch size per DP rank should be n_mbs instead of 1 in the master worker.
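The constraint above amounts to a simple product. The function name and variables below are illustrative, not the master worker's actual code; the point is that with n_mbs mini-batches per pass, each DP rank needs at least n_mbs samples so that no mini-batch is empty.

```python
# Sketch of the batch-size constraint in the master worker.
def min_total_batch_size(dp_world_size, n_mbs):
    # Each DP rank must receive at least n_mbs samples (one per mini-batch);
    # assuming a minimum of 1 per rank would leave some mini-batches empty.
    return dp_world_size * n_mbs
```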
Request evaluation and model saving outside coroutines when the experiment is about to complete. This fixes a bug where the model from the last epoch would not be saved.
Resolve the generation interface issues mentioned in #59.
Changes after review
Add a mini-batched PPO script example.
Fix the bug in pipelined mini-batched inference/generation.
Distinguish the names used for mini-batches in pipelining from those used in interfaces.