hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[FEATURE]: GPU-RAM-friendly PPO training for big models (larger than 2B) #3566

Open yynil opened 1 year ago

yynil commented 1 year ago

Describe the feature

PPO training needs to keep four models in memory at the same time. The original implementation keeps the reward, actor, critic, and initial models in video RAM simultaneously. The actor's and initial model's outputs are token ids, which serve as actions for the reward and critic models. If the reward model and the actor model do not share the same tokenizer, those ids are meaningless to the reward model.

Even within the same model family, such as BLOOM, developers cannot rely on the strong assumption that models of different scales share the same tokenizer. For example, bloom7b-mt need not share a tokenizer with bloom-560m.

Things get even worse if only one LLM is available, such as ChatGLM-6B: there is no smaller model we could even bet shares its tokenizer.
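When the tokenizers differ, the only safe bridge between the actor and the reward model is text: decode the actor's action ids, then re-encode them with the reward model's tokenizer. A minimal sketch of that idea, using toy vocabularies as hypothetical stand-ins for the two real tokenizers:

```python
# Toy stand-ins for two incompatible tokenizers (hypothetical vocabularies;
# a real setup would use each model's own tokenizer from its checkpoint).
actor_vocab = {"hello": 0, "world": 1}
actor_decode = {v: k for k, v in actor_vocab.items()}
reward_vocab = {"hello": 7, "world": 3}

def bridge(action_ids):
    # Actor ids mean nothing to the reward model; round-trip through text.
    text = " ".join(actor_decode[i] for i in action_ids)
    return [reward_vocab[tok] for tok in text.split()]

print(bridge([0, 1]))  # → [7, 3]: same tokens, reward model's ids
```

The same decode/re-encode step works with any pair of tokenizers, at the cost of an extra pass over the generated text.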

So a video-RAM-friendly PPO trainer is needed, one that keeps only one model in video RAM at a time during training.
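One way to realize "only one model in video RAM" is to park all four models on the CPU and move each onto the GPU just for its forward pass, evicting it afterwards. A minimal sketch of that residency pattern, with a dummy `Model` class standing in for an `nn.Module` (names and structure are illustrative, not the fork's actual implementation):

```python
from contextlib import contextmanager

class Model:
    """Stand-in for an nn.Module; tracks which device it lives on."""
    def __init__(self, name):
        self.name, self.device = name, "cpu"
    def to(self, device):
        self.device = device
        return self

@contextmanager
def on_gpu(model, device="cuda"):
    # Move a single model onto the GPU, yield it for its forward pass,
    # then evict it back to CPU so only one model occupies VRAM at a time.
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")

actor, critic, reward, initial = (Model(n) for n in ("actor", "critic", "reward", "initial"))
with on_gpu(actor) as m:
    print(m.device)        # → cuda: only the actor is resident
print(actor.device)        # → cpu: evicted after use
```

This trades host-device transfer time for peak VRAM, which is the point of the proposal: a single 2B+ model's weights fit where four copies would not.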

I have finished the code and the README doc in my fork. I'll submit a PR for this feature later.

binmakeswell commented 1 year ago

Hi @yynil Thank you very much for your proposal and contribution. Looking forward to your further PR updates. Thanks.