microsoft / DeepSpeedExamples

Example models using DeepSpeed

step3_rlhf_finetuning and two tokenizers #577

Open GenVr opened 1 year ago

GenVr commented 1 year ago

Hello. I'm trying to train a GPT-J 6B actor, and as a critic model I have trained several networks from different/similar families (GPT-2, GPT-Neo, BLOOM, ...). I know that in step 3 only one tokenizer is used, so with GPT-J I get this error: #512. Which critic model can I use with GPT-J? Thanks.

xiaoxiawu-microsoft commented 1 year ago

Hi @GenVr, thanks for the question. If your critic model uses a different tokenizer from your actor model, you may add an extra tokenizer and pass it to this function: https://github.com/microsoft/DeepSpeedExamples-internal/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py#L310

Then make sure you repeat the same process as the existing tokenization: https://github.com/microsoft/DeepSpeedExamples-internal/blob/master/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L212

That is, wherever you see the existing tokenizer doing the tokenization, do the same with your added tokenizer. Hope this helps.
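A minimal sketch of what "repeating the tokenization" could look like. The model names, the `tokenize_for_both` helper, and the padding arguments are illustrative assumptions, not the actual DeepSpeed-Chat code:

```python
# Hypothetical sketch: tokenize each prompt with both the actor and the
# critic tokenizer, mirroring the existing single-tokenizer pipeline.
from transformers import AutoTokenizer

# Model ids are examples only; substitute your own actor/critic models.
actor_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
critic_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# GPT-style tokenizers often lack a pad token, so fall back to EOS.
for tok in (actor_tokenizer, critic_tokenizer):
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token

def tokenize_for_both(prompt, max_seq_len=512):
    """Repeat the same tokenization call with each tokenizer and keep
    both results, so the actor and the critic each receive ids from
    their own vocabulary."""
    actor_inputs = actor_tokenizer(prompt,
                                   max_length=max_seq_len,
                                   padding="max_length",
                                   truncation=True,
                                   return_tensors="pt")
    critic_inputs = critic_tokenizer(prompt,
                                     max_length=max_seq_len,
                                     padding="max_length",
                                     truncation=True,
                                     return_tensors="pt")
    return actor_inputs, critic_inputs
```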

xiaoxiawu-microsoft commented 1 year ago

@GenVr I would still suggest using the same tokenizer. For critic-model training, if you want a smaller model, you could load the original llama-7b and reduce the number of layers to 1/8 or 1/4 of the original. We have a layer-reduction training technique (with knowledge distillation); a rough sketch of the idea follows below. To see how we do layer reduction, please check: https://github.com/microsoft/DeepSpeedExamples/blob/master/compression/bert/bash_script/layer_reduction.sh
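As a rough illustration of the layer-reduction idea only (not the linked script's actual knowledge-distillation recipe, which is driven through the DeepSpeed compression library), one could truncate a Hugging Face LLaMA checkpoint to a quarter of its decoder layers. The model id and the keep-a-prefix choice are assumptions:

```python
# Hypothetical sketch: shrink a LLaMA-style model by keeping only a
# prefix of its decoder layers. The real recipe additionally distills
# the reduced model from the full one.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

keep = max(1, model.config.num_hidden_layers // 4)  # keep 1/4 of the layers
model.model.layers = nn.ModuleList(list(model.model.layers)[:keep])
model.config.num_hidden_layers = keep

# Save the truncated model for use as the critic backbone.
model.save_pretrained("llama-7b-layer-reduced")
```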

liuaiting commented 1 year ago

May I ask when loading different tokenizers for the actor and critic will be officially supported? @xiaoxiawu-microsoft