shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

assert tokenzier_vocab_size > model_vocab_size #350

Closed sevenandseven closed 3 months ago

sevenandseven commented 3 months ago

Describe the Question


Hello, while fine-tuning the chatglm3 base model with LoRA, I ran into a mismatch between the model's vocabulary size and the tokenizer's vocabulary size during the merge step. How can I solve this?
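(For anyone hitting the same assert, here is a minimal diagnostic sketch, assuming a standard chatglm3 checkpoint; the model path is an example, not taken from this issue. It prints the two sizes that the failing `assert tokenzier_vocab_size > model_vocab_size` compares.)

```python
# Diagnostic sketch: compare tokenizer vocab size with the model's
# embedding size. The path below is an example; substitute your own.
from transformers import AutoModel, AutoTokenizer

base_model_path = "THUDM/chatglm3-6b"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True)

tokenzier_vocab_size = len(tokenizer)  # same (misspelled) name the repo uses
model_vocab_size = model.get_input_embeddings().weight.size(0)

print(f"tokenizer vocab size: {tokenzier_vocab_size}")
print(f"model vocab size:     {model_vocab_size}")
# chatglm3's embedding matrix is typically padded beyond the tokenizer
# length, so the model vocab can legitimately be larger, which is exactly
# the condition that trips the strict assert in the merge script.
```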

shibing624 commented 3 months ago

```python
logger.info("Resize model embeddings to fit tokenizer")
base_model.resize_token_embeddings(tokenzier_vocab_size)
```
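(In context, that resize belongs in the merge script before the LoRA weights are folded in. A minimal sketch of the whole flow, assuming example paths and a plain if-guard in place of the strict assert; this mirrors the maintainer's suggestion rather than the repo's exact merge_peft_adapter.py:)

```python
# Sketch of a LoRA merge with an embedding resize; all paths are examples.
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

base_model_path = "THUDM/chatglm3-6b"  # example base model
lora_model_path = "outputs-pt-v1"      # example adapter dir produced by run_pt
output_dir = "merged-chatglm3"         # example output dir

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True)

tokenzier_vocab_size = len(tokenizer)
model_vocab_size = base_model.get_input_embeddings().weight.size(0)

# Resize whenever the sizes differ, instead of asserting that the tokenizer
# must be strictly larger (the condition chatglm3's padded vocab violates).
if model_vocab_size != tokenzier_vocab_size:
    print("Resize model embeddings to fit tokenizer")
    base_model.resize_token_embeddings(tokenzier_vocab_size)

# Fold the LoRA weights into the resized base model and save the result.
model = PeftModel.from_pretrained(base_model, lora_model_path)
model = model.merge_and_unload()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```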

sevenandseven commented 3 months ago

Okay, thank you for your reply.

sevenandseven commented 3 months ago

[screenshot: Image_20240320135926]

After applying your fix, a new problem appeared (see the screenshot above).

shibing624 commented 3 months ago

Did you modify the vocabulary manually? If the vocabulary wasn't changed, the `tokenzier_vocab_size > model_vocab_size` assertion shouldn't fire. Downgrade transformers to 4.28.1.
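(To apply and verify the suggested downgrade, a quick check; 4.28.1 is the version the maintainer names above, installed with the usual `pip install transformers==4.28.1`:)

```python
# Confirm the installed transformers version after downgrading.
import transformers

print(transformers.__version__)  # should print 4.28.1 after the downgrade
```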

sevenandseven commented 3 months ago

I didn't modify the vocabulary manually; I only used the project's run_pt script with txt data to do a second round of fine-tuning on the base model. I'll try downgrading the version.