shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

assert tokenzier_vocab_size > model_vocab_size #350

Closed sevenandseven closed 3 months ago

sevenandseven commented 3 months ago

Describe the Question


Hello, while fine-tuning the chatglm3 base model with LoRA, I ran into a mismatch between the model's vocabulary size and the tokenizer's vocabulary size during the merge step. How can I solve this?
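(For anyone hitting the same assert, here is a minimal diagnostic sketch, assuming a standard chatglm3 checkpoint; the model path is an example, not taken from this issue. It prints the two sizes that the failing `assert tokenzier_vocab_size > model_vocab_size` compares.)

```python
# Diagnostic sketch: compare tokenizer vocab size with the model's
# embedding size. The path below is an example; substitute your own.
from transformers import AutoModel, AutoTokenizer

base_model_path = "THUDM/chatglm3-6b"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True)

tokenzier_vocab_size = len(tokenizer)  # same (misspelled) name the repo uses
model_vocab_size = model.get_input_embeddings().weight.size(0)

print(f"tokenizer vocab size: {tokenzier_vocab_size}")
print(f"model vocab size:     {model_vocab_size}")
# chatglm3's embedding matrix is typically padded beyond the tokenizer
# length, so the model vocab can legitimately be larger, which is exactly
# the condition that trips the strict assert in the merge script.
```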

shibing624 commented 3 months ago

```python
logger.info("Resize model embeddings to fit tokenizer")
base_model.resize_token_embeddings(tokenzier_vocab_size)
```
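(In context, that resize belongs in the merge script before the LoRA weights are folded in. A minimal sketch of the whole flow, assuming example paths and a plain if-guard in place of the strict assert; this mirrors the maintainer's suggestion rather than the repo's exact merge_peft_adapter.py:)

```python
# Sketch of a LoRA merge with an embedding resize; all paths are examples.
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

base_model_path = "THUDM/chatglm3-6b"  # example base model
lora_model_path = "outputs-pt-v1"      # example adapter dir produced by run_pt
output_dir = "merged-chatglm3"         # example output dir

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True)

tokenzier_vocab_size = len(tokenizer)
model_vocab_size = base_model.get_input_embeddings().weight.size(0)

# Resize whenever the sizes differ, instead of asserting that the tokenizer
# must be strictly larger (the condition chatglm3's padded vocab violates).
if model_vocab_size != tokenzier_vocab_size:
    print("Resize model embeddings to fit tokenizer")
    base_model.resize_token_embeddings(tokenzier_vocab_size)

# Fold the LoRA weights into the resized base model and save the result.
model = PeftModel.from_pretrained(base_model, lora_model_path)
model = model.merge_and_unload()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```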

sevenandseven commented 3 months ago

Okay, thank you for your reply.

sevenandseven commented 3 months ago

[screenshot: Image_20240320135926]

After applying your fix, a new problem appeared (see the screenshot above).

shibing624 commented 3 months ago

Did you modify the vocabulary manually? If the vocabulary wasn't changed, the `tokenzier_vocab_size > model_vocab_size` assertion shouldn't fire. Downgrade transformers to 4.28.1.
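(To apply and verify the suggested downgrade, a quick check; 4.28.1 is the version the maintainer names above, installed with the usual `pip install transformers==4.28.1`:)

```python
# Confirm the installed transformers version after downgrading.
import transformers

print(transformers.__version__)  # should print 4.28.1 after the downgrade
```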

sevenandseven commented 3 months ago

I didn't modify the vocabulary manually; I only used the project's run_pt script with txt data to do a second round of fine-tuning on the base model. I'll try downgrading the version.