hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Hello, I saw that issue #525 about supporting CodeGeeX2 fine-tuning has been closed. Does that mean it is already supported? When I run the project, choose custom as the model, and load the CodeGeeX2 model files directly, I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer' #1885

Closed: lldhliu closed this issue 8 months ago

lldhliu commented 8 months ago

Reminder

Reproduction

CUDA_VISIBLE_DEVICES=0 python src/train_web.py

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 8 months ago

Update to the latest tokenization_chatglm.py: https://huggingface.co/THUDM/chatglm2-6b/blob/main/tokenization_chatglm.py (see https://github.com/THUDM/ChatGLM2-6B/issues/156#issuecomment-1619383374)
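
A rough sketch of one way to apply that suggestion by copying the updated file over the local model's copy (the local directory path and the use of huggingface_hub are my own assumptions, not part of the reply above):

# Sketch: fetch the latest tokenization_chatglm.py from THUDM/chatglm2-6b
# and overwrite the copy shipped alongside the local model weights.
import shutil
from huggingface_hub import hf_hub_download

local_model_dir = "./codegeex2-6b"  # hypothetical path to the local model files

updated_file = hf_hub_download(
    repo_id="THUDM/chatglm2-6b",
    filename="tokenization_chatglm.py",
)
shutil.copy(updated_file, f"{local_model_dir}/tokenization_chatglm.py")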

lldhliu commented 8 months ago

@hiyouga I have already updated tokenization_chatglm.py, but the problem is still not solved. I upgraded my transformers version to match your latest code requirements. From my investigation it looks like a transformers version issue. Could you please take another look?

hiyouga commented 8 months ago

What does your updated file look like?

lldhliu commented 8 months ago

@hiyouga Solved. The solution is as follows: in tokenization_chatglm.py,

class ChatGLMTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
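        # NOTE: at this point self.tokenizer has not been assigned yet; on newer
        # transformers versions the base __init__ already needs it, hence the AttributeError.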
        self.name = "GLMTokenizer"

        self.vocab_file = vocab_file
        self.tokenizer = SPTokenizer(vocab_file)
        self.special_tokens = {
            "<bos>": self.tokenizer.bos_id,
            "<eos>": self.tokenizer.eos_id,
            "<pad>": self.tokenizer.pad_id
        }
Moving super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs) to the end of __init__ fixes it:
class ChatGLMTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
        self.name = "GLMTokenizer"

        self.vocab_file = vocab_file
        self.tokenizer = SPTokenizer(vocab_file)
        self.special_tokens = {
            "<bos>": self.tokenizer.bos_id,
            "<eos>": self.tokenizer.eos_id,
            "<pad>": self.tokenizer.pad_id
        }
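        # super().__init__ now runs only after self.tokenizer has been assigned.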
        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
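
The underlying reason (my reading of the traceback, since it is not spelled out above) is that newer transformers releases have PreTrainedTokenizer.__init__ call back into subclass methods that, on ChatGLMTokenizer, rely on self.tokenizer; if super().__init__ runs first, that attribute does not exist yet. A minimal, self-contained sketch of the same failure mode (the class and method names here are illustrative, not taken from the real tokenizer):

# Illustrative only: a base class whose __init__ calls a hook that the
# subclass implements using an attribute set in the subclass __init__.
class Base:
    def __init__(self):
        # mirrors newer PreTrainedTokenizer.__init__ calling subclass methods
        self.register_tokens()


class Broken(Base):
    def __init__(self):
        super().__init__()         # hook runs before self.tokenizer exists
        self.tokenizer = object()

    def register_tokens(self):
        _ = self.tokenizer         # AttributeError: no attribute 'tokenizer'


class Fixed(Base):
    def __init__(self):
        self.tokenizer = object()  # set the attribute first
        super().__init__()         # hook now finds self.tokenizer

    def register_tokens(self):
        _ = self.tokenizer


Fixed()     # works
# Broken()  # raises the same AttributeError reported in this issue
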
lldhliu commented 8 months ago

@hiyouga After the adjustment above it runs, but every chat response is just 111111: (screenshot)

hiyouga commented 8 months ago

Under Advanced settings, set the template to vanilla.

lldhliu commented 8 months ago

@hiyouga I switched to vanilla and it still doesn't seem to work: (screenshot)

hiyouga commented 8 months ago

Update the code and use the codegeex2 template (a67a440644687dc2262134c0f2895f3ae42cae19). Also note that CodeGeeX2 is not a chat model; it only provides code continuation, so it will not stop generating on its own.
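
To illustrate the continuation-only behavior outside of LLaMA-Factory, here is a rough sketch of plain code completion with the base model, following the usual trust_remote_code loading recipe for THUDM models; treat the exact arguments as assumptions, and cap generation with max_new_tokens since the model will not stop on its own:

# Sketch: CodeGeeX2 as a plain code-completion model (no chat turns).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True).half().cuda().eval()

prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Cap the output length explicitly: the model keeps continuing otherwise.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))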