hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Hello, I saw that issue #525 about supporting CodeGeeX2 fine-tuning has been closed. Does that mean it is already supported? When I run the project, choose custom as the model, and load the CodeGeeX2 model files directly, I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer' #1885

Closed: lldhliu closed this issue 8 months ago

lldhliu commented 8 months ago

Reminder

Reproduction

CUDA_VISIBLE_DEVICES=0 python src/train_web.py

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 8 months ago

Update to the latest tokenization_chatglm.py: https://huggingface.co/THUDM/chatglm2-6b/blob/main/tokenization_chatglm.py (see https://github.com/THUDM/ChatGLM2-6B/issues/156#issuecomment-1619383374)
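
A rough sketch of one way to apply that suggestion by copying the updated file over the local model's copy (the local directory path and the use of huggingface_hub are my own assumptions, not part of the reply above):

# Sketch: fetch the latest tokenization_chatglm.py from THUDM/chatglm2-6b
# and overwrite the copy shipped alongside the local model weights.
import shutil
from huggingface_hub import hf_hub_download

local_model_dir = "./codegeex2-6b"  # hypothetical path to the local model files

updated_file = hf_hub_download(
    repo_id="THUDM/chatglm2-6b",
    filename="tokenization_chatglm.py",
)
shutil.copy(updated_file, f"{local_model_dir}/tokenization_chatglm.py")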

lldhliu commented 8 months ago

@hiyouga I have already updated tokenization_chatglm.py, but the problem is still not solved. I upgraded my transformers version to match your latest code requirements. From my investigation it looks like a transformers version issue. Could you please take another look?

hiyouga commented 8 months ago

What does your updated file look like?

lldhliu commented 8 months ago

@hiyouga Solved. The solution is as follows: in tokenization_chatglm.py,

class ChatGLMTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
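        # NOTE: at this point self.tokenizer has not been assigned yet; on newer
        # transformers versions the base __init__ already needs it, hence the AttributeError.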
        self.name = "GLMTokenizer"

        self.vocab_file = vocab_file
        self.tokenizer = SPTokenizer(vocab_file)
        self.special_tokens = {
            "<bos>": self.tokenizer.bos_id,
            "<eos>": self.tokenizer.eos_id,
            "<pad>": self.tokenizer.pad_id
        }
Moving super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs) to the end of __init__ fixes it:
class ChatGLMTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
        self.name = "GLMTokenizer"

        self.vocab_file = vocab_file
        self.tokenizer = SPTokenizer(vocab_file)
        self.special_tokens = {
            "<bos>": self.tokenizer.bos_id,
            "<eos>": self.tokenizer.eos_id,
            "<pad>": self.tokenizer.pad_id
        }
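        # super().__init__ now runs only after self.tokenizer has been assigned.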
        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
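
The underlying reason (my reading of the traceback, since it is not spelled out above) is that newer transformers releases have PreTrainedTokenizer.__init__ call back into subclass methods that, on ChatGLMTokenizer, rely on self.tokenizer; if super().__init__ runs first, that attribute does not exist yet. A minimal, self-contained sketch of the same failure mode (the class and method names here are illustrative, not taken from the real tokenizer):

# Illustrative only: a base class whose __init__ calls a hook that the
# subclass implements using an attribute set in the subclass __init__.
class Base:
    def __init__(self):
        # mirrors newer PreTrainedTokenizer.__init__ calling subclass methods
        self.register_tokens()


class Broken(Base):
    def __init__(self):
        super().__init__()         # hook runs before self.tokenizer exists
        self.tokenizer = object()

    def register_tokens(self):
        _ = self.tokenizer         # AttributeError: no attribute 'tokenizer'


class Fixed(Base):
    def __init__(self):
        self.tokenizer = object()  # set the attribute first
        super().__init__()         # hook now finds self.tokenizer

    def register_tokens(self):
        _ = self.tokenizer


Fixed()     # works
# Broken()  # raises the same AttributeError reported in this issue
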
lldhliu commented 8 months ago

@hiyouga After the adjustment above it runs, but every chat response is just 111111: (screenshot)

hiyouga commented 8 months ago

Under Advanced settings, set the template to vanilla.

lldhliu commented 8 months ago

@hiyouga I switched to vanilla and it still doesn't seem to work: (screenshot)

hiyouga commented 8 months ago

Update the code and use the codegeex2 template (a67a440644687dc2262134c0f2895f3ae42cae19). Also note that CodeGeeX2 is not a chat model; it only provides code continuation, so it will not stop generating on its own.
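
To illustrate the continuation-only behavior outside of LLaMA-Factory, here is a rough sketch of plain code completion with the base model, following the usual trust_remote_code loading recipe for THUDM models; treat the exact arguments as assumptions, and cap generation with max_new_tokens since the model will not stop on its own:

# Sketch: CodeGeeX2 as a plain code-completion model (no chat turns).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True).half().cuda().eval()

prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Cap the output length explicitly: the model keeps continuing otherwise.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))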