hikariming / chat-dataset-baseline

A hand-curated Chinese dialogue dataset, plus fine-tuning code for ChatGLM

Error when running data_uilts #61

Open z1968357787 opened 1 year ago

z1968357787 commented 1 year ago

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpg1hbjeku
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpg1hbjeku/_remote_module_non_scriptable.py
INFO:lightning_fabric.utilities.seed:Global seed set to 42
Traceback (most recent call last):
  File "/home/cike/zzp/alpaca/chatglm_finetuning/data_utils.py", line 272, in &lt;module&gt;
    tokenizer, config, _, _ = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer, config_class_name=ChatGLMConfig)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
    tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
    tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in __init__
    self.sp_tokenizer = SPTokenizer(vocab_file)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in __init__
    self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
    self._configure_tokenizer(
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
    text_tokenizer.refresh()
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
    self.sp.Load(model_proto=self.proto.SerializeToString())
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 904, in Load
    return self.LoadFromSerializedProto(model_proto)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 250, in LoadFromSerializedProto
    return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
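For context on the final error: sentencepiece refuses to load a serialized model proto in which the same piece (here `[MASK]`) appears more than once, which typically means the special tokens were registered twice before `LoadFromSerializedProto` was called. A minimal pure-Python sketch of that uniqueness check (a hypothetical helper for illustration, not sentencepiece's actual C++ code):

```python
def check_unique_pieces(pieces):
    """Mimic sentencepiece's internal constraint: every piece in a
    model proto must be defined exactly once, or loading fails."""
    seen = set()
    for piece in pieces:
        if piece in seen:
            # sentencepiece reports this as: RuntimeError: Internal: <piece> is already defined.
            raise RuntimeError(f"Internal: {piece} is already defined.")
        seen.add(piece)
    return True

# A vocabulary with unique special tokens loads fine:
check_unique_pieces(["[MASK]", "[gMASK]", "[CLS]"])

# Registering [MASK] a second time reproduces the error in the traceback:
try:
    check_unique_pieces(["[MASK]", "[gMASK]", "[MASK]"])
except RuntimeError as e:
    print(e)  # Internal: [MASK] is already defined.
```

This suggests the duplicate registration happens upstream of sentencepiece, in the tokenizer setup path shown in the traceback (`icetk`'s `refresh()` serializing a proto that already contains the special tokens).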