hikariming / chat-dataset-baseline

A hand-curated Chinese dialogue dataset, plus fine-tuning code for ChatGLM
1.13k stars · 95 forks

Is there a problem with this dataset? merge.py fails when run on it #58

Open z1968357787 opened 1 year ago

z1968357787 commented 1 year ago

```
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)
```

hikariming commented 1 year ago

This is probably because one of the team members accidentally left a trailing comma (or similar) at the end of a dataset file while annotating. We'll check it shortly.
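To find which annotation file has the stray comma without bisecting by hand, each JSON file can be validated individually before merging; `json.JSONDecodeError` reports the exact line and column. A minimal sketch (the helper name and glob pattern are illustrative, not part of merge.py):

```python
import glob
import json

def find_bad_files(pattern):
    """Try json.load on every file matching pattern; return a list of
    (path, line, column, message) for each file that fails to parse."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as fp:
                json.load(fp)
        except json.JSONDecodeError as e:
            # lineno/colno point at the offending character, e.g. a
            # trailing comma before the closing bracket
            bad.append((path, e.lineno, e.colno, e.msg))
    return bad
```

Running this over the dataset directory before merge.py pinpoints the broken file directly instead of failing partway through the merged output.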

z1968357787 commented 1 year ago

Great, thanks a lot for the reply. One more question: when I run the model from the success case on an English dataset, I get the error below. Is there a problem in the code?

```
Traceback (most recent call last):
  File "/home/cike/zzp/alpaca/chatglm_finetuning/train.py", line 121, in <module>
    tokenizer, config, _, _ = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer, config_class_name=ChatGLMConfig)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
    tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
    tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in __init__
    self.sp_tokenizer = SPTokenizer(vocab_file)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in __init__
    self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
    self._configure_tokenizer(
  File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
    text_tokenizer.refresh()
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
    self.sp.Load(model_proto=self.proto.SerializeToString())
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 904, in Load
    return self.LoadFromSerializedProto(model_proto)
  File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 250, in LoadFromSerializedProto
    return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
```


hikariming commented 1 year ago

Hi, that's quite possible. Since this dataset is typed out by hand, punctuation and similar slips can creep in and break the merge. I'll go back, test it, and update merge.py.
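One way merge.py could tolerate hand-typed files is to strip trailing commas before retrying the parse. This is a hedged sketch, not the repo's actual fix: the regex below does not respect string literals, so a value that itself contains `",]"` or `",}"` would be corrupted. It is only suitable as a quick repair pass over annotation files, after which the cleaned data should be re-saved as strict JSON.

```python
import json
import re

def load_lenient(path):
    """Load a JSON file; on failure, strip trailing commas before
    "]" or "}" and retry once. Re-raises if the file is still invalid."""
    with open(path, encoding="utf-8") as fp:
        text = fp.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Remove a comma that directly precedes a closing bracket/brace.
        # NOTE: naive -- does not skip commas inside string values.
        cleaned = re.sub(r",\s*([\]}])", r"\1", text)
        return json.loads(cleaned)
```

Even with a lenient loader, reporting the repaired files (or fixing them in place) is preferable, so the dataset itself stays valid JSON for other consumers.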