wangzhaode / llm-export

llm-export can export LLM models to ONNX.
Apache License 2.0

Exporting the llama3-8B tokenizer fails with an error #43

Closed cdliang11 closed 2 months ago

cdliang11 commented 2 months ago

Model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Traceback (most recent call last):
  File "llm_export.py", line 1257, in <module>
    llm_exporter = llm_models[model_type](args)
  File "llm_export.py", line 913, in __init__
    super().__init__(args)
  File "llm_export.py", line 101, in __init__
    self.sp_model = spm.SentencePieceProcessor(tokenizer_model)
  File "/jfs-hdfs/user/chengdong01.liang/anaconda3/envs/hbdk4/lib/python3.8/site-packages/sentencepiece/__init__.py", line 468, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/jfs-hdfs/user/chengdong01.liang/anaconda3/envs/hbdk4/lib/python3.8/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/jfs-hdfs/user/chengdong01.liang/anaconda3/envs/hbdk4/lib/python3.8/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ./Meta-Llama-3-8B-Instruct/tokenizer.model
cdliang11 commented 2 months ago

I've located the problem. In the https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repository, tokenizer_config.json declares the tokenizer class as PreTrainedTokenizerFast, i.e. a standard Hugging Face tokenizer. However, llm_export.py sees a tokenizer.model file in the model folder and therefore assumes the tokenizer is in SentencePiece format.

Root cause of the error: the original llama3-8b model uses the tiktoken tokenizer format. The https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repository converted it to the Hugging Face format but did not delete tokenizer.model 😮‍💨