Abnormal inference results after converting the official llama2-7b model to bmodel

tensorflowt commented 10 months ago

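Problem description:

Following the official sophgo llama2-7b workflow, I converted the model to onnx and then to bmodel. Both the int4 and the int8 quantized models give abnormal inference results, as follows:

Inference with the int8 bmodel: it waits indefinitely without producing any output; after a while the process hangs, the session on the box is dropped, and I am returned to the host machine.

Inference with the int4 bmodel:

However, inference with the llama2-7b int4 model provided officially by sophgo works correctly.

The int4 and int8 bmodels I converted myself:

int8 link: https://pan.baidu.com/s/1ClPun3dT5nupbtJdud3KJA extraction code: f22f
int4 link: https://pan.baidu.com/s/145c4aTpsX3XgyEgyfwJMyA extraction code: kyug

Could someone from sophgo please help resolve this as soon as possible? Thanks 🙏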

gfwsbsbsb commented 10 months ago

It may be an mlir version issue. https://github.com/sophgo/Llama2-TPU You can compile the model by following this repo; if that does not work, please open an issue in that repo.

WaitDumplings commented 10 months ago

Please use FP16 for lm_head and embedding; only the blocks should use int4/int8.
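
As a rough illustration of that split, here is a minimal sketch of a per-component precision plan. The helper `precision_plan`, the component names, and the mode strings are hypothetical and exist only for this example; they are not tpu-mlir flags or part of any sophgo script. The block count of 32 is the one mentioned later in this thread for llama2-7b.

```python
def precision_plan(num_blocks, block_mode):
    """Map each model component to the precision it should be compiled with.

    Hypothetical helper for illustration only: embedding and lm_head stay
    in fp16, and only the transformer blocks get the low-bit mode.
    """
    if block_mode not in ("int4", "int8"):
        raise ValueError(f"unsupported block mode: {block_mode!r}")
    plan = {"embedding": "fp16", "lm_head": "fp16"}
    for i in range(num_blocks):
        plan[f"block_{i}"] = block_mode
    return plan


plan = precision_plan(num_blocks=32, block_mode="int4")
print(plan["embedding"], plan["lm_head"], plan["block_0"])
```

Checking a compile script against a plan like this makes it easy to spot a case where lm_head or embedding was quantized by accident.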

tensorflowt commented 10 months ago

It may be an mlir version issue. https://github.com/sophgo/Llama2-TPU You can compile the model by following this repo; if that does not work, please open an issue in that repo.

What is the difference between Llama2-TPU/compile/compile.sh and sophon-demo-release/sample/Llama2/script/compile/compile.sh?

My conversion is currently based on sophon-demo-release/sample/Llama2/script/compile/compile.sh.

https://github.com/sophgo/sophon-demo https://github.com/sophgo/Llama2-TPU

WaitDumplings commented 10 months ago

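Please follow Llama2-TPU/compile/compile.sh, since our toolchain colleagues updated it later. Once the bmodel is converted, you can continue with the steps here. The two work on the same principle (the compile script here has since been updated to match the one there).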

tensorflowt commented 10 months ago

I converted the model with Llama2-TPU/compile/compile.sh and still hit the same problem.

WaitDumplings commented 10 months ago

@tensorflowt When compiling, please check whether lm_head and embedding are in FP16. This looks like a precision mismatch.

tensorflowt commented 10 months ago

The compile.sh script is attached: compile.sh.zip The conversion command I ran is: ./compile.sh --mode int8 --name llama2-7b

WaitDumplings commented 10 months ago

You can contact Sophgo's sales and service colleagues to get the latest mlir.

tensorflowt commented 10 months ago

Would this project from the official site work? https://github.com/sophgo/tpu-mlir As far as I remember, that is the project I downloaded.

WaitDumplings commented 10 months ago

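It should work. You could first check the exported onnx: if you convert only a single block, does it align with the torch model? If the onnx is aligned as well, we can then discuss whether something else is wrong.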
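
One way to make "aligned" concrete is to feed the same input to a single torch block and to the exported onnx block, flatten both outputs to lists of floats, and compare them numerically. The sketch below only covers the comparison step. The function names and the thresholds (`min_cosine`, `max_diff`) are assumptions chosen for illustration, not values from sophgo, and the two sample lists stand in for real outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally long sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero outputs count as identical; otherwise no similarity.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def outputs_aligned(reference, candidate, min_cosine=0.999, max_diff=1e-2):
    """Return True when the candidate output tracks the reference closely.

    The thresholds are illustrative assumptions; tune them for your model.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return (cosine_similarity(reference, candidate) >= min_cosine
            and max_abs_diff(reference, candidate) <= max_diff)


# Stand-in values; in practice these come from the torch block and the
# onnx block run on the same input, flattened to plain lists.
torch_out = [0.12, -0.53, 1.75, 0.04]
onnx_out = [0.1201, -0.5299, 1.7502, 0.0399]
print(outputs_aligned(torch_out, onnx_out))
```

Running the same check block by block helps locate the first stage at which the outputs diverge.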

tensorflowt commented 10 months ago

OK. There is one more thing I would like to confirm: there are currently two scripts for converting torch to onnx. One is https://github.com/sophgo/Llama2-TPU/blob/main/compile/export_onnx_fast.py and the other is https://github.com/sophgo/sophon-demo/blob/release/sample/Llama2/script/compile/export_onnx.py What is the difference between them? At the moment export_onnx.py converts successfully, but export_onnx_fast.py fails!

tensorflowt commented 10 months ago

The error log is as follows: error

WaitDumplings commented 10 months ago

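Both aim to produce a bmodel. You can also follow the full workflow of the latter; once you have the bmodel, continue with the steps of this sample.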

WaitDumplings commented 10 months ago

The error most likely arises because 512 is the seq length, but here you passed the block count, 32, as the seq length. Please check the code carefully.

tensorflowt commented 10 months ago

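Hello! We converted the model and ran inference with the workflow below, and the results are still abnormal. Details:

tpu-mlir: latest master branch, git@github.com:sophgo/tpu-mlir.git
sophon-demo: latest master branch, git@github.com:sophgo/sophon-demo.git
Llama2-TPU: latest master branch, git@github.com:sophgo/Llama2-TPU.git

Workflow:

Convert torch to onnx
Quantize onnx to bmodel

Inference on the device is still abnormal, while swapping in the bmodel provided on Baidu Cloud gives correct results.

What could be causing this?

Is there a way to fix it?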

tensorflowt commented 10 months ago

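It should work. You could first check the exported onnx: if you convert only a single block, does it align with the torch model? If the onnx is aligned as well, we can then discuss whether something else is wrong.

How do I align a single-block model against the full torch model? Is there a recommended method for this?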

WaitDumplings commented 9 months ago

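For int4, as far as I remember, you also need to add a group size parameter. I will ask the team here to update https://github.com/sophgo/Llama2-TPU and then you can recompile the int4 model. Without a group size the quantization is per channel, and LLM accuracy will be very poor.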
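
To see why the group size matters, here is a minimal sketch comparing per-channel and per-group symmetric int4 quantization on one weight row that contains a single outlier. It is a generic illustration, not the tpu-mlir implementation; the function names, the sample row, and the group size of 4 are assumptions made for this example.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-trip quantization using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for int4
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    # Round to the nearest level and clamp to the signed range [-8, 7].
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale
            for v in values]


def per_channel(row, bits=4):
    """One scale for the whole row (one output channel)."""
    return quantize_dequantize(row, bits)


def per_group(row, group_size, bits=4):
    """One scale per consecutive group of `group_size` weights."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_dequantize(row[start:start + group_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Illustrative row: small weights plus one large outlier at the end.
row = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 7.0]
err_channel = mean_abs_error(row, per_channel(row))
err_group = mean_abs_error(row, per_group(row, group_size=4))
print(f"per-channel error: {err_channel:.5f}")
print(f"per-group error:   {err_group:.5f}")
```

With a single scale for the row, the outlier sets the step size and every small weight rounds to zero. With groups, only the group that contains the outlier pays that cost, so the overall error is lower.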

sophgo / sophon-demo

Abnormal inference results after converting the official llama2-7b model to bmodel #14