louwangzhiyuY opened 10 months ago
The script https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/custom_models/stateless_llama.py has the flag --hf_model_name=. Which models does it support?
Example command for direct MLIR of the model:
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="meta-llama/Llama-2-7b-chat-hf" --precision="f32" --hf_auth_token=
Example if you want external parameters (which can significantly reduce the RAM required), reduced precision, or quantization:
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="llSourcell/medllama2_7b" --precision=f16 --quantization=int4 --external_weights=safetensors
Then generate the quantized external weights as safetensors for use at inference:
python python/turbine_models/gen_external_params/gen_external_params.py --quantization=int4 --precision=f16
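Putting the two steps together, here is a minimal end-to-end sketch (the --hf_model_name flag on gen_external_params.py and the ordering of the steps are assumptions, not verified against the scripts):

```sh
# 1. Generate quantized external weights as safetensors
#    (--hf_model_name here is an assumption; check the script's flags)
python python/turbine_models/gen_external_params/gen_external_params.py \
  --hf_model_name="llSourcell/medllama2_7b" \
  --quantization=int4 --precision=f16

# 2. Compile the model to a vmfb that loads those external weights at runtime
python python/turbine_models/custom_models/stateless_llama.py \
  --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" \
  --precision=f16 --quantization=int4 \
  --external_weights=safetensors
```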
So far this has really only been tested with a few models such as "meta-llama/Llama-2-7b-chat-hf" and "llSourcell/medllama2_7b", so it is likely that not all llama models will work, but any feedback on anything that doesn't work is appreciated.
Thanks IanNod for the reply.
I tried to run the script you mentioned on my system, but it always fails.
command: python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="llSourcell/medllama2_7b" --precision=f16 --quantization=int4 --external_weights=safetensors
message:
-turbine.env/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:477 in wrapped:3447 in forward (_meta_registrations.py:3515 in common_meta_baddbmm_bmm)
[2023-12-08 11:08:03,690] [1/0] torch.fx.experimental.symbolic_shapes: [INFO] eval Eq(s63 + 1, s62 + 1) [guard added] at
Invoked with: iree-compile /home/shark/shark/SHARK-Turbine/shark-turbine.env/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile - --iree-input-type=auto --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-embedded-linker-path=/home/shark/shark/SHARK-Turbine/shark-turbine.env/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-input-type=torch --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False --iree-llvmcpu-enable-microkernels
Need more information? Set IREE_SAVE_TEMPS=/some/dir in your environment to save all artifacts and reproducers.
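The message suggests saving the compile artifacts; for instance, a sketch of re-running with temp saving enabled (the directory path is only an illustration):

```sh
# Save all IREE compile artifacts and reproducers for debugging
export IREE_SAVE_TEMPS=/tmp/iree-save-temps
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" --precision=f16 \
  --quantization=int4 --external_weights=safetensors
```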
Does it need a specific iree-compile version?
If I add --device=vulkan, I get different error information.
I fixed the previous issue after commenting out flags.append("--iree-llvmcpu-enable-microkernels") or changing it to ukernels.
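As a sketch, the workaround amounts to this change in the flag list that stateless_llama.py passes to iree-compile (the surrounding list contents are assumed from the invocation logged above):

```python
# Flags passed to iree-compile (sketch; values taken from the logged invocation)
flags = [
    "--iree-llvmcpu-target-cpu-features=host",
    "--iree-llvmcpu-target-triple=x86_64-linux-gnu",
]
# The old flag name was removed in newer IREE releases:
# flags.append("--iree-llvmcpu-enable-microkernels")
# Either drop it entirely, or use the renamed flag:
flags.append("--iree-llvmcpu-enable-ukernels")
```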
The CPU issue was due to an IREE bump that modified a flag; it was fixed earlier with https://github.com/nod-ai/SHARK-Turbine/pull/226, so updating to the latest should fix that.
The Vulkan issue looks like it is missing the target triple for the device you are compiling for. You can provide that by adding the flag --iree_target_triple="target triple for your device".
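For example, a sketch of a Vulkan compile (the triple rdna2-unknown-linux is only an illustration for an AMD RDNA2 GPU on Linux; substitute the triple for your own device):

```sh
python python/turbine_models/custom_models/stateless_llama.py \
  --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" \
  --precision=f16 --quantization=int4 \
  --external_weights=safetensors \
  --device=vulkan \
  --iree_target_triple="rdna2-unknown-linux"
```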
Thanks IanNod, I can run the script with the model llSourcell/medllama2_7b.
But I encounter another issue when I try to convert the model Qwen/Qwen-7B-Chat.
command: python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="Qwen/Qwen-7B-Chat" --precision=f16 --quantization=int4 --external_weights=safetensors
error message:
Any idea about the error?
I have successfully executed the shark project using the llama large language model, and it works well. The model was sourced from shark_tank in MLIR format. I would like to run another large language model with shark, for which I need to convert it to MLIR format and compile it into a VMFB file. Could you provide the necessary steps or sample code for this process?
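As a starting point (a sketch, not an official answer), the thread above shows stateless_llama.py producing a vmfb in one step via --compile_to=vmfb. If you already have the model as an .mlir file, the compile step alone can be done with iree-compile, reusing the flags from the invocation logged earlier; model.mlir and model.vmfb are placeholder names:

```sh
# Compile torch-dialect MLIR to a vmfb for the local CPU
iree-compile model.mlir \
  --iree-input-type=torch \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  -o model.vmfb
```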