louwangzhiyuY opened 10 months ago
The script https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/custom_models/stateless_llama.py has the flag --hf_model_name=. Which models does it support?
Example command for direct MLIR of the model:
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="meta-llama/Llama-2-7b-chat-hf" --precision="f32" --hf_auth_token=
Example if you want external parameters (which can significantly reduce the RAM required), reduced precision, or quantization:
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="llSourcell/medllama2_7b" --precision=f16 --quantization=int4 --external_weights=safetensors
Then generate the quantized external weights as safetensors for use at inference:
python python/turbine_models/gen_external_params/gen_external_params.py --quantization=int4 --precision=f16
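Putting the two steps together, here is a minimal end-to-end sketch (the --hf_model_name flag on gen_external_params.py and the ordering of the steps are assumptions, not verified against the scripts):

```sh
# 1. Generate quantized external weights as safetensors
#    (--hf_model_name here is an assumption; check the script's flags)
python python/turbine_models/gen_external_params/gen_external_params.py \
  --hf_model_name="llSourcell/medllama2_7b" \
  --quantization=int4 --precision=f16

# 2. Compile the model to a vmfb that loads those external weights at runtime
python python/turbine_models/custom_models/stateless_llama.py \
  --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" \
  --precision=f16 --quantization=int4 \
  --external_weights=safetensors
```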
So far this has really only been tested with a few models such as "meta-llama/Llama-2-7b-chat-hf" and "llSourcell/medllama2_7b", so it is likely that not all llama models will work, but any feedback on anything that doesn't work is appreciated.
Thanks IanNod for the reply.
I tried to run the script you mentioned on my system, but it always fails.
command: python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="llSourcell/medllama2_7b" --precision=f16 --quantization=int4 --external_weights=safetensors
message:
-turbine.env/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:477 in wrapped:3447 in forward (_meta_registrations.py:3515 in common_meta_baddbmm_bmm)
[2023-12-08 11:08:03,690] [1/0] torch.fx.experimental.symbolic_shapes: [INFO] eval Eq(s63 + 1, s62 + 1) [guard added] at
Invoked with: iree-compile /home/shark/shark/SHARK-Turbine/shark-turbine.env/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile - --iree-input-type=auto --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-embedded-linker-path=/home/shark/shark/SHARK-Turbine/shark-turbine.env/lib/python3.11/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-input-type=torch --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False --iree-llvmcpu-enable-microkernels
Need more information? Set IREE_SAVE_TEMPS=/some/dir in your environment to save all artifacts and reproducers.
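The message suggests saving the compile artifacts; for instance, a sketch of re-running with temp saving enabled (the directory path is only an illustration):

```sh
# Save all IREE compile artifacts and reproducers for debugging
export IREE_SAVE_TEMPS=/tmp/iree-save-temps
python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" --precision=f16 \
  --quantization=int4 --external_weights=safetensors
```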
Does it need a specific iree-compile version?
If I add --device=vulkan, I get different error information.
I fixed the previous issue after commenting out flags.append("--iree-llvmcpu-enable-microkernels") or changing it to ukernels.
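As a sketch, the workaround amounts to this change in the flag list that stateless_llama.py passes to iree-compile (the surrounding list contents are assumed from the invocation logged above):

```python
# Flags passed to iree-compile (sketch; values taken from the logged invocation)
flags = [
    "--iree-llvmcpu-target-cpu-features=host",
    "--iree-llvmcpu-target-triple=x86_64-linux-gnu",
]
# The old flag name was removed in newer IREE releases:
# flags.append("--iree-llvmcpu-enable-microkernels")
# Either drop it entirely, or use the renamed flag:
flags.append("--iree-llvmcpu-enable-ukernels")
```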
The CPU issue was due to an IREE bump that modified a flag; it was fixed earlier with https://github.com/nod-ai/SHARK-Turbine/pull/226, so updating to the latest should fix that.
The Vulkan issue looks like it is missing the target triple for the device you are compiling for. You can provide that by adding the flag --iree_target_triple="target triple for your device".
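For example, a sketch of a Vulkan compile (the triple rdna2-unknown-linux is only an illustration for an AMD RDNA2 GPU on Linux; substitute the triple for your own device):

```sh
python python/turbine_models/custom_models/stateless_llama.py \
  --compile_to=vmfb \
  --hf_model_name="llSourcell/medllama2_7b" \
  --precision=f16 --quantization=int4 \
  --external_weights=safetensors \
  --device=vulkan \
  --iree_target_triple="rdna2-unknown-linux"
```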
Thanks IanNod, I can run the script with the model llSourcell/medllama2_7b.
But I encounter another issue when I try to convert the model Qwen/Qwen-7B-Chat.
command: python python/turbine_models/custom_models/stateless_llama.py --compile_to=vmfb --hf_model_name="Qwen/Qwen-7B-Chat" --precision=f16 --quantization=int4 --external_weights=safetensors
error message:
Any idea about the error?
I have successfully executed the shark project using the llama large language model, and it works well. The model was sourced from shark_tank in MLIR format. I would like to run another large language model with shark, for which I need to convert it to MLIR format and compile it into a VMFB file. Could you provide the necessary steps or sample code for this process?
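As a starting point (a sketch, not an official answer), the thread above shows stateless_llama.py producing a vmfb in one step via --compile_to=vmfb. If you already have the model as an .mlir file, the compile step alone can be done with iree-compile, reusing the flags from the invocation logged earlier; model.mlir and model.vmfb are placeholder names:

```sh
# Compile torch-dialect MLIR to a vmfb for the local CPU
iree-compile model.mlir \
  --iree-input-type=torch \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  -o model.vmfb
```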