Open · tiger-of-shawn opened this issue 3 days ago
The error seems to be related to 'tie_word_embeddings'. I will try to work on a fix soon.
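For anyone who wants to poke at this in the meantime, here is a minimal sketch (my own, not an official fix) that checks whether the checkpoint actually ties 'lm_head' to the token embedding and, as a hypothetical workaround, gives the head its own untied copy of the weight. It assumes the Hugging Face transformers Qwen2.5-0.5B-Instruct checkpoint; the variable names are illustrative.

```python
# Sketch only: inspect (and optionally untie) the shared lm_head weight before export.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print("tie_word_embeddings:", model.config.tie_word_embeddings)

emb = model.get_input_embeddings()    # nn.Embedding(151936, 896)
head = model.get_output_embeddings()  # nn.Linear(896, 151936, bias=False)
print("shared storage:", emb.weight.data_ptr() == head.weight.data_ptr())

# Hypothetical workaround: materialize a separate copy so downstream export
# code no longer has to resolve the tied weight.
if model.config.tie_word_embeddings:
    head.weight = torch.nn.Parameter(emb.weight.detach().clone())
    model.config.tie_word_embeddings = False
```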
Hi, @tiger-of-shawn, thanks for your feedback! We have released the latest NeuroPilot Express SDK for ExecuTorch. This update includes optimizations specifically addressing the issue you highlighted. Please give it a try!
Thank you for your response; it’s working perfectly now.
I ran the sample application on the MTK 9000: prefill 990 tokens/s, decode 61 tokens/s.
```
I 00:00:01.045045 executorch:mtk_llama_executor_runner.cpp:194] Done analyzing prompt in 0.129182 sec (990.850118 tok/s)
I 00:00:04.956007 executorch:mtk_llama_executor_runner.cpp:296] Token generation speed: 61.639103 tok/s
```
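(Back-of-the-envelope check, not from the log itself: 0.129182 s at 990.85 tok/s is about 128 tokens, i.e. the prompt fills one 128-token prefill chunk.)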
Some logs:

```
source shell_scripts/export_llama.sh qwen2 "" "" "" llama3.txt
```
```
checkpoint_files: ['models/llm_models/weights/Qwen2.5-0.5B-Instruct/model.safetensors']
Preparing Model Calibration Inputs...
Exporting Chunk 0 to PTE
Getting pre autograd ATen Dialect Graph
model info: Qwen2ModelChunk(
  (layers): ModuleList(
    (0-23): 24 x Qwen2DecoderLayer(
      (self_attn): Qwen2Attention(
        (q_proj): Linear(in_features=896, out_features=896, bias=True)
        (k_proj): Linear(in_features=896, out_features=128, bias=True)
        (v_proj): Linear(in_features=896, out_features=128, bias=True)
        (o_proj): Linear(in_features=896, out_features=896, bias=False)
      )
      (mlp): Qwen2MLP(
        (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
        (down_proj): Linear(in_features=4864, out_features=896, bias=False)
        (up_proj): Linear(in_features=896, out_features=4864, bias=False)
      )
      (input_norm): RMSNorm()
      (post_attention_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (lm_head): Linear(in_features=896, out_features=151936, bias=False)
)
W1015 10:29:36.177991 578378 torch/_export/__init__.py:64] +============================+
W1015 10:29:36.178128 578378 torch/_export/__init__.py:65] |      !!! WARNING !!!       |
W1015 10:29:36.178169 578378 torch/_export/__init__.py:66] +============================+
W1015 10:29:36.178198 578378 torch/_export/__init__.py:67] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W1015 10:29:36.178226 578378 torch/_export/__init__.py:68] Please switch to use torch.export.export_for_training instead.
Batch: 100%|██████████| 10/10 [00:05<00:00,  1.86it/s]
Calibrating Model: 100%|██████████| 1/1 [00:13<00:00, 13.90s/it]
Getting ATen Dialect Graph
Exporting Shape 128t512c to: pte/Qwen2.5-0.5B-Instruct_A16W4_1_chunks_128t512c/Qwen2.5-0.5B-Instruct_A16W4_1_chunks_128t512c_0.pte
example_input shape: torch.Size([1, 128, 896])
Lowering to Edge Dialect Graph
Delegating Edge Program to Neuropilot Backend
```
```
Traceback (most recent call last):
  File "/home/qwen/executorch/examples/mediatek/model_export_scripts/qwen2.py", line 491, in <module>
    main()
  File "/home/qwen/executorch/examples/mediatek/model_export_scripts/qwen2.py", line 477, in main
    export_to_et_ir(
  File "/home/qwen/executorch/examples/mediatek/model_export_scripts/qwen2.py", line 362, in export_to_et_ir
    delegated_program = edge_program.to_backend(partitioner)
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/exir/program/_program.py", line 1288, in to_backend
    new_edge_programs[name] = to_backend(program, partitioner)
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/functools.py", line 878, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/exir/backend/backend_api.py", line 387, in
    tagged_graph_module = _partition_and_lower(
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/exir/backend/backend_api.py", line 310, in _partition_and_lower
    partitioned_module = _partition_and_lower_one_graph_module(
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/exir/backend/backend_api.py", line 249, in _partition_and_lower_one_graph_module
    lowered_submodule = to_backend(
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/functools.py", line 878, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/exir/backend/backend_api.py", line 113, in
    preprocess_result: PreprocessResult = cls.preprocess(
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/executorch/backends/mediatek/preprocess.py", line 68, in preprocess
    model_bytes = mtk_neuron.compile(mlir_str, " ".join(compile_options))
  File "/home/qwen/miniconda3/envs/et_qnn_2/lib/python3.10/site-packages/mtk_neuron/mtk_neuron.py", line 127, in compile
    raise RuntimeError(f'Compile error:\n{status["log"]}')
RuntimeError: Compile error:
NIR[1761]: FullyConnectedLayer
 ├ MDLA: Dimension should be <= 65535. Operand: 1 got <151936 x 896>.
 ├ MDLA: Dimension should be <= 65535. Result : 0 got <128 x 151936>.
 ├ EDPA: unsupported operation
WARNING: Failed to process the supernode.
```
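For context, the failing FullyConnectedLayer is the 'lm_head': Qwen2.5's 151,936-entry vocabulary exceeds the MDLA per-dimension limit of 65,535 (weight <151936 x 896>, result <128 x 151936>), and, as the comment above suggests, Qwen2.5-0.5B ties 'lm_head' to the token embedding, so that oversized matrix ends up in the delegated graph. The updated NeuroPilot Express SDK mentioned earlier presumably handles this internally; purely as an illustration, here is a rough sketch of one possible workaround that splits the output projection into slices that each stay under the limit (ChunkedLMHead and max_out are made-up names, not part of the SDK or ExecuTorch).

```python
# Sketch only: replace one huge vocab projection with several smaller slices.
import torch
import torch.nn as nn

class ChunkedLMHead(nn.Module):
    """Drop-in replacement for a large nn.Linear lm_head, built from smaller slices."""

    def __init__(self, lm_head: nn.Linear, max_out: int = 65535):
        super().__init__()
        # Split the [out_features, in_features] weight along the output dimension.
        splits = torch.split(lm_head.weight.detach(), max_out, dim=0)
        self.heads = nn.ModuleList()
        for w in splits:
            linear = nn.Linear(lm_head.in_features, w.shape[0], bias=False)
            linear.weight = nn.Parameter(w.clone())
            self.heads.append(linear)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Each slice emits at most max_out logits; concatenation restores the full vocab.
        return torch.cat([head(hidden) for head in self.heads], dim=-1)

# Usage sketch (hypothetical): swap the head in before lowering/delegation.
# model.lm_head = ChunkedLMHead(model.lm_head)
```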