quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License

[Genie] fails to generate genie-compatible QNN binaries #98

Open taeyeonlee opened 2 days ago

taeyeonlee commented 2 days ago

Dear Qualcomm,

It fails to generate Genie-compatible QNN binaries when following the guide (https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama). The error is: FileNotFoundError: Unable to find the model source file, invalid path: /mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp

    The ONNX file exists: /export/intermediate_data/input_models/model_pp_0/Llama2_PromptProcessor_1_Quantized.aimet/Llama2_PromptProcessor_1_Quantized.onnx
    but the CPP file is missing: /export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp
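
For reference, a quick local check (not part of the repo scripts) of which converter outputs are present, using the paths from the log below:

    # Quick check: list which expected converter outputs exist.
    # Paths are copied from the log below; adjust if your output dir differs.
    from pathlib import Path

    cpp_dir = Path("./export/intermediate_data/cpp_models")
    for name in ("Llama2_PromptProcessor_1_Quantized.cpp",
                 "Llama2_PromptProcessor_1_Quantized.bin"):
        path = cpp_dir / name
        print(f"{path}: {'found' if path.exists() else 'MISSING'}")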

Error log


(qct_python310_VENV_root) taeyeon@taeyeon-PC:/hdd/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama$ python gen_ondevice_llama.py --hub-model-id mjqyxjlvm,mdno2o9gq,m7n1x51rq,m7qk5ozxq,mpn7y95jn,mknj6orxq,mrmdwz90q,mrmdwo8oq --output-dir ./export --tokenizer-zip-path ./tokenizer.zip --target-gen snapdragon-gen3 --target-os android
Using previously extracted model
Generating model lib for split-0 pp
./export/intermediate_data/input_data/data_pp_0.h5 export/intermediate_data/input_data/data_pp_0/_data
/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-onnx-converter --input_network ./export/intermediate_data/input_models/model_pp_0/Llama2_PromptProcessor_1_Quantized.aimet/Llama2_PromptProcessor_1_Quantized.onnx --output_path ./export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp --no_simplification --quantization_overrides ./export/intermediate_data/input_models/model_pp_0/Llama2_PromptProcessor_1_Quantized.aimet/Llama2_PromptProcessor_1_Quantized.encodings --preserve_io layout input_ids attention_mask position_ids_cos position_ids_sin layers_7_add_out_0 past_key_0_out past_value_0_out past_key_1_out past_value_1_out past_key_2_out past_value_2_out past_key_3_out past_value_3_out past_key_4_out past_value_4_out past_key_5_out past_value_5_out past_key_6_out past_value_6_out past_key_7_out past_value_7_out --input_layout attention_mask NONTRIVIAL --input_layout position_ids_cos NONTRIVIAL --input_layout position_ids_sin NONTRIVIAL --input_list export/intermediate_data/input_data/data_pp_0/_data/input_list.txt --input_dim input_ids 1,1024 --input_dim attention_mask 1,1,1024,1024 --input_dim position_ids_cos 1,1,1024,64 --input_dim position_ids_sin 1,1,1024,64 --input_dtype input_ids int32 --input_dtype attention_mask float32 --input_dtype position_ids_cos float32 --input_dtype position_ids_sin float32 --bias_bitwidth 32 --weight_bw 8 --act_bitwidth 16
2024-09-12 11:44:39,286 - 235 - INFO - Processing user provided quantization encodings:
2024-09-12 11:48:16,882 - 240 - WARNING - WARNING_CAST_TYPE: Only numerical type cast is supported. The op: /model/model/layers.0/input_layernorm/Cast will be interpreted at conversion time
2024-09-12 11:48:16,892 - 240 - WARNING - WARNING_CAST_TYPE: Only numerical type cast is supported. The op: /model/model/layers.0/input_layernorm/Cast_1 will be interpreted at conversion time
2024-09-12 11:48:17,879 - 240 - WARNING - WARNING_CAST_TYPE: Only numerical type cast is supported. The op: /model/model/layers.0/self_attn/Cast will be interpreted at conversion time
...
2024-09-12 11:48:27,826 - 240 - WARNING - WARNING_CAST_TYPE: Only numerical type cast is supported. The op: /model/model/layers.7/post_attention_layernorm/Cast_1 will be interpreted at conversion time
2024-09-12 11:48:37,409 - 235 - INFO - Processed 7623 quantization encodings
2024-09-12 11:50:31,235 - 240 - WARNING - --weight_bw option is deprecated, use --weights_bitwidth.
IrQuantizer: Param Quantizer should be set to symmetric for 32 bit biases. Will ignore param quantizer option: tf for biases
    28.2ms [  INFO ] Inferences will run in sync mode
    28.9ms [  INFO ] Initializing logging in the backend. Callback: [0x710b47a96770], Log Level: [3]
    29.0ms [  INFO ] No BackendExtensions lib provided;initializing NetRunBackend Interface
     2.4ms [  INFO ] [QNN_CPU] CpuBackend creation start
     2.4ms [  INFO ] [QNN_CPU] CpuBackend creation end
    31.4ms [WARNING] Unable to find a device with NetRunDeviceKeyDefault in Library NetRunBackendLibKeyDefault
    31.4ms [WARNING] Profile Logger with name = defaultKey doesn't exist! Returning nullptr
     3.8ms [  INFO ] [QNN_CPU] QnnContext create start
     3.8ms [  INFO ] [QNN_CPU] QnnContext create end
    33.2ms [  INFO ] Entering QuantizeRuntimeApp flow
    33.2ms [WARNING] Profile Logger with name = defaultKey doesn't exist! Returning nullptr
     4.3ms [  INFO ] [QNN_CPU] CpuGraph creation start
     4.7ms [  INFO ] [QNN_CPU] CpuGraph creation end
     4.7ms [  INFO ] [QNN_CPU] QnnGraph create end
  5407.6ms [  INFO ] [QNN_CPU] QnnGraph finalize start
Generating model lib...
/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-model-lib-generator -c ./export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp -b ./export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.bin -t x86_64-linux-clang -o ./export/intermediate_data/model_libs -l Llama2_PromptProcessor_1_Quantized
2024-09-12 11:52:16,354 -    INFO - qnn-model-lib-generator: Model cpp file path  : export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp
2024-09-12 11:52:16,354 -    INFO - qnn-model-lib-generator: Model bin file path  : export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.bin
2024-09-12 11:52:16,354 -    INFO - qnn-model-lib-generator: Library target       : [['x86_64-linux-clang']]
2024-09-12 11:52:16,354 -    INFO - qnn-model-lib-generator: Library name         : Llama2_PromptProcessor_1_Quantized
2024-09-12 11:52:16,354 -    INFO - qnn-model-lib-generator: Output directory     : export/intermediate_data/model_libs
Traceback (most recent call last):
  File "/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-model-lib-generator", line 495, in <module>
    main()
  File "/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-model-lib-generator", line 489, in main
    result = generator.build_targets(config)
  File "/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-model-lib-generator", line 393, in build_targets
    self._normalize_config(config)
  File "/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-model-lib-generator", line 361, in _normalize_config
    raise FileNotFoundError(f'Unable to find the model source file, invalid path: {config.model_cpp.absolute()}')
FileNotFoundError: Unable to find the model source file, invalid path: /mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/export/intermediate_data/cpp_models/Llama2_PromptProcessor_1_Quantized.cpp
Traceback (most recent call last):
  File "/mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/gen_ondevice_llama.py", line 67, in <module>
    main()
  File "/mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/gen_ondevice_llama.py", line 61, in main
    generate_shared_bins(
  File "/mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/utils.py", line 546, in generate_shared_bins
    generate_lib(
  File "/mnt/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama/utils.py", line 276, in generate_lib
    raise RuntimeError("The QNN graph compiler did not produce the output file")
RuntimeError: The QNN graph compiler did not produce the output file
(qct_python310_VENV_root) taeyeon@taeyeon-PC:~/hdd/hdd/QCT_AI_Hub/ai-hub-models/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama$

Best regards,

gustavla commented 19 hours ago

Hi @taeyeonlee,

Something is causing the converter to exit abruptly before it has finished. The scripts should check the exit status before trying to proceed, so we should fix that on our end. However, that will only make the failure reason clearer; it won't explain why the converter is failing.
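
(For illustration only, a minimal sketch of the kind of check meant here, assuming the pipeline drives qnn-onnx-converter through subprocess; the helper name is hypothetical and the actual code in utils.py may be organized differently.)

    # Sketch only: fail fast if the converter exits non-zero, instead of
    # silently continuing to qnn-model-lib-generator.
    import subprocess

    def run_converter_checked(cmd: list[str]) -> None:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            raise RuntimeError(
                f"qnn-onnx-converter exited with status {result.returncode}; "
                "not continuing to qnn-model-lib-generator"
            )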

The qnn-onnx-converter step should end with outputs like this:

2024-05-14 19:12:07,107 - 235 - INFO - Model CPP saved at: [...].cpp 
2024-05-14 19:12:07,108 - 235 - INFO - Model BIN saved at: [...].bin 
2024-05-14 19:12:07,136 - 235 - INFO - Conversion complete!
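
If you don't see those lines, one way (just a sketch) to re-run only the failing converter step and surface its exit status and error output is something like the following; paste the full argument list exactly as printed in your log above:

    # Sketch: re-run the qnn-onnx-converter invocation from the log above on
    # its own and print the exit status plus the tail of stderr.
    import subprocess

    cmd = [
        "/opt/qcom/aistack/qairt/2.25.0.240728/bin/x86_64-linux-clang/qnn-onnx-converter",
        # ... append every flag exactly as printed in the log above ...
    ]
    completed = subprocess.run(cmd, capture_output=True, text=True)
    print("exit status:", completed.returncode)
    print("\n".join(completed.stderr.splitlines()[-30:]))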

A few things to try:

Let us know what that yields and we'll take it from there. Sorry again about this!