Open KarpovVolodymyr opened 8 months ago
It seems the optimization did not run successfully. In your case, only argmax_sampling was optimized; llama_v2 was not, right?
From the log I cannot tell why the optimization failed. Could you please attach the complete log?
I also have an AMD GPU and a similar issue following the same steps: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190
Was this issue solved? If yes, how?
My Olive logs:

Optimizing llama_v2
[2024-03-07 15:56:48,646] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 15:56:48,647] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 15:56:48,647] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 15:56:48,669] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
[2024-03-07 16:02:23,506] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,058] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,127] [INFO] [engine.py:952:_run_pass] Pass convert:OnnxConversion finished in 602.456794 seconds
[2024-03-07 16:06:51,139] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization
I have the same issue, following the same instructions as Karjhan.
Windows 11, AMD CPU, AMD 6700 XT. Using olive-ai 0.5.0.
Olive logs:

Optimizing llama_v2
[2024-03-07 19:56:03,993] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 19:56:03,993] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 19:56:03,993] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 19:56:04,026] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
[2024-03-07 19:56:04,034] [INFO] [engine.py:896:_run_pass] Loaded model from cache: 1_OnnxConversion-9c3612d31e59051b1903b377f456134d-dc5fbbbe422d406cc8fcef71d99251a4 from cache\runs
[2024-03-07 19:56:04,034] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization
Sometimes, I get the following message, as well:
[2024-03-07 19:36:13,073] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
@PatriceVignola Could you help take a look?
These are classic OOM symptoms when running the script. Unfortunately, the ORT optimizer that Olive uses needs far more memory than it should, which results in those OOM crashes without error messages.
Usually, I would recommend between 200 GB and 300 GB of RAM (which can include your pagefile). On my machine, I have 128 GB of RAM and a pagefile of about 150 GB, and it takes around 30 minutes to go through the conversion and optimization process. It can also be done with less physical memory (and a bigger pagefile), but it might take longer.
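The rule of thumb above can be turned into a quick back-of-the-envelope check before kicking off the long-running script. This is only a sketch with illustrative numbers; `has_enough_memory` is a hypothetical helper, not part of Olive, and the 200 GB figure is the estimate from the comment above, not a measured requirement.

```python
# Hypothetical helper (not part of Olive): check whether physical RAM plus
# pagefile meets the ~200-300 GB the ORT optimizer reportedly needs while
# converting and optimizing Llama 2 7B.

def has_enough_memory(ram_gb: float, pagefile_gb: float,
                      required_gb: float = 200.0) -> bool:
    """True when RAM + pagefile covers the estimated peak working set."""
    return ram_gb + pagefile_gb >= required_gb

# The setup described above: 128 GB RAM + ~150 GB pagefile is enough.
print(has_enough_memory(128, 150))  # True
# A typical 32 GB desktop without an enlarged pagefile is not.
print(has_enough_memory(32, 32))    # False
```

If the check fails, enlarging the Windows pagefile is the cheapest fix; as noted above, it just makes the run slower.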
That makes sense, thank you. I don't have that much memory + pagefile available. I'll keep an eye on the repository for any future updates that reduce the required memory.
Tested on two different state-of-the-art systems, and I am stuck on this error:
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.
I wonder if this is the cause:
[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.
LOG:
C:\Olive\examples\directml\llama_v2>python llama_v2.py
Optimizing argmax_sampling
[2024-04-11 09:06:46,186] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:46,186] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:46,186] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:46,197] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:46,197] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:46,202] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
[2024-04-11 09:06:47,720] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1.518627 seconds
[2024-04-11 09:06:47,727] [INFO] [engine.py:851:_run_passes] Run model evaluation for the final model...
[2024-04-11 09:06:48,244] [INFO] [engine.py:370:run_accelerator] Save footprint to footprints\argmax_sampling_gpu-dml_footprints.json.
[2024-04-11 09:06:48,244] [INFO] [engine.py:288:run] Run history for gpu-dml:
[2024-04-11 09:06:48,244] [INFO] [engine.py:576:dump_run_history] run history:
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+------------------------+
| model_id                                                                           | parent_model_id                  | from_pass      | duration_sec   | metrics                |
+====================================================================================+==================================+================+================+========================+
| 14dc7b7c3125d3ad1222f0b9e2e5b807                                                   |                                  |                |                |                        |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+------------------------+
| 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-8183e3a10c90bb4a9507d579143be30e | 14dc7b7c3125d3ad1222f0b9e2e5b807 | OnnxConversion | 1.51863        | "latency-avg": 0.4114  |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+------------------------+
[2024-04-11 09:06:48,244] [INFO] [engine.py:303:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-8183e3a10c90bb4a9507d579143be30e\output_model\model.onnx
Optimizing llama_v2
[2024-04-11 09:06:48,260] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:48,260] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OptimumMerging
[2024-04-11 09:06:48,260] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:48,260] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:48,292] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
[2024-04-11 09:39:36,710] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1968.418042 seconds
[2024-04-11 09:39:36,757] [INFO] [engine.py:873:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-11 10:11:58,085] [INFO] [engine.py:960:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1941.281035 seconds
[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.
[2024-04-11 10:17:38,660] [ERROR] [engine.py:955:_run_pass] Pass run failed.
Traceback (most recent call last):
File "C:\Olive\olive\engine\engine.py", line 943, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\systems\local.py", line 31, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\passes\olive_pass.py", line 216, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\passes\onnx\optimum_merging.py", line 85, in _run_for_config
onnxruntime.InferenceSession(output_model_path, sess_options, providers=[execution_provider])
File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 472, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from
C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.
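For context on why an oversized merged model can end up unparseable: ONNX serializes a model as a single protobuf message, which is capped at 2 GB, so any graph larger than that can only be saved correctly with its weights stored as external data files next to the .onnx file. A minimal sketch of that size decision follows; `needs_external_data` is a hypothetical helper for illustration, not Olive's or ONNX's actual code.

```python
# Sketch (hypothetical helper): protobuf caps a single serialized message
# at 2 GiB, so models above that size must keep their tensors in external
# data files referenced from the .onnx graph.

PROTOBUF_LIMIT = 2 * 1024 ** 3  # 2 GiB single-message cap

def needs_external_data(model_size_bytes: int) -> bool:
    """True when the model cannot be saved as one self-contained .onnx file."""
    return model_size_bytes >= PROTOBUF_LIMIT

# A merged llama_v2 decoder of ~13 GB clearly exceeds the cap:
print(needs_external_data(13 * 1024 ** 3))   # True
# A 500 MB model fits in a single file:
print(needs_external_data(500 * 1024 ** 2))  # False
```

If the merge step writes a >2 GB model without external data, the resulting file is not valid protobuf, which would match the INVALID_PROTOBUF failure when the InferenceSession tries to load it.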
I ran python llama_v2.py; the error message is here:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[2024-04-25 19:58:01,158] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-25 19:58:01,159] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------+-------------------+-------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+==================================+===================+=============+================+===========+
| 9c3612d31e59051b1903b377f456134d | | | | |
+----------------------------------+-------------------+-------------+----------------+-----------+
[2024-04-25 19:58:01,159] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
File "C:\Users\Administrator\olive\examples\directml\llama_v2\llama_v2.py", line 217, in
(C:\ProgramData\Anaconda3\envs\llama2_Optimize) C:\Users\Administrator\olive\examples\directml\llama_v2>
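That RuntimeError usually means the downloaded PyTorch checkpoint is truncated or corrupted: files written by torch.save are zip archives, and "failed finding central directory" indicates the end of the archive is missing. A quick stdlib sanity check before re-running the script is sketched below; `checkpoint_looks_valid` and the example path are illustrative, not part of the llama_v2 example.

```python
# Sketch: verify a PyTorch checkpoint is a readable zip archive before
# loading it. A truncated download fails this check.
import zipfile

def checkpoint_looks_valid(path: str) -> bool:
    """True if the file exists and has a valid zip structure."""
    try:
        return zipfile.is_zipfile(path)
    except OSError:
        return False

# Hypothetical usage:
# checkpoint_looks_valid(r"models\llama-2-7b-chat\pytorch_model.bin")
```

If the check fails, re-download the model weights and run the script again.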
Describe the bug
Hello. I was following the steps from this guide: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190
At the end of step 2, when I run the
python llama_v2.py --optimize
command, the run just stops at the same place each time I re-run it. It generates a model.onnx file in Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx
and there is no models/optimized/llama_v2
folder. As I understand it, these files should be there.
I don't get any errors, so it is difficult to tell whether something is wrong.
Could you please advise me on what I am doing wrong?
To Reproduce
Steps to reproduce the behavior.
Expected behavior
From the description, I expect this: "Once the script successfully completes, the optimized ONNX pipeline will be stored under models/optimized/llama_v2."
Olive config
Add Olive configurations here.
Olive logs

(llama2_Optimize) C:\Users\proxi\Olive\examples\directml\llama_v2>python llama_v2.py --model_type=7b-chat

Optimizing argmax_sampling
[2024-01-28 19:05:45,570] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,578] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
[2024-01-28 19:05:45,590] [INFO] [footprint.py:101:create_pareto_frontier] Output all 2 models
[2024-01-28 19:05:45,590] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto frontier points: 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4 { "latency-avg": 0.23591 }
[2024-01-28 19:05:45,591] [INFO] [engine.py:282:run] Run history for gpu-dml:
[2024-01-28 19:05:45,595] [INFO] [engine.py:557:dump_run_history] run history:
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+----------------------------+
| model_id                                                                           | parent_model_id                  | from_pass      | duration_sec   | metrics                    |
+====================================================================================+==================================+================+================+============================+
| 14dc7b7c3125d3ad1222f0b9e2e5b807                                                   |                                  |                |                |                            |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+----------------------------+
| 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4 | 14dc7b7c3125d3ad1222f0b9e2e5b807 | OnnxConversion | 0.135661       | { "latency-avg": 0.23591 } |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+----------------------------+
[2024-01-28 19:05:45,595] [INFO] [engine.py:296:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Users\proxi\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx

Optimizing llama_v2
[2024-01-28 19:05:45,607] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,635] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
Other information
Additional context
I have an AMD video card, so I was looking for ways to run Llama 2 on an AMD GPU. I found this guide and was following the steps: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190