microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

I don't have the models/optimized/llama_v2 folder after I've run python llama_v2.py --optimize #905

KarpovVolodymyr commented 8 months ago

Describe the bug: Hello. I was following the steps from this guide: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190

At the end of step 2, when I run the python llama_v2.py --optimize command, the run just stops, at the same place each time I re-run it. It generates a model.onnx file in Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx, but there is no models/optimized/llama_v2 folder.
As I understand it, these files should be there (screenshot attached: 2024-01-28 193022).

I don't get any errors, so it is difficult to tell whether something is wrong or not.

Could you please give me some advice on what I am doing wrong?


Expected behavior: From the description, I expect this: "Once the script successfully completes, the optimized ONNX pipeline will be stored under models/optimized/llama_v2."


Olive logs:
(llama2_Optimize) C:\Users\proxi\Olive\examples\directml\llama_v2>python llama_v2.py --model_type=7b-chat

Optimizing argmax_sampling
[2024-01-28 19:05:45,570] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,578] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
[2024-01-28 19:05:45,590] [INFO] [footprint.py:101:create_pareto_frontier] Output all 2 models
[2024-01-28 19:05:45,590] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto frontier points: 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4 { "latency-avg": 0.23591 }
[2024-01-28 19:05:45,591] [INFO] [engine.py:282:run] Run history for gpu-dml:
[2024-01-28 19:05:45,595] [INFO] [engine.py:557:dump_run_history] run history:
| model_id                                                                            | parent_model_id                  | from_pass      | duration_sec | metrics                    |
| 14dc7b7c3125d3ad1222f0b9e2e5b807                                                    |                                  |                |              |                            |
| 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4  | 14dc7b7c3125d3ad1222f0b9e2e5b807 | OnnxConversion | 0.135661     | { "latency-avg": 0.23591 } |
[2024-01-28 19:05:45,595] [INFO] [engine.py:296:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Users\proxi\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx

Optimizing llama_v2
[2024-01-28 19:05:45,607] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,635] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(

Other information

Additional context: I have an AMD video card, so I was looking for a way to run Llama 2 with an AMD GPU. I found this guide and followed its steps: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190
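(A minimal sketch for checking whether the final artifacts were actually produced, assuming the default layout the example uses: footprints\ next to llama_v2.py, the final pipeline under models\optimized\llama_v2, and intermediates under cache\models. The example directory path would need adjusting to the local checkout.)

```python
from pathlib import Path

# Assumed default layout of the llama_v2 example; adjust example_dir as needed.
example_dir = Path(r"C:\Users\proxi\Olive\examples\directml\llama_v2")
footprints_file = example_dir / "footprints" / "llama_v2_gpu-dml_footprints.json"
optimized_dir = example_dir / "models" / "optimized" / "llama_v2"

# llama_v2.py reads the footprints JSON to locate the optimized model, so if
# this file is missing the optimization workflow never reached the end.
print("footprints written:      ", footprints_file.exists())
print("optimized folder present:", optimized_dir.exists())

# List whatever intermediate models the Olive cache does contain.
for entry in sorted((example_dir / "cache" / "models").glob("*")):
    print("cached model:", entry.name)
```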

trajepl commented 8 months ago

(screenshot attached) Check this; it seems you did not run the optimization successfully. In your case, only argmax_sampling got optimized and llama_v2 did not, right?

From the log, I cannot tell why the optimization failed. Could you please attach the complete log?
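(To capture a complete log, redirecting console output works, e.g. python llama_v2.py --optimize > full_run.log 2>&1. Alternatively, a minimal sketch that persists all log records to a file, assuming Olive emits them through Python's standard logging module, which the [INFO] [engine.py:...] lines above suggest:)

```python
# Sketch: write every log record (DEBUG and up) to a file so the complete run
# can be attached to the issue. Place this at the top of llama_v2.py, before
# the Olive workflow starts. Assumes Olive logs via the standard logging module.
import logging

handler = logging.FileHandler("olive_full_run.log", mode="w", encoding="utf-8")
handler.setFormatter(logging.Formatter(
    "[%(asctime)s] [%(levelname)s] [%(filename)s:%(lineno)d:%(funcName)s] %(message)s"))

root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG)
root_logger.addHandler(handler)
```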

Karjhan commented 7 months ago

I also have an AMD GPU, and I have a similar issue with the same steps: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190 Was this issue solved? If yes, how?

My Olive logs:
Optimizing llama_v2
[2024-03-07 15:56:48,646] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 15:56:48,647] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 15:56:48,647] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 15:56:48,669] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
[2024-03-07 16:02:23,506] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,058] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,127] [INFO] [engine.py:952:_run_pass] Pass convert:OnnxConversion finished in 602.456794 seconds
[2024-03-07 16:06:51,139] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization

jamesalster commented 7 months ago

I have the same issue, following the same instructions as Karjhan.

Windows 11, AMD CPU, AMD 6700 XT. Using olive-ai-0.50.

Olive logs:
Optimizing llama_v2
[2024-03-07 19:56:03,993] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 19:56:03,993] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 19:56:03,993] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 19:56:04,026] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
[2024-03-07 19:56:04,034] [INFO] [engine.py:896:_run_pass] Loaded model from cache: 1_OnnxConversion-9c3612d31e59051b1903b377f456134d-dc5fbbbe422d406cc8fcef71d99251a4 from cache\runs
[2024-03-07 19:56:04,034] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization

Sometimes, I get the following message, as well:

[2024-03-07 19:36:13,073] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.

trajepl commented 7 months ago

@PatriceVignola Could you help take a look?

PatriceVignola commented 7 months ago

These are classic OOM symptoms when running the script. Unfortunately, the ORT optimizer that Olive uses needs way more memory than it should, which results in those OOM crashes without error messages.

Usually, I would recommend between 200 GB and 300 GB of RAM (which can include your pagefile). On my machine, I have 128 GB of RAM and a pagefile of about 150 GB, and it takes around 30 minutes to go through the conversion and optimization process. It can also be done with less physical memory (and a bigger pagefile), but it might take longer.
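(As a rough pre-flight check against the numbers above, the sketch below reports physical RAM plus pagefile; it uses the third-party psutil package, which is not part of the example's requirements.)

```python
# Rough check of physical RAM + pagefile against the ~200-300 GB ballpark
# mentioned above. psutil is a third-party package: pip install psutil
import psutil

GIB = 1024 ** 3
ram_gib = psutil.virtual_memory().total / GIB
swap_gib = psutil.swap_memory().total / GIB  # on Windows this reflects the pagefile

print(f"physical RAM : {ram_gib:7.1f} GiB")
print(f"pagefile/swap: {swap_gib:7.1f} GiB")
print(f"combined     : {ram_gib + swap_gib:7.1f} GiB")

if ram_gib + swap_gib < 200:
    print("Below the ~200-300 GB suggested for the optimization pass; "
          "consider enlarging the pagefile before running llama_v2.py --optimize.")
```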

jamesalster commented 7 months ago

That makes sense, thank you; I don't have that much memory plus pagefile available. I'll keep an eye on the repository for any future updates that reduce the memory required.

andresmejia10-int commented 5 months ago

Tested on two different state-of-the-art systems, and I am stuck on this error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.

I wonder if this is the cause?

[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.

LOG:

C:\Olive\examples\directml\llama_v2>python llama_v2.py

Optimizing argmax_sampling
[2024-04-11 09:06:46,186] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:46,186] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:46,186] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:46,197] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:46,197] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:46,202] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
[2024-04-11 09:06:47,720] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1.518627 seconds
[2024-04-11 09:06:47,727] [INFO] [engine.py:851:_run_passes] Run model evaluation for the final model...
[2024-04-11 09:06:48,244] [INFO] [engine.py:370:run_accelerator] Save footprint to footprints\argmax_sampling_gpu-dml_footprints.json.
[2024-04-11 09:06:48,244] [INFO] [engine.py:288:run] Run history for gpu-dml:
[2024-04-11 09:06:48,244] [INFO] [engine.py:576:dump_run_history] run history:
| model_id                  | parent_model_id | from_pass      | duration_sec | metrics               |
| 14dc                      |                 |                |              |                       |
| 0_OnnxConversion-14dc-818 | 14dc            | OnnxConversion | 1.51863      | "latency-avg": 0.4114 |
[2024-04-11 09:06:48,244] [INFO] [engine.py:303:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-8183e3a10c90bb4a9507d579143be30e\output_model\model.onnx

Optimizing llama_v2
[2024-04-11 09:06:48,260] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:48,260] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OptimumMerging
[2024-04-11 09:06:48,260] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:48,260] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:48,292] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
  _C._jit_pass_onnx_graph_shape_type_inference(
[2024-04-11 09:39:36,710] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1968.418042 seconds
[2024-04-11 09:39:36,757] [INFO] [engine.py:873:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-11 10:11:58,085] [INFO] [engine.py:960:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1941.281035 seconds

[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.
[2024-04-11 10:17:38,660] [ERROR] [engine.py:955:_run_pass] Pass run failed.
Traceback (most recent call last):
  File "C:\Olive\olive\engine\engine.py", line 943, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
  File "C:\Olive\olive\systems\local.py", line 31, in run_pass
    output_model = the_pass.run(model, data_root, output_model_path, point)
  File "C:\Olive\olive\passes\olive_pass.py", line 216, in run
    output_model = self._run_for_config(model, data_root, config, output_model_path)
  File "C:\Olive\olive\passes\onnx\optimum_merging.py", line 85, in _run_for_config
    onnxruntime.InferenceSession(output_model_path, sess_options, providers=[execution_provider])
  File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.
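(A minimal sketch for checking whether the merged model's protobuf itself is intact; the path is copied from the error above, and the onnx package is assumed to be installed in the same environment.)

```python
# Sketch: try to parse decoder_model_merged.onnx without pulling in its external
# data, which is roughly the first thing the failing InferenceSession call also
# has to do. A corrupted or truncated file fails to parse here too.
import onnx

model_path = (r"C:\Olive\examples\directml\llama_v2\cache\models"
              r"\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml"
              r"\output_model\decoder_model_merged.onnx")

try:
    model = onnx.load(model_path, load_external_data=False)
    print("protobuf parsed OK;", len(model.graph.node), "nodes")
except Exception as exc:
    print("failed to parse:", exc)
```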

josephyuzb commented 5 months ago

I ran python llama_v2.py and the error message is here:

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[2024-04-25 19:58:01,158] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-25 19:58:01,159] [INFO] [engine.py:567:dump_run_history] run history:
| model_id                         | parent_model_id | from_pass | duration_sec | metrics |
| 9c3612d31e59051b1903b377f456134d |                 |           |              |         |
[2024-04-25 19:58:01,159] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
  File "C:\Users\Administrator\olive\examples\directml\llama_v2\llama_v2.py", line 217, in <module>
    optimize(optimized_model_dir, args.model_type)
  File "C:\Users\Administrator\olive\examples\directml\llama_v2\llama_v2.py", line 76, in optimize
    with footprints_file_path.open("r") as footprint_file:
  File "C:\ProgramData\Anaconda3\envs\llama2_Optimize\lib\pathlib.py", line 1252, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "C:\ProgramData\Anaconda3\envs\llama2_Optimize\lib\pathlib.py", line 1120, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Administrator\olive\examples\directml\llama_v2\footprints\llama_v2_gpu-dml_footprints.json'

(C:\ProgramData\Anaconda3\envs\llama2_Optimize) C:\Users\Administrator\olive\examples\directml\llama_v2>
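(The "PytorchStreamReader failed reading zip archive: failed finding central directory" error at the top of that log usually means one of the PyTorch checkpoint files is truncated or corrupted, e.g. from an interrupted download. Below is a minimal sketch for finding which file fails to load; the weights directory is a placeholder and should point at the local Llama 2 checkpoint.)

```python
# Sketch: try to open each checkpoint shard so the corrupted one can be
# re-downloaded. weights_dir is a placeholder; loading a full shard also needs
# enough free RAM to hold it.
from pathlib import Path
import torch

weights_dir = Path(r"C:\path\to\llama-2-7b-chat")  # hypothetical location

for ckpt in sorted(weights_dir.glob("*.pth")) + sorted(weights_dir.glob("*.bin")):
    try:
        torch.load(ckpt, map_location="cpu")
        print(f"OK      : {ckpt.name}")
    except Exception as exc:
        print(f"CORRUPT : {ckpt.name} -> {exc}")
```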