microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

Olive workflow for mistral model optimization does not work #1075

Open jojo1899 opened 4 months ago

jojo1899 commented 4 months ago

Describe the bug: Following the instructions in examples/mistral does not result in a quantized ONNX model. After running the workflow, the output_model folder within the cache directory contains an ONNX model that is 27 GB on disk, and the models folder does not contain a quantized model.

To Reproduce: Follow the instructions in examples/mistral to run the optimization on CPU using: python mistral.py --optimize --config mistral_int4_optimize.json

Expected behavior: An output model that is around 3.5 GB in the models directory.

Olive config Available here

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "cpu",
                        "execution_providers": [
                            "CPUExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4"
    }
}

Olive logs

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-11 15:14:42,927] [INFO] [run.py:243:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 15:14:42,933] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OptimumConversion
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 15:14:42,935] [INFO] [run.py:196:run_engine] Importing pass module IncStaticQuantization
[2024-04-11 15:14:42,936] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-11 15:14:42,937] [INFO] [engine.py:262:run] Running Olive on accelerator: cpu-cpu
[2024-04-11 15:14:43,817] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.22s/it]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available variants are:

Additional context: It appears that the quantization is not being performed at all, so I am checking what the issue is.
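
A quick way to check whether the exported model was actually quantized is to count the operator types in the graph: 4-bit weight-only quantization typically replaces plain MatMul weights with ONNX Runtime contrib ops such as MatMulFpQ4 or MatMulNBits (the op names here are my assumption, not something the Olive docs confirm). A minimal sketch:

import onnx
from collections import Counter

# Hypothetical path; point this at the model produced by the workflow.
model = onnx.load("models/output_model/model.onnx", load_external_data=False)

# Count how often each op type appears in the graph.
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))

# If quantization ran, ops like "MatMulFpQ4" or "MatMulNBits" (assumed names)
# should show up; a graph with only plain "MatMul" was not quantized.
print({op: n for op, n in op_counts.items() if "MatMul" in op})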

guotuofeng commented 4 months ago

What's the full log? It seems the cache folder contains the converted model.

jojo1899 commented 4 months ago

That is the full log. I figured out there was some issue in optimizing the converted model, so I changed the mistral_int4_optimize.json config file by removing "optimize" from the pass_flows:

"pass_flows": [
        [
            "convert",
            "quantization"
        ]

When I ran the script again, it seemed to produce a quantized model, but it is only 1.45 GB on disk. I tried running the model using the CPUExecutionProvider and then the CUDAExecutionProvider, but it gives a runtime error:

Traceback (most recent call last):
  File "c:\onnxrt\mainonnx.py", line 42, in <module>
    sess = InferenceSession(hf_model_path + "/model.onnx",
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Deserialize tensor onnx::MatMul_10759_Q4G32 failed.tensorprotoutils.cc:904 onnxruntime::utils::GetExtDataFromTensorProto External initializer: onnx::MatMul_10759_Q4G32 offset: 3208609792 size to read: 8388608 given file_length: 1559232512 are out of bounds or can not be read in full.
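
One way to narrow this error down is to compare the external-data extents declared inside model.onnx against the actual size of the external data file; if a pass was killed partway through, the data file ends up shorter than the declared offsets. A rough sketch, assuming the external data sits next to model.onnx (the path below is hypothetical):

import os
import onnx

model_dir = "path/to/quantized_model"  # hypothetical: folder with model.onnx and its external data
model = onnx.load(os.path.join(model_dir, "model.onnx"), load_external_data=False)

for tensor in model.graph.initializer:
    if tensor.data_location != onnx.TensorProto.EXTERNAL:
        continue
    # external_data is a list of key/value entries: location, offset, length.
    info = {entry.key: entry.value for entry in tensor.external_data}
    data_file = os.path.join(model_dir, info["location"])
    end = int(info.get("offset", 0)) + int(info.get("length", 0))
    if end > os.path.getsize(data_file):
        print(f"{tensor.name}: needs bytes up to {end}, but {info['location']} "
              f"is only {os.path.getsize(data_file)} bytes long")
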
guotuofeng commented 4 months ago

If the above is the full log, I am guessing you hit an out-of-memory condition while optimizing the converted ONNX model. The OOM will cause the OS to kill the Python process.

Could you try a machine with more memory and retry?

jojo1899 commented 4 months ago

Yes, I am on it. The quantization took around 3.5 hours on my Intel i9-13980HX CPU, so testing mistral_int4_optimize.json on different systems is time-consuming. How can mistral_fp16_optimize.json be modified so that I can try INT4 GPTQ quantization on my GPU with the CUDAExecutionProvider?

guotuofeng commented 4 months ago

Would you try changing the accelerators as in https://github.com/microsoft/Olive/blob/main/examples/mistral/mistral_fp16_optimize.json#L15-L21?
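
The change being suggested would look roughly like this in the "systems" section of the config (a sketch mirroring the CPU config above, not a verified recipe):

"systems": {
    "local_system": {
        "type": "LocalSystem",
        "config": {
            "accelerators": [
                {
                    "device": "gpu",
                    "execution_providers": [
                        "CUDAExecutionProvider"
                    ]
                }
            ]
        }
    }
}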

guotuofeng commented 4 months ago

for more info, please refer to https://microsoft.github.io/Olive/tutorials/configure_systems.html

jojo1899 commented 4 months ago

Thank you @guotuofeng. I can confirm that the examples mistral_int4_optimize.json and mistral_fp16_optimize.json work.

If anyone faces similar issues, make sure that you have sufficient disk space (around 100 GB or more). The disk space seemed to be a bottleneck for me and not the RAM. I tested it on two computers with 64 GB RAM and it worked well. Here are some details for the mistral_int4_optimize.json workflow:

  1. The mistral_int4_optimize.json workflow took me around 3.5 hours to run on high-end CPUs.
  2. The quantized model is 4.76 GB on disk.

I faced some other issues such as the resulting quantized model's responses being very poor and the CUDAExecutionProvider not working with a recent Nvidia SUPER graphics card that I am using. I will try to fix them and get back if needed.
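
On the disk-space point above, a quick pre-flight check like the following can catch the problem before a multi-hour run (the 100 GB threshold is just what worked for me):

import shutil

# Check free space on the drive holding the Olive cache/output directories.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < 100:
    print("Warning: the mistral workflow needed roughly 100 GB of scratch space for me.")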

jojo1899 commented 4 months ago

Using mistral.py, we can carry out inference using the CUDAExecutionProvider or on the CPU. How can we perform inference on the GPU using DmlExecutionProvider?

onnxruntime-genai seemed to be an option, but it does not yet have support for DmlExecutionProvider.

guotuofeng commented 4 months ago

Could you try https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-115 with the backend onnxrt_dml_ep? I am not sure whether the INT4 quantization works against DML or not.

jojo1899 commented 4 months ago

I suppose you meant onnxrt_dml_ep and not onnxrt_dnnl_ep. Anyway, I tried both.

TRIAL 1: I updated mistral_int4_optimize.json as follows, adding "backend": "onnxrt_dnnl_ep" to IncStaticQuantization. While running the workflow, it warned: Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'. The workflow finished relatively quickly, and the resulting 'quantized' model is 27 GB on disk. Olive configuration:

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": [
                            "DmlExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "backend": "onnxrt_dnnl_ep",
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4_dml"
    }
}

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 11:07:03,572] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 11:07:03,588] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 11:07:03,588] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 11:07:03,588] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 11:07:04,825] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 11:35:59,179] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1734.338259 seconds
[2024-04-18 11:35:59,187] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 11:36:05,418] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 11:37:22 [INFO] Start auto tuning.
2024-04-18 11:37:22 [INFO] Quantize model without tuning!
2024-04-18 11:37:22 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 11:37:22 [INFO] Adaptor has 5 recipes.
2024-04-18 11:37:22 [INFO] 0 recipes specified by user.
2024-04-18 11:37:22 [INFO] 3 recipes require future tuning.
2024-04-18 11:37:22 [WARNING] Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'
2024-04-18 11:37:22 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
2024-04-18 11:37:22 [INFO] {
Traceback (most recent call last):
2024-04-18 11:37:22 [INFO]     'PostTrainingQuantConfig': {
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 11:37:22 [INFO]         'AccuracyCriterion': {
2024-04-18 11:37:22 [INFO]             'criterion': 'relative',
2024-04-18 11:37:22 [INFO]             'higher_is_better': True,
2024-04-18 11:37:22 [INFO]             'tolerable_loss': 0.01,
2024-04-18 11:37:22 [INFO]             'absolute': None,
2024-04-18 11:37:22 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001A64EFA7550>>,
2024-04-18 11:37:22 [INFO]             'relative': 0.01
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 11:37:22 [INFO]         'backend': 'onnxrt_dnnl_ep',
2024-04-18 11:37:22 [INFO]         'calibration_sampling_size': [
2024-04-18 11:37:22 [INFO]             8
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'device': 'cpu',
2024-04-18 11:37:22 [INFO]         'diagnosis': False,
2024-04-18 11:37:22 [INFO]         'domain': 'auto',
2024-04-18 11:37:22 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 11:37:22 [INFO]         'excluded_precisions': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'framework': 'onnxruntime',
2024-04-18 11:37:22 [INFO]         'inputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'model_name': '',
2024-04-18 11:37:22 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 11:37:22 [INFO]         'op_name_dict': None,
2024-04-18 11:37:22 [INFO]         'op_type_dict': {
2024-04-18 11:37:22 [INFO]             '.*': {
2024-04-18 11:37:22 [INFO]                 'weight': {
2024-04-18 11:37:22 [INFO]                     'bits': [
2024-04-18 11:37:22 [INFO]                         4
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'group_size': [
2024-04-18 11:37:22 [INFO]                         32
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'scheme': [
2024-04-18 11:37:22 [INFO]                         'asym'
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'algorithm': [
2024-04-18 11:37:22 [INFO]                         'GPTQ'
2024-04-18 11:37:22 [INFO]                     ]
2024-04-18 11:37:22 [INFO]                 }
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'outputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'quant_format': 'QOperator',
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 11:37:22 [INFO]         'quant_level': 'auto',
2024-04-18 11:37:22 [INFO]         'recipes': {
2024-04-18 11:37:22 [INFO]             'smooth_quant': False,
2024-04-18 11:37:22 [INFO]             'smooth_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'layer_wise_quant': False,
2024-04-18 11:37:22 [INFO]             'layer_wise_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'fast_bias_correction': False,
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 11:37:22 [INFO]             'weight_correction': False,
2024-04-18 11:37:22 [INFO]             'gemm_to_matmul': True,
2024-04-18 11:37:22 [INFO]             'graph_optimization_level': None,
2024-04-18 11:37:22 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'pre_post_process_quantization': True,
2024-04-18 11:37:22 [INFO]             'add_qdq_pair_to_weight': False,
    signaled = self._cond.wait(timeout)
2024-04-18 11:37:22 [INFO]             'optypes_to_exclude_output_quant': [
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 11:37:22 [INFO]             'rtn_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'awq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'gptq_args': {
2024-04-18 11:37:22 [INFO]                 'accuracy_level': 0
    gotit = waiter.acquire(True, timeout)
2024-04-18 11:37:22 [INFO]             },
OverflowError: timeout value is too large
2024-04-18 11:37:22 [INFO]             'teq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'autoround_args': {
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'reduce_range': False,
2024-04-18 11:37:22 [INFO]         'TuningCriterion': {
2024-04-18 11:37:22 [INFO]             'max_trials': 100,
2024-04-18 11:37:22 [INFO]             'objective': [
2024-04-18 11:37:22 [INFO]                 'performance'
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'strategy': 'basic',
2024-04-18 11:37:22 [INFO]             'strategy_kwargs': None,
2024-04-18 11:37:22 [INFO]             'timeout': 0
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'use_bf16': True
2024-04-18 11:37:22 [INFO]     }
2024-04-18 11:37:22 [INFO] }
2024-04-18 11:37:22 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 11:37:22 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 11:37:22 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'DnnlExecutionProvider' is not in available provider names.Available providers: 'DmlExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2024-04-18 11:38:05 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 11:38:05 [INFO] Quantize the model with default config.
2024-04-18 11:38:07 [INFO] |******Mixed Precision Statistics******|
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] |       Op Type       |     Total      |
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] Pass quantize model elapsed time: 1917.88 ms
2024-04-18 11:38:07 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\./history.snapshot.
2024-04-18 11:38:07 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 11:38:07 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 11:38:07 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\deploy.yaml
[2024-04-18 11:38:28,125] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 148.929999 seconds
[2024-04-18 11:38:28,125] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-18 11:39:55,095] [INFO] [engine.py:361:run_accelerator] Save footprint to models\mistral_int4_dml_gpu-dml_footprints.json.
[2024-04-18 11:39:55,099] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 11:39:55,117] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics                    |
+========================================================================================+========================================================================================+=============================+================+============================+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 22_IncStaticQuantization-21-d038bb1662e6b6fb8eec0b99098940cb-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        148.93  | {                          |
|                                                                                        |                                                                                        |                             |                |   "latency-avg": 520.68174 |
|                                                                                        |                                                                                        |                             |                | }                          |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
[2024-04-18 11:39:55,119] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts

TRIAL 2: I changed "backend": "onnxrt_dnnl_ep" to "backend": "onnxrt_dml_ep" and ran the workflow again. It resulted in a few warnings and errors; the noteworthy ones in the log below are the backend being reset to 'npu' and the 'invalid unordered_map<K, T> key' exception during evaluation.

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 12:18:02,387] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 12:18:02,406] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 12:18:02,406] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 12:18:02,406] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 12:18:07,192] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 12:19:06 [INFO] Start auto tuning.
2024-04-18 12:19:06 [INFO] Quantize model without tuning!
2024-04-18 12:19:06 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 12:19:06 [INFO] Adaptor has 5 recipes.
2024-04-18 12:19:06 [INFO] 0 recipes specified by user.
2024-04-18 12:19:06 [INFO] 3 recipes require future tuning.
2024-04-18 12:19:06 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-04-18 12:19:06 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 12:19:06 [INFO] {
2024-04-18 12:19:06 [INFO]     'PostTrainingQuantConfig': {
2024-04-18 12:19:06 [INFO]         'AccuracyCriterion': {
2024-04-18 12:19:06 [INFO]             'criterion': 'relative',
2024-04-18 12:19:06 [INFO]             'higher_is_better': True,
2024-04-18 12:19:06 [INFO]             'tolerable_loss': 0.01,
2024-04-18 12:19:06 [INFO]             'absolute': None,
2024-04-18 12:19:06 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001E3B51BBF70>>,
2024-04-18 12:19:06 [INFO]             'relative': 0.01
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 12:19:06 [INFO]         'backend': 'onnxrt_dml_ep',
2024-04-18 12:19:06 [INFO]         'calibration_sampling_size': [
2024-04-18 12:19:06 [INFO]             8
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'device': 'cpu',
2024-04-18 12:19:06 [INFO]         'diagnosis': False,
2024-04-18 12:19:06 [INFO]         'domain': 'auto',
2024-04-18 12:19:06 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 12:19:06 [INFO]         'excluded_precisions': [
2024-04-18 12:19:06 [INFO]         ],
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 12:19:06 [INFO]         'framework': 'onnxruntime',
2024-04-18 12:19:06 [INFO]         'inputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'model_name': '',
2024-04-18 12:19:06 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 12:19:06 [INFO]         'op_name_dict': None,
2024-04-18 12:19:06 [INFO]         'op_type_dict': {
2024-04-18 12:19:06 [INFO]             '.*': {
2024-04-18 12:19:06 [INFO]                 'weight': {
    signaled = self._cond.wait(timeout)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 12:19:06 [INFO]                     'bits': [
2024-04-18 12:19:06 [INFO]                         4
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'group_size': [
2024-04-18 12:19:06 [INFO]                         32
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'scheme': [
2024-04-18 12:19:06 [INFO]                         'asym'
    gotit = waiter.acquire(True, timeout)
2024-04-18 12:19:06 [INFO]                     ],
OverflowError: timeout value is too large
2024-04-18 12:19:06 [INFO]                     'algorithm': [
2024-04-18 12:19:06 [INFO]                         'GPTQ'
2024-04-18 12:19:06 [INFO]                     ]
2024-04-18 12:19:06 [INFO]                 }
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'outputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'quant_format': 'QOperator',
2024-04-18 12:19:06 [INFO]         'quant_level': 'auto',
2024-04-18 12:19:06 [INFO]         'recipes': {
2024-04-18 12:19:06 [INFO]             'smooth_quant': False,
2024-04-18 12:19:06 [INFO]             'smooth_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'layer_wise_quant': False,
2024-04-18 12:19:06 [INFO]             'layer_wise_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'fast_bias_correction': False,
2024-04-18 12:19:06 [INFO]             'weight_correction': False,
2024-04-18 12:19:06 [INFO]             'gemm_to_matmul': True,
2024-04-18 12:19:06 [INFO]             'graph_optimization_level': None,
2024-04-18 12:19:06 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'pre_post_process_quantization': True,
2024-04-18 12:19:06 [INFO]             'add_qdq_pair_to_weight': False,
2024-04-18 12:19:06 [INFO]             'optypes_to_exclude_output_quant': [
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 12:19:06 [INFO]             'rtn_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'awq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'gptq_args': {
2024-04-18 12:19:06 [INFO]                 'accuracy_level': 0
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'teq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'autoround_args': {
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'reduce_range': False,
2024-04-18 12:19:06 [INFO]         'TuningCriterion': {
2024-04-18 12:19:06 [INFO]             'max_trials': 100,
2024-04-18 12:19:06 [INFO]             'objective': [
2024-04-18 12:19:06 [INFO]                 'performance'
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'strategy': 'basic',
2024-04-18 12:19:06 [INFO]             'strategy_kwargs': None,
2024-04-18 12:19:06 [INFO]             'timeout': 0
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'use_bf16': True
2024-04-18 12:19:06 [INFO]     }
2024-04-18 12:19:06 [INFO] }
2024-04-18 12:19:06 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 12:19:06 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 12:19:06 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-04-18 12:19:40 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 12:19:40 [INFO] Quantize the model with default config.
2024-04-18 12:19:41 [INFO] |******Mixed Precision Statistics******|
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] |       Op Type       |     Total      |
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] Pass quantize model elapsed time: 843.77 ms
2024-04-18 12:19:41 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\./history.snapshot.
2024-04-18 12:19:41 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 12:19:41 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 12:19:41 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\deploy.yaml
[2024-04-18 12:20:01,362] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 117.265407 seconds
[2024-04-18 12:20:01,374] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
2024-04-18 12:20:02.4692837 [E:onnxruntime:, inference_session.cc:1997 onnxruntime::InferenceSession::Initialize::<lambda_80060d29f848598faaecbd5242ad430a>::operator ()] Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,468] [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Olive\olive\engine\engine.py", line 336, in run_accelerator
    output_footprint = self.run_no_search(
  File "C:\Olive\olive\engine\engine.py", line 428, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Olive\olive\engine\engine.py", line 843, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Olive\olive\engine\engine.py", line 1041, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Olive\olive\systems\local.py", line 46, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 214, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 132, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 767, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 540, in _evaluate_onnx_latency
    session, inference_settings = OnnxEvaluator.get_session_wrapper(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 435, in get_session_wrapper
    session = model.prepare_session(
  File "C:\Olive\olive\model\handler\onnx.py", line 114, in prepare_session
    return get_ort_inference_session(
  File "C:\Olive\olive\common\ort_inference.py", line 118, in get_ort_inference_session
    session = ort.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,515] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 12:20:02,531] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics   |
+========================================================================================+========================================================================================+=============================+================+===========+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 23_IncStaticQuantization-21-b76f1bb364ef9dc8aca22db9c5b3ee30-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        117.265 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-04-18 12:20:02,531] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
guotuofeng commented 4 months ago

Yes, I meant the DML EP. As for the error, we might need to ask the DML EP team. @PatriceVignola, do you have any insight into this error?

guotuofeng commented 4 months ago

https://github.com/microsoft/Olive/issues/852#issuecomment-1876528882

jojo1899 commented 4 months ago

@guotuofeng The following code snippet works like a charm with the INT4 model created using the scripts in examples/mistral

# Imports assumed for this snippet (not shown in the original post);
# hfmodelpath is the directory containing the exported ONNX model.
import time

import onnxruntime as ort
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "CUDAExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
model = model.to("cuda")
inputs = inputs.to('cuda')
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I simply want to use DmlExecutionProvider instead of CUDAExecutionProvider. I tried the following, but it results in an error.

# Same imports as above, plus the DirectML backend for PyTorch:
import torch_directml

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "DmlExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
device = torch_directml.device(0) 
model = model.to(device)
inputs = inputs.to(device)
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RuntimeError: Cannot access data pointer of Tensor that doesn't have storage

Do you know if I can fix this error, or is it not possible to use the DmlExecutionProvider in this case?
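
To isolate whether the failure comes from ONNX Runtime itself or from the torch_directml tensor hand-off, a minimal sketch that only creates the session on the DML EP and inspects its inputs (no generation) might help:

import onnxruntime as ort

# Check that the model loads under DmlExecutionProvider at all,
# independent of optimum/torch_directml.
sess = ort.InferenceSession(
    hfmodelpath + "/model.onnx",  # same path as in the snippets above
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

If this loads and reports DmlExecutionProvider as active, the error above is more likely coming from moving tensors around with torch_directml than from the model itself.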

guotuofeng commented 4 months ago

I am not sure; I haven't tried DML before since we don't have a DML GPU.

jojo1899 commented 4 months ago

@guotuofeng Thank you for the responses.

I am now trying out LLM Optimization with DirectML, which was updated yesterday.

guotuofeng commented 4 months ago

Actually, some ops in that example are still pending merge.

ShivamGoyal03 commented 2 months ago

[screenshot: Picture1]

  1. Configuration Load Error: OSError: Can't load the configuration of 'mistralai/Mistral-7B-v0.1'
     • Problem: The script is unable to find or load the configuration file for the Mistral-7B-v0.1 model. This could be due to an incorrect path or a missing configuration file (a quick way to check this is sketched after the list).

  2. Warning and Information Messages: several warning and info messages are present in the output. Addressing each of them:

     • Pandas Version Warning: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
       • Problem: The installed version of the bottleneck library is outdated.
       • Solution: Update the bottleneck package to version 1.3.6 or newer using pip install --upgrade bottleneck. I updated bottleneck to the latest version, but the issue still persists.

     • Framework Not Specified Warning: Framework not specified. Using pt to export the model.
       • Problem: The framework (PyTorch or TensorFlow) is not specified for the model export.
       • Solution: Explicitly specify the framework for exporting the model by setting the appropriate configuration parameter.
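
For the configuration-load error in item 1, it may also help to confirm outside of Olive that the Hugging Face configuration actually resolves (network access, or a huggingface-cli login if the repository is gated); a minimal check, purely as a sketch:

from transformers import AutoConfig

# If this raises the same OSError, the problem is model access/path, not Olive.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.model_type)  # expected: "mistral"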

I recommend updating the main codebase with these solutions. However, please let me know if there is a better way to run this. Thanks