microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

Olive workflow for mistral model optimization does not work #1075

Open jojo1899 opened 4 months ago

jojo1899 commented 4 months ago

Describe the bug: Following the instructions in examples/mistral does not result in a quantized ONNX model. After running the workflow, the output_model folder within the cache directory contains an ONNX model that is 27 GB on disk, and the models folder does not contain a quantized model.

To Reproduce: Follow the instructions in examples/mistral to run the optimization on CPU using: python mistral.py --optimize --config mistral_int4_optimize.json

Expected behavior: An output model that is around 3.5 GB in the models directory.

Olive config Available here

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "cpu",
                        "execution_providers": [
                            "CPUExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4"
    }
}

Olive logs

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-11 15:14:42,927] [INFO] [run.py:243:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 15:14:42,933] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OptimumConversion
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 15:14:42,935] [INFO] [run.py:196:run_engine] Importing pass module IncStaticQuantization
[2024-04-11 15:14:42,936] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-11 15:14:42,937] [INFO] [engine.py:262:run] Running Olive on accelerator: cpu-cpu
[2024-04-11 15:14:43,817] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.22s/it]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available variants are:

Additional context: It appears that the quantization is not being performed at all, so I am checking what the issue is.
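
A quick way to check whether the exported model was actually quantized is to count the operator types in the graph: 4-bit weight-only quantization typically replaces plain MatMul weights with ONNX Runtime contrib ops such as MatMulFpQ4 or MatMulNBits (the op names here are my assumption, not something the Olive docs confirm). A minimal sketch:

import onnx
from collections import Counter

# Hypothetical path; point this at the model produced by the workflow.
model = onnx.load("models/output_model/model.onnx", load_external_data=False)

# Count how often each op type appears in the graph.
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))

# If quantization ran, ops like "MatMulFpQ4" or "MatMulNBits" (assumed names)
# should show up; a graph with only plain "MatMul" was not quantized.
print({op: n for op, n in op_counts.items() if "MatMul" in op})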

guotuofeng commented 4 months ago

What's the full log? It seems the cache folder contains the converted model.

jojo1899 commented 4 months ago

That is the full log. I figured out there was some issue in optimizing the converted model, so I changed the mistral_int4_optimize.json config file by removing "optimize" from the pass_flows:

"pass_flows": [
        [
            "convert",
            "quantization"
        ]

When I ran the script again, it seemed to produce a quantized model, but it is only 1.45 GB on disk. I tried running the model using the CPUExecutionProvider and then the CUDAExecutionProvider, but it gives a runtime error:

Traceback (most recent call last):
  File "c:\onnxrt\mainonnx.py", line 42, in <module>
    sess = InferenceSession(hf_model_path + "/model.onnx",
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Deserialize tensor onnx::MatMul_10759_Q4G32 failed.tensorprotoutils.cc:904 onnxruntime::utils::GetExtDataFromTensorProto External initializer: onnx::MatMul_10759_Q4G32 offset: 3208609792 size to read: 8388608 given file_length: 1559232512 are out of bounds or can not be read in full.
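
One way to narrow this error down is to compare the external-data extents declared inside model.onnx against the actual size of the external data file; if a pass was killed partway through, the data file ends up shorter than the declared offsets. A rough sketch, assuming the external data sits next to model.onnx (the path below is hypothetical):

import os
import onnx

model_dir = "path/to/quantized_model"  # hypothetical: folder with model.onnx and its external data
model = onnx.load(os.path.join(model_dir, "model.onnx"), load_external_data=False)

for tensor in model.graph.initializer:
    if tensor.data_location != onnx.TensorProto.EXTERNAL:
        continue
    # external_data is a list of key/value entries: location, offset, length.
    info = {entry.key: entry.value for entry in tensor.external_data}
    data_file = os.path.join(model_dir, info["location"])
    end = int(info.get("offset", 0)) + int(info.get("length", 0))
    if end > os.path.getsize(data_file):
        print(f"{tensor.name}: needs bytes up to {end}, but {info['location']} "
              f"is only {os.path.getsize(data_file)} bytes long")
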
guotuofeng commented 4 months ago

If the above is the full log, I am guessing you hit an out-of-memory condition while optimizing the converted ONNX model. The OOM will cause the OS to kill the Python process.

Could you try a machine with more memory and retry?

jojo1899 commented 4 months ago

Yes, I am on it. The quantization took around 3.5 hours on my Intel i9-13980HX CPU, so testing mistral_int4_optimize.json on different systems is time-consuming. How can mistral_fp16_optimize.json be modified so that I can try INT4 GPTQ quantization on my GPU with the CUDAExecutionProvider?

guotuofeng commented 4 months ago

Would you try changing the accelerators as in https://github.com/microsoft/Olive/blob/main/examples/mistral/mistral_fp16_optimize.json#L15-L21?
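
The change being suggested would look roughly like this in the "systems" section of the config (a sketch mirroring the CPU config above, not a verified recipe):

"systems": {
    "local_system": {
        "type": "LocalSystem",
        "config": {
            "accelerators": [
                {
                    "device": "gpu",
                    "execution_providers": [
                        "CUDAExecutionProvider"
                    ]
                }
            ]
        }
    }
}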

guotuofeng commented 4 months ago

for more info, please refer to https://microsoft.github.io/Olive/tutorials/configure_systems.html

jojo1899 commented 4 months ago

Thank you @guotuofeng. I can confirm that the examples mistral_int4_optimize.json and mistral_fp16_optimize.json work.

If anyone faces similar issues, make sure that you have sufficient disk space (around 100 GB or more). The disk space seemed to be a bottleneck for me and not the RAM. I tested it on two computers with 64 GB RAM and it worked well. Here are some details for the mistral_int4_optimize.json workflow:

  1. The mistral_int4_optimize.json workflow took me around 3.5 hours to run on high-end CPUs.
  2. The quantized model is 4.76 GB on disk.

I faced some other issues such as the resulting quantized model's responses being very poor and the CUDAExecutionProvider not working with a recent Nvidia SUPER graphics card that I am using. I will try to fix them and get back if needed.
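
On the disk-space point above, a quick pre-flight check like the following can catch the problem before a multi-hour run (the 100 GB threshold is just what worked for me):

import shutil

# Check free space on the drive holding the Olive cache/output directories.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < 100:
    print("Warning: the mistral workflow needed roughly 100 GB of scratch space for me.")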

jojo1899 commented 4 months ago

Using mistral.py, we can carry out inference using the CUDAExecutionProvider or on the CPU. How can we perform inference on the GPU using DmlExecutionProvider?

onnxruntime-genai seemed to be an option, but it does not yet have support for DmlExecutionProvider.

guotuofeng commented 4 months ago

Could you try https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-115 with the backend onnxrt_dml_ep? I am not sure whether the INT4 quantization works against DML or not.

jojo1899 commented 4 months ago

I suppose you meant onnxrt_dml_ep and not onnxrt_dnnl_ep. Anyway, I tried both.

TRIAL 1: I updated mistral_int4_optimize.json as follows, adding "backend": "onnxrt_dnnl_ep" to IncStaticQuantization. While running the workflow, it warned: Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'. The workflow finished relatively quickly, and the resulting 'quantized' model is 27 GB on disk. Olive configuration:

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": [
                            "DmlExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "backend": "onnxrt_dnnl_ep",
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4_dml"
    }
}

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 11:07:03,572] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 11:07:03,588] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 11:07:03,588] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 11:07:03,588] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 11:07:04,825] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 11:35:59,179] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1734.338259 seconds
[2024-04-18 11:35:59,187] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 11:36:05,418] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 11:37:22 [INFO] Start auto tuning.
2024-04-18 11:37:22 [INFO] Quantize model without tuning!
2024-04-18 11:37:22 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 11:37:22 [INFO] Adaptor has 5 recipes.
2024-04-18 11:37:22 [INFO] 0 recipes specified by user.
2024-04-18 11:37:22 [INFO] 3 recipes require future tuning.
2024-04-18 11:37:22 [WARNING] Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'
2024-04-18 11:37:22 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
2024-04-18 11:37:22 [INFO] {
Traceback (most recent call last):
2024-04-18 11:37:22 [INFO]     'PostTrainingQuantConfig': {
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 11:37:22 [INFO]         'AccuracyCriterion': {
2024-04-18 11:37:22 [INFO]             'criterion': 'relative',
2024-04-18 11:37:22 [INFO]             'higher_is_better': True,
2024-04-18 11:37:22 [INFO]             'tolerable_loss': 0.01,
2024-04-18 11:37:22 [INFO]             'absolute': None,
2024-04-18 11:37:22 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001A64EFA7550>>,
2024-04-18 11:37:22 [INFO]             'relative': 0.01
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 11:37:22 [INFO]         'backend': 'onnxrt_dnnl_ep',
2024-04-18 11:37:22 [INFO]         'calibration_sampling_size': [
2024-04-18 11:37:22 [INFO]             8
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'device': 'cpu',
2024-04-18 11:37:22 [INFO]         'diagnosis': False,
2024-04-18 11:37:22 [INFO]         'domain': 'auto',
2024-04-18 11:37:22 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 11:37:22 [INFO]         'excluded_precisions': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'framework': 'onnxruntime',
2024-04-18 11:37:22 [INFO]         'inputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'model_name': '',
2024-04-18 11:37:22 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 11:37:22 [INFO]         'op_name_dict': None,
2024-04-18 11:37:22 [INFO]         'op_type_dict': {
2024-04-18 11:37:22 [INFO]             '.*': {
2024-04-18 11:37:22 [INFO]                 'weight': {
2024-04-18 11:37:22 [INFO]                     'bits': [
2024-04-18 11:37:22 [INFO]                         4
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'group_size': [
2024-04-18 11:37:22 [INFO]                         32
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'scheme': [
2024-04-18 11:37:22 [INFO]                         'asym'
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'algorithm': [
2024-04-18 11:37:22 [INFO]                         'GPTQ'
2024-04-18 11:37:22 [INFO]                     ]
2024-04-18 11:37:22 [INFO]                 }
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'outputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'quant_format': 'QOperator',
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 11:37:22 [INFO]         'quant_level': 'auto',
2024-04-18 11:37:22 [INFO]         'recipes': {
2024-04-18 11:37:22 [INFO]             'smooth_quant': False,
2024-04-18 11:37:22 [INFO]             'smooth_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'layer_wise_quant': False,
2024-04-18 11:37:22 [INFO]             'layer_wise_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'fast_bias_correction': False,
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 11:37:22 [INFO]             'weight_correction': False,
2024-04-18 11:37:22 [INFO]             'gemm_to_matmul': True,
2024-04-18 11:37:22 [INFO]             'graph_optimization_level': None,
2024-04-18 11:37:22 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'pre_post_process_quantization': True,
2024-04-18 11:37:22 [INFO]             'add_qdq_pair_to_weight': False,
    signaled = self._cond.wait(timeout)
2024-04-18 11:37:22 [INFO]             'optypes_to_exclude_output_quant': [
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 11:37:22 [INFO]             'rtn_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'awq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'gptq_args': {
2024-04-18 11:37:22 [INFO]                 'accuracy_level': 0
    gotit = waiter.acquire(True, timeout)
2024-04-18 11:37:22 [INFO]             },
OverflowError: timeout value is too large
2024-04-18 11:37:22 [INFO]             'teq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'autoround_args': {
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'reduce_range': False,
2024-04-18 11:37:22 [INFO]         'TuningCriterion': {
2024-04-18 11:37:22 [INFO]             'max_trials': 100,
2024-04-18 11:37:22 [INFO]             'objective': [
2024-04-18 11:37:22 [INFO]                 'performance'
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'strategy': 'basic',
2024-04-18 11:37:22 [INFO]             'strategy_kwargs': None,
2024-04-18 11:37:22 [INFO]             'timeout': 0
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'use_bf16': True
2024-04-18 11:37:22 [INFO]     }
2024-04-18 11:37:22 [INFO] }
2024-04-18 11:37:22 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 11:37:22 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 11:37:22 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'DnnlExecutionProvider' is not in available provider names.Available providers: 'DmlExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2024-04-18 11:38:05 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 11:38:05 [INFO] Quantize the model with default config.
2024-04-18 11:38:07 [INFO] |******Mixed Precision Statistics******|
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] |       Op Type       |     Total      |
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] Pass quantize model elapsed time: 1917.88 ms
2024-04-18 11:38:07 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\./history.snapshot.
2024-04-18 11:38:07 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 11:38:07 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 11:38:07 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\deploy.yaml
[2024-04-18 11:38:28,125] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 148.929999 seconds
[2024-04-18 11:38:28,125] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-18 11:39:55,095] [INFO] [engine.py:361:run_accelerator] Save footprint to models\mistral_int4_dml_gpu-dml_footprints.json.
[2024-04-18 11:39:55,099] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 11:39:55,117] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics                    |
+========================================================================================+========================================================================================+=============================+================+============================+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 22_IncStaticQuantization-21-d038bb1662e6b6fb8eec0b99098940cb-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        148.93  | {                          |
|                                                                                        |                                                                                        |                             |                |   "latency-avg": 520.68174 |
|                                                                                        |                                                                                        |                             |                | }                          |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
[2024-04-18 11:39:55,119] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts

TRIAL 2: I changed "backend": "onnxrt_dnnl_ep" to "backend": "onnxrt_dml_ep" and ran the workflow again. It resulted in a few warnings and errors; the noteworthy ones in the log below are the backend being reset to 'npu' and the 'invalid unordered_map<K, T> key' exception during evaluation.

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 12:18:02,387] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 12:18:02,406] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 12:18:02,406] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 12:18:02,406] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 12:18:07,192] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 12:19:06 [INFO] Start auto tuning.
2024-04-18 12:19:06 [INFO] Quantize model without tuning!
2024-04-18 12:19:06 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 12:19:06 [INFO] Adaptor has 5 recipes.
2024-04-18 12:19:06 [INFO] 0 recipes specified by user.
2024-04-18 12:19:06 [INFO] 3 recipes require future tuning.
2024-04-18 12:19:06 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-04-18 12:19:06 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 12:19:06 [INFO] {
2024-04-18 12:19:06 [INFO]     'PostTrainingQuantConfig': {
2024-04-18 12:19:06 [INFO]         'AccuracyCriterion': {
2024-04-18 12:19:06 [INFO]             'criterion': 'relative',
2024-04-18 12:19:06 [INFO]             'higher_is_better': True,
2024-04-18 12:19:06 [INFO]             'tolerable_loss': 0.01,
2024-04-18 12:19:06 [INFO]             'absolute': None,
2024-04-18 12:19:06 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001E3B51BBF70>>,
2024-04-18 12:19:06 [INFO]             'relative': 0.01
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 12:19:06 [INFO]         'backend': 'onnxrt_dml_ep',
2024-04-18 12:19:06 [INFO]         'calibration_sampling_size': [
2024-04-18 12:19:06 [INFO]             8
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'device': 'cpu',
2024-04-18 12:19:06 [INFO]         'diagnosis': False,
2024-04-18 12:19:06 [INFO]         'domain': 'auto',
2024-04-18 12:19:06 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 12:19:06 [INFO]         'excluded_precisions': [
2024-04-18 12:19:06 [INFO]         ],
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 12:19:06 [INFO]         'framework': 'onnxruntime',
2024-04-18 12:19:06 [INFO]         'inputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'model_name': '',
2024-04-18 12:19:06 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 12:19:06 [INFO]         'op_name_dict': None,
2024-04-18 12:19:06 [INFO]         'op_type_dict': {
2024-04-18 12:19:06 [INFO]             '.*': {
2024-04-18 12:19:06 [INFO]                 'weight': {
    signaled = self._cond.wait(timeout)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 12:19:06 [INFO]                     'bits': [
2024-04-18 12:19:06 [INFO]                         4
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'group_size': [
2024-04-18 12:19:06 [INFO]                         32
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'scheme': [
2024-04-18 12:19:06 [INFO]                         'asym'
    gotit = waiter.acquire(True, timeout)
2024-04-18 12:19:06 [INFO]                     ],
OverflowError: timeout value is too large
2024-04-18 12:19:06 [INFO]                     'algorithm': [
2024-04-18 12:19:06 [INFO]                         'GPTQ'
2024-04-18 12:19:06 [INFO]                     ]
2024-04-18 12:19:06 [INFO]                 }
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'outputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'quant_format': 'QOperator',
2024-04-18 12:19:06 [INFO]         'quant_level': 'auto',
2024-04-18 12:19:06 [INFO]         'recipes': {
2024-04-18 12:19:06 [INFO]             'smooth_quant': False,
2024-04-18 12:19:06 [INFO]             'smooth_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'layer_wise_quant': False,
2024-04-18 12:19:06 [INFO]             'layer_wise_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'fast_bias_correction': False,
2024-04-18 12:19:06 [INFO]             'weight_correction': False,
2024-04-18 12:19:06 [INFO]             'gemm_to_matmul': True,
2024-04-18 12:19:06 [INFO]             'graph_optimization_level': None,
2024-04-18 12:19:06 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'pre_post_process_quantization': True,
2024-04-18 12:19:06 [INFO]             'add_qdq_pair_to_weight': False,
2024-04-18 12:19:06 [INFO]             'optypes_to_exclude_output_quant': [
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 12:19:06 [INFO]             'rtn_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'awq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'gptq_args': {
2024-04-18 12:19:06 [INFO]                 'accuracy_level': 0
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'teq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'autoround_args': {
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'reduce_range': False,
2024-04-18 12:19:06 [INFO]         'TuningCriterion': {
2024-04-18 12:19:06 [INFO]             'max_trials': 100,
2024-04-18 12:19:06 [INFO]             'objective': [
2024-04-18 12:19:06 [INFO]                 'performance'
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'strategy': 'basic',
2024-04-18 12:19:06 [INFO]             'strategy_kwargs': None,
2024-04-18 12:19:06 [INFO]             'timeout': 0
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'use_bf16': True
2024-04-18 12:19:06 [INFO]     }
2024-04-18 12:19:06 [INFO] }
2024-04-18 12:19:06 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 12:19:06 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 12:19:06 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-04-18 12:19:40 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 12:19:40 [INFO] Quantize the model with default config.
2024-04-18 12:19:41 [INFO] |******Mixed Precision Statistics******|
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] |       Op Type       |     Total      |
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] Pass quantize model elapsed time: 843.77 ms
2024-04-18 12:19:41 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\./history.snapshot.
2024-04-18 12:19:41 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 12:19:41 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 12:19:41 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\deploy.yaml
[2024-04-18 12:20:01,362] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 117.265407 seconds
[2024-04-18 12:20:01,374] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
2024-04-18 12:20:02.4692837 [E:onnxruntime:, inference_session.cc:1997 onnxruntime::InferenceSession::Initialize::<lambda_80060d29f848598faaecbd5242ad430a>::operator ()] Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,468] [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Olive\olive\engine\engine.py", line 336, in run_accelerator
    output_footprint = self.run_no_search(
  File "C:\Olive\olive\engine\engine.py", line 428, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Olive\olive\engine\engine.py", line 843, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Olive\olive\engine\engine.py", line 1041, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Olive\olive\systems\local.py", line 46, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 214, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 132, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 767, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 540, in _evaluate_onnx_latency
    session, inference_settings = OnnxEvaluator.get_session_wrapper(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 435, in get_session_wrapper
    session = model.prepare_session(
  File "C:\Olive\olive\model\handler\onnx.py", line 114, in prepare_session
    return get_ort_inference_session(
  File "C:\Olive\olive\common\ort_inference.py", line 118, in get_ort_inference_session
    session = ort.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,515] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 12:20:02,531] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics   |
+========================================================================================+========================================================================================+=============================+================+===========+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 23_IncStaticQuantization-21-b76f1bb364ef9dc8aca22db9c5b3ee30-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        117.265 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-04-18 12:20:02,531] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
guotuofeng commented 4 months ago

Yes, I meant the DML EP. As for the error, we might need to ask the DML EP team. @PatriceVignola, do you have any insight into this error?

guotuofeng commented 4 months ago

https://github.com/microsoft/Olive/issues/852#issuecomment-1876528882

jojo1899 commented 4 months ago

@guotuofeng The following code snippet works like a charm with the INT4 model created using the scripts in examples/mistral

# Imports assumed for this snippet (not shown in the original post);
# hfmodelpath is the directory containing the exported ONNX model.
import time

import onnxruntime as ort
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "CUDAExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
model = model.to("cuda")
inputs = inputs.to('cuda')
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I simply want to use DmlExecutionProvider instead of CUDAExecutionProvider. I tried the following, but it results in an error.

# Same imports as above, plus the DirectML backend for PyTorch:
import torch_directml

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "DmlExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
device = torch_directml.device(0) 
model = model.to(device)
inputs = inputs.to(device)
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RuntimeError: Cannot access data pointer of Tensor that doesn't have storage

Do you know if I can fix this error, or is it not possible to use the DmlExecutionProvider in this case?
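
To isolate whether the failure comes from ONNX Runtime itself or from the torch_directml tensor hand-off, a minimal sketch that only creates the session on the DML EP and inspects its inputs (no generation) might help:

import onnxruntime as ort

# Check that the model loads under DmlExecutionProvider at all,
# independent of optimum/torch_directml.
sess = ort.InferenceSession(
    hfmodelpath + "/model.onnx",  # same path as in the snippets above
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

If this loads and reports DmlExecutionProvider as active, the error above is more likely coming from moving tensors around with torch_directml than from the model itself.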

guotuofeng commented 4 months ago

I am not sure; I haven't tried DML before since we don't have a DML GPU.

jojo1899 commented 4 months ago

@guotuofeng Thank you for the responses.

I am now trying out LLM Optimization with DirectML, which was updated yesterday.

guotuofeng commented 4 months ago

Actually, some ops in that example are still pending merge.

ShivamGoyal03 commented 2 months ago

[screenshot: Picture1]

  1. Configuration Load Error: OSError: Can't load the configuration of 'mistralai/Mistral-7B-v0.1'
     • Problem: The script is unable to find or load the configuration file for the Mistral-7B-v0.1 model. This could be due to an incorrect path or a missing configuration file (a quick way to check this is sketched after the list).

  2. Warning and Information Messages: several warning and info messages are present in the output. Addressing each of them:

     • Pandas Version Warning: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
       • Problem: The installed version of the bottleneck library is outdated.
       • Solution: Update the bottleneck package to version 1.3.6 or newer using pip install --upgrade bottleneck. I updated bottleneck to the latest version, but the issue still persists.

     • Framework Not Specified Warning: Framework not specified. Using pt to export the model.
       • Problem: The framework (PyTorch or TensorFlow) is not specified for the model export.
       • Solution: Explicitly specify the framework for exporting the model by setting the appropriate configuration parameter.
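
For the configuration-load error in item 1, it may also help to confirm outside of Olive that the Hugging Face configuration actually resolves (network access, or a huggingface-cli login if the repository is gated); a minimal check, purely as a sketch:

from transformers import AutoConfig

# If this raises the same OSError, the problem is model access/path, not Olive.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.model_type)  # expected: "mistral"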

I recommend updating the main codebase with these solutions. However, please let me know if there is a better way to run this. Thanks