openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

OpenVINO inference is running much slower than CPU when using integrated Intel(R) UHD Graphics 620 (iGPU) #26391

Open prashant-saxena opened 1 week ago

prashant-saxena commented 1 week ago

OpenVINO Version

openvino : 2024.3.0

Operating System

Windows System

Device used for inference

iGPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

codeformer

Model quantization

Yes

Target Platform

Available devices: CPU, GPU

CPU
IMMUTABLE PROPERTIES:
  AVAILABLE_DEVICES : ""
  RANGE_FOR_ASYNC_INFER_REQUESTS : 1 1 1
  RANGE_FOR_STREAMS : 1 8
  EXECUTION_DEVICES : CPU
  FULL_DEVICE_NAME : Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  OPTIMIZATION_CAPABILITIES : FP32 FP16 INT8 BIN EXPORT_IMPORT
  DEVICE_TYPE : integrated
  DEVICE_ARCHITECTURE : intel64
MUTABLE PROPERTIES:
  NUM_STREAMS : 1
  AFFINITY : NONE
  INFERENCE_NUM_THREADS : 0
  PERF_COUNT : NO
  INFERENCE_PRECISION_HINT : f32
  PERFORMANCE_HINT : LATENCY
  EXECUTION_MODE_HINT : PERFORMANCE
  PERFORMANCE_HINT_NUM_REQUESTS : 0
  ENABLE_CPU_PINNING : YES
  SCHEDULING_CORE_TYPE : ANY_CORE
  MODEL_DISTRIBUTION_POLICY : ""
  ENABLE_HYPER_THREADING : YES
  DEVICE_ID : ""
  CPU_DENORMALS_OPTIMIZATION : NO
  LOG_LEVEL : LOG_NONE
  CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE : 1
  DYNAMIC_QUANTIZATION_GROUP_SIZE : 0
  KV_CACHE_PRECISION : f16

GPU
IMMUTABLE PROPERTIES:
  AVAILABLE_DEVICES : 0
  RANGE_FOR_ASYNC_INFER_REQUESTS : 1 2 1
  RANGE_FOR_STREAMS : 1 2
  OPTIMAL_BATCH_SIZE : 1
  MAX_BATCH_SIZE : 1
  DEVICE_ARCHITECTURE : GPU: vendor=0x8086 arch=v9.0.0
  FULL_DEVICE_NAME : Intel(R) UHD Graphics 620 (iGPU)
  DEVICE_UUID : 00000000000000000000000000000000
  DEVICE_LUID : 0000000000000000
  DEVICE_TYPE : integrated
  DEVICE_GOPS : {f16:844.8,f32:422.4,i8:422.4,u8:422.4}
  OPTIMIZATION_CAPABILITIES : FP32 BIN FP16 EXPORT_IMPORT
  GPU_DEVICE_TOTAL_MEM_SIZE : 3379195904
  GPU_UARCH_VERSION : 9.0.0
  GPU_EXECUTION_UNITS_COUNT : 24
  GPU_MEMORY_STATISTICS : ""
MUTABLE PROPERTIES:
  PERF_COUNT : NO
  MODEL_PRIORITY : MEDIUM
  GPU_HOST_TASK_PRIORITY : MEDIUM
  GPU_QUEUE_PRIORITY : MEDIUM
  GPU_QUEUE_THROTTLE : MEDIUM
  GPU_ENABLE_LOOP_UNROLLING : YES
  GPU_DISABLE_WINOGRAD_CONVOLUTION : NO
  CACHE_DIR : ""
  CACHE_MODE : optimize_speed
  PERFORMANCE_HINT : LATENCY
  EXECUTION_MODE_HINT : PERFORMANCE
  COMPILATION_NUM_THREADS : 8
  NUM_STREAMS : 1
  PERFORMANCE_HINT_NUM_REQUESTS : 0
  INFERENCE_PRECISION_HINT : f16
  ENABLE_CPU_PINNING : NO
  DEVICE_ID : 0

Performance issue description

import openvino as ov
import openvino.properties.hint as hints
core = ov.Core()
# Per-device performance hints
device_property = {
    "GPU": {
        hints.execution_mode: hints.ExecutionMode.PERFORMANCE,
        hints.performance_mode : hints.PerformanceMode.LATENCY,
        hints.inference_precision: ov.Type.f16,
        },
    "CPU": {
        hints.execution_mode: hints.ExecutionMode.PERFORMANCE,
        hints.performance_mode : hints.PerformanceMode.LATENCY,
        hints.inference_precision: ov.Type.f32,
        }
}

core.set_property("HETERO", {"MULTI_DEVICE_PRIORITIES": "GPU,CPU"})
core.set_property("GPU", device_property["GPU"])
core.set_property("CPU", device_property["CPU"])

Step-by-step reproduction

The IR model is based on CodeFormer and compressed to FP16. This issue appears to be about the configuration settings only, irrespective of the specific model or inference code.

When running the inference using

compiled_model = core.compile_model(model=model, device_name="CPU")

The inference takes 4.196 secs., and with

compiled_model = core.compile_model(model=model, device_name="GPU")

it takes 22.595 secs.

Is this caused by a wrong configuration or by the integrated GPU's limitations? Are there any other suggestions for improving performance? Does the API automatically pick the best settings for the device type, or does one have to set things manually to get optimal performance?

Cheers


Wan-Intel commented 1 week ago

Hi prashant-saxena,

Depending on the model used, device-specific optimizations and network compilation can make the compile step time-consuming, especially with larger models.

OpenVINO™ can cache the model once it is compiled on supported devices and reuse it in later compile_model calls by simply setting a cache folder beforehand:

import time
from pathlib import Path

import openvino as ov

# Create cache folder
cache_folder = Path("cache")
cache_folder.mkdir(exist_ok=True)

start = time.time()
core = ov.Core()

# Set cache folder
core.set_property({'CACHE_DIR': cache_folder})

# Compile the model as before (model_path and device as in your own script)
model = core.read_model(model=model_path)
compiled_model = core.compile_model(model, device)
print(f"Cache enabled (first time) - compile time: {time.time() - start}s")

For more information, please refer to Model Caching Overview in OpenVINO™ 2024.3 Documentation.
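
As a rough check, a second run with the same cache folder should compile noticeably faster, since the compiled blob is loaded from the cache instead of being rebuilt; a minimal sketch, reusing the names from the snippet above:

# Second compile with the cache already populated; the blob is loaded
# from cache_folder instead of being recompiled from scratch.
start = time.time()
core = ov.Core()
core.set_property({'CACHE_DIR': cache_folder})
model = core.read_model(model=model_path)
compiled_model = core.compile_model(model, device)
print(f"Cache enabled (second time) - compile time: {time.time() - start}s")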

prashant-saxena commented 1 week ago

Hi Wan-Intel,

This post is not about the compilation step or its timing, but about inference time. Inference takes 4.196 secs. on the CPU and 22.595 secs. on the iGPU for the same data (a 512x512 image), so the iGPU is more than five times slower than the CPU. Why?
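
One way to dig into where the iGPU time goes is per-layer profiling; a minimal sketch, assuming the model path and input names from this thread ("PERF_COUNT" appears among the mutable device properties listed above, and profiling_info is the runtime's per-layer report):

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model(model="models/ir_model/codeformer.xml")
# Enable per-layer performance counters for this compiled model
compiled_model = core.compile_model(model, "GPU", {"PERF_COUNT": True})

infer_request = compiled_model.create_infer_request()
dummy_input = {
    'x': np.zeros((1, 3, 512, 512), dtype=np.float32),  # 512x512 RGB, NCHW
    'w': np.float64(1.0),
}
infer_request.infer(inputs=dummy_input)

# Print the ten slowest layers first
for info in sorted(infer_request.profiling_info,
                   key=lambda p: p.real_time, reverse=True)[:10]:
    print(info.node_name, info.node_type, info.real_time)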

Wan-Intel commented 1 week ago

Could you please share the following information with us for further investigation?

prashant-saxena commented 1 week ago

Download the CodeFormer ONNX model from here.

Convert it to IR using:

import openvino as ov
model = ov.convert_model("models/codeformer.onnx", output=["y"])
ov.save_model(model, "models/codeformer.xml", compress_to_fp16=True)
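
Optionally, one can confirm the converted IR's input names and shapes before running inference; a minimal sketch:

# Confirm the IR's inputs (the test script below feeds 'x' and 'w')
import openvino as ov

core = ov.Core()
model = core.read_model("models/codeformer.xml")
for inp in model.inputs:
    print(inp.any_name, inp.partial_shape, inp.element_type)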

Test image: cropped (attached)

Test script:

#!/usr/bin/env python
# coding: utf-8

# python imports
from time import perf_counter

# pip imports
import numpy as np
import openvino as ov
import openvino.properties.hint as hints
from PIL import Image

# Initialize the OpenVINO runtime core
core = ov.Core()

# Per-device performance hints
device_property = {
    "GPU": {
        hints.execution_mode: hints.ExecutionMode.PERFORMANCE,
        hints.performance_mode : hints.PerformanceMode.LATENCY,
        hints.inference_precision: ov.Type.f16,
        },
    "CPU": {
        hints.execution_mode: hints.ExecutionMode.PERFORMANCE,
        hints.performance_mode : hints.PerformanceMode.LATENCY,
        hints.inference_precision: ov.Type.f32,
        }
}

core.set_property("HETERO", {"MULTI_DEVICE_PRIORITIES": "GPU,CPU"})
core.set_property("GPU", device_property["GPU"])
core.set_property("CPU", device_property["CPU"])

# Load input image using PIL. Make sure it's 512x512
img = Image.open('cropped.png')
original_size = img.size
img = np.asarray(img)

# Load the network from the IR model files
model = core.read_model(model="models/ir_model/codeformer.xml")

# Compile the model for the CPU/GPU
compiled_model = core.compile_model(model=model, device_name="CPU")

# Create an inference request
infer_request = compiled_model.create_infer_request()

# Preprocess
img = img.astype(np.float32) / 255.0
img = (img - 0.5) / 0.5
img = np.expand_dims(img, axis=0)
img = img.transpose(0, 3, 1, 2)  

# Fidelity weight for CodeFormer (0 favours quality, 1 favours fidelity)
w = np.float64(1.0)

# Prepare input dictionary
input_dict = {'x': img, 'w': w,}

# Perform inference
t1_start = perf_counter() 
infer_request.infer(inputs=input_dict)
print(f'Inference Time : {perf_counter()-t1_start:.3f} secs.')

# Get the output
output = infer_request.get_output_tensor(0).data

# Post-process
output_img = output[0].transpose(1, 2, 0)  
output_img = (output_img * 0.5) + 0.5
output_img = (output_img * 255).astype(np.uint8)

# Save using PIL
im = Image.fromarray(output_img)
im.save("output.png")
im.show()

Change the device from CPU to GPU and observe the time difference:

compiled_model = core.compile_model(model=model, device_name="GPU")
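
Note that timing a single infer() call can include one-time warm-up cost, especially on GPU; a minimal sketch that warms up and then averages, reusing infer_request and input_dict from the script above:

# Warm up once (untimed), then average several timed runs so one-time
# startup overhead does not dominate the measurement.
infer_request.infer(inputs=input_dict)

n_runs = 5
t1_start = perf_counter()
for _ in range(n_runs):
    infer_request.infer(inputs=input_dict)
print(f'Average inference time: {(perf_counter() - t1_start) / n_runs:.3f} secs.')
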
Wan-Intel commented 1 week ago

I've run inference with both the CPU and GPU plugins and reproduced the issue: inference time on the CPU is 7.533 secs. and on the iGPU is 43.617 secs.

[Screenshot: inference result]

Let me check with the relevant team and we will update you as soon as possible.