openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: OpenVINO as backend to PyTorch - integrated GPU works but discrete GPU gives NaNs #25571

Open js333031 opened 2 months ago

js333031 commented 2 months ago

OpenVINO Version

2024.2.0

Operating System

Other (Please specify in description)

Device used for inference

GPU

Framework

PyTorch

Model used

No response

Issue description

Using OpenVINO 2024.2.0 as a backend to PyTorch (intel_extension_for_pytorch 2.1.30.post0) with Python 3.10 in a conda-forge environment, on Ubuntu 24.04 with oneAPI 2024.1. The hardware is a 12th Gen NUC with a 12th Gen Intel(R) Core(TM) i7-12700H CPU and an A770M GPU.

Output of the hello_query_device sample from this environment:

(py310) user@NUC12SNKi72:~$ python3 /usr/share/openvino/samples/python/hello_query_device/hello_query_device.py
[ INFO ] Available devices:
[ INFO ] CPU :
[ INFO ]        SUPPORTED_PROPERTIES:
[ INFO ]                AVAILABLE_DEVICES:
[ INFO ]                RANGE_FOR_ASYNC_INFER_REQUESTS: 1, 1, 1
[ INFO ]                RANGE_FOR_STREAMS: 1, 20
[ INFO ]                EXECUTION_DEVICES: CPU
[ INFO ]                FULL_DEVICE_NAME: 12th Gen Intel(R) Core(TM) i7-12700H
[ INFO ]                OPTIMIZATION_CAPABILITIES: FP32, INT8, BIN, EXPORT_IMPORT
[ INFO ]                DEVICE_TYPE: Type.INTEGRATED
[ INFO ]                DEVICE_ARCHITECTURE: intel64
[ INFO ]                NUM_STREAMS: 1
[ INFO ]                INFERENCE_NUM_THREADS: 0
[ INFO ]                PERF_COUNT: False
[ INFO ]                INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]                PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ]                EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]                PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]                ENABLE_CPU_PINNING: True
[ INFO ]                SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]                MODEL_DISTRIBUTION_POLICY: set()
[ INFO ]                ENABLE_HYPER_THREADING: True
[ INFO ]                DEVICE_ID:
[ INFO ]                CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]                LOG_LEVEL: Level.NO
[ INFO ]                CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[ INFO ]                DYNAMIC_QUANTIZATION_GROUP_SIZE: 0
[ INFO ]                KV_CACHE_PRECISION: <Type: 'float16'>
[ INFO ]                AFFINITY: Affinity.HYBRID_AWARE
[ INFO ]
[ INFO ] GPU.0 :
[ INFO ]        SUPPORTED_PROPERTIES:
[ INFO ]                AVAILABLE_DEVICES: 0, 1
[ INFO ]                RANGE_FOR_ASYNC_INFER_REQUESTS: 1, 2, 1
[ INFO ]                RANGE_FOR_STREAMS: 1, 2
[ INFO ]                OPTIMAL_BATCH_SIZE: 1
[ INFO ]                MAX_BATCH_SIZE: 1
[ INFO ]                DEVICE_ARCHITECTURE: GPU: vendor=0x8086 arch=v12.3.0
[ INFO ]                FULL_DEVICE_NAME: Intel(R) Iris(R) Xe Graphics (iGPU)
[ INFO ]                DEVICE_UUID: 8680a6460c0000000002000000000000
[ INFO ]                DEVICE_LUID: 0200000000000000
[ INFO ]                DEVICE_TYPE: Type.INTEGRATED
[ INFO ]                DEVICE_GOPS: {<Type: 'float16'>: 4300.7998046875, <Type: 'float32'>: 2150.39990234375, <Type: 'int8_t'>: 8601.599609375, <Type: 'uint8_t'>: 8601.599609375}
[ INFO ]                OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, EXPORT_IMPORT
[ INFO ]                GPU_DEVICE_TOTAL_MEM_SIZE: 14863626240
[ INFO ]                GPU_UARCH_VERSION: 12.3.0
[ INFO ]                GPU_EXECUTION_UNITS_COUNT: 96
[ INFO ]                GPU_MEMORY_STATISTICS: {}
[ INFO ]                PERF_COUNT: False
[ INFO ]                MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]                GPU_ENABLE_SDPA_OPTIMIZATION: True
[ INFO ]                GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]                GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]                CACHE_DIR:
[ INFO ]                CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]                PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ]                EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]                COMPILATION_NUM_THREADS: 20
[ INFO ]                NUM_STREAMS: 1
[ INFO ]                PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]                INFERENCE_PRECISION_HINT: <Type: 'float16'>
[ INFO ]                ENABLE_CPU_PINNING: False
[ INFO ]                DEVICE_ID: 0
[ INFO ]
[ INFO ] GPU.1 :
[ INFO ]        SUPPORTED_PROPERTIES:
[ INFO ]                AVAILABLE_DEVICES: 0, 1
[ INFO ]                RANGE_FOR_ASYNC_INFER_REQUESTS: 1, 2, 1
[ INFO ]                RANGE_FOR_STREAMS: 1, 2
[ INFO ]                OPTIMAL_BATCH_SIZE: 1
[ INFO ]                MAX_BATCH_SIZE: 1
[ INFO ]                DEVICE_ARCHITECTURE: GPU: vendor=0x8086 arch=v12.55.8
[ INFO ]                FULL_DEVICE_NAME: Intel(R) Arc(TM) A770M Graphics (dGPU)
[ INFO ]                DEVICE_UUID: 86809056080000000300000000000000
[ INFO ]                DEVICE_LUID: 0200000000000000
[ INFO ]                DEVICE_TYPE: Type.DISCRETE
[ INFO ]                DEVICE_GOPS: {<Type: 'float16'>: 0.0, <Type: 'float32'>: 16793.599609375, <Type: 'int8_t'>: 0.0, <Type: 'uint8_t'>: 0.0}
[ INFO ]                OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, GPU_HW_MATMUL, EXPORT_IMPORT
[ INFO ]                GPU_DEVICE_TOTAL_MEM_SIZE: 16225243136
[ INFO ]                GPU_UARCH_VERSION: 12.55.8
[ INFO ]                GPU_EXECUTION_UNITS_COUNT: 512
[ INFO ]                GPU_MEMORY_STATISTICS: {}
[ INFO ]                PERF_COUNT: False
[ INFO ]                MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]                GPU_ENABLE_SDPA_OPTIMIZATION: True
[ INFO ]                GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]                GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]                CACHE_DIR:
[ INFO ]                CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]                PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ]                EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]                COMPILATION_NUM_THREADS: 20
[ INFO ]                NUM_STREAMS: 1
[ INFO ]                PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]                INFERENCE_PRECISION_HINT: <Type: 'float16'>
[ INFO ]                ENABLE_CPU_PINNING: False
[ INFO ]                DEVICE_ID: 1
[ INFO ]
(py310) user@NUC12SNKi72:~$
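
Since the two GPU entries above are long and mostly identical, a small script can help spot where GPU.0 and GPU.1 actually differ. The following is only a sketch that uses the Python standard library and assumes the log format shown above (an "[ INFO ]" prefix, device headers such as "GPU.0 :", and "KEY: value" property lines). The names parse_devices and diff_devices are hypothetical helpers, not part of OpenVINO.

```python
import re

# Matches a device header such as "GPU.0 :" (name, whitespace, colon).
DEVICE_HEADER = re.compile(r"^(\S+)\s+:$")
PREFIX = "[ INFO ]"


def parse_devices(log_text):
    """Parse hello_query_device output into {device: {property: value}}."""
    devices = {}
    current = None
    for raw in log_text.splitlines():
        raw = raw.strip()
        if not raw.startswith(PREFIX):
            continue  # skip shell prompts and anything that is not a log line
        line = raw[len(PREFIX):].strip()
        header = DEVICE_HEADER.match(line)
        if header:
            current = devices.setdefault(header.group(1), {})
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only; values like "GPU: vendor=..." contain colons.
        key, _, value = line.partition(":")
        if key != "SUPPORTED_PROPERTIES":
            current[key.strip()] = value.strip()
    return devices


def diff_devices(devices, a, b):
    """Return {property: (value_on_a, value_on_b)} for properties that differ."""
    props_a, props_b = devices[a], devices[b]
    return {
        key: (props_a.get(key), props_b.get(key))
        for key in sorted(set(props_a) | set(props_b))
        if props_a.get(key) != props_b.get(key)
    }
```

Saving the output above to a file and calling diff_devices(parse_devices(text), "GPU.0", "GPU.1") should list properties such as DEVICE_TYPE, DEVICE_GOPS and OPTIMIZATION_CAPABILITIES, which differ between the two devices in this log. Whether any of these differences relates to the NaNs is not known.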

Step-by-step reproduction

In the code below, changing the device in the torch.compile() call from GPU.1 to GPU.0 makes the prediction print real numbers. With GPU.1, NaNs are printed.

(py310) user@NUC12SNKi72:~$ cat test.py

import torch
import intel_extension_for_pytorch as ipex
import torchvision.models as models
import openvino.torch

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)
model = torch.compile(model, backend="openvino", options = {"device" : "GPU.1", "model_caching" : True, "cache_dir": "./model_cache"})
#model = torch.compile(model, backend="openvino", options = {"device" : "CPU"})

model = model.to("xpu")
data = data.to("xpu")

data = torch.rand((1,3,224,224))

print("Input data shape: ", data.shape)

dtype=torch.bfloat16
data=data.to('xpu')

pred=model(data)

print("Prediction: ", pred)

(py310) user@NUC12SNKi72:~$
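
To report how much of the prediction is affected rather than eyeballing the printed tensor, the output values can be counted. This is a minimal stdlib-only sketch; nan_summary is a hypothetical helper name, and the commented usage assumes the pred tensor from test.py can be copied to the CPU and converted to a Python list.

```python
import math


def nan_summary(values):
    """Return (nan_count, total_count) for an iterable of floats."""
    nan_count = 0
    total = 0
    for value in values:
        total += 1
        if math.isnan(value):
            nan_count += 1
    return nan_count, total


# Hypothetical usage with test.py above (requires torch, not run here):
#   nan_count, total = nan_summary(pred.detach().cpu().flatten().tolist())
#   print(f"{nan_count} of {total} outputs are NaN")
```

Running this once with GPU.0 and once with GPU.1 would show whether every output is NaN on GPU.1 or only part of the tensor.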

Relevant log output

No response


js333031 commented 6 days ago

Any updates?