[Bug]: Matmul error on GPU plugin and batch size 16

dkurt commented 1 year ago

OpenVINO Version

2023.1.0.dev20230811 (pip)

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

GPU

Framework

None

Model used

No response

Issue description

This can also be reproduced with the nGraph C++ API. Below is a PyTorch reproducer for reference. An error of around 0.015 is acceptable, but for batch size 16 the output is completely wrong (a maximum difference of 5.251 compared to PyTorch).

Step-by-step reproduction

import numpy as np
import torch
import torch.nn as nn
import openvino
from openvino.runtime import Core

print('PyTorch version', torch.__version__)
print('OV version', openvino.runtime.__version__)

torch.manual_seed(123)
np.random.seed(141)

class Model(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.weights = torch.tensor(np.random.standard_normal((60, 3))).float()

    def forward(self, x):
        return torch.matmul(x, self.weights)

m = Model()
for batch_size in [1, 2, 4, 8, 16]:
    inp = torch.tensor(np.random.standard_normal((batch_size, 60))).float()
    ref = m(inp).detach().numpy()

    torch.onnx.export(m, inp, "model.onnx")

    # Run with OpenVINO

    core = Core()
    compiled = core.compile_model("model.onnx", "GPU")
    req = compiled.create_infer_request()
    out = req.infer(np.array(inp))
    out = next(iter(out.values()))
    print(f"Batch size {batch_size} diff: {np.max(np.abs(ref - out))}")

Relevant log output

PyTorch version 1.13.1+cpu
OV version 2023.1.0-12050-e33de350633
Batch size 1 diff: 0.0032215118408203125
Batch size 2 diff: 0.040172576904296875
Batch size 4 diff: 0.015358924865722656
Batch size 8 diff: 0.040701866149902344
Batch size 16 diff: 5.25123929977417
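
For a rough sense of which differences are plausible from reduced precision alone, here is a minimal sketch that emulates half precision on the CPU with NumPy for the same shapes (`(batch, 60) x (60, 3)`). It is only an illustration under stated assumptions: the helper name `fp16_matmul_error` and the seed are hypothetical, and NumPy's float16 arithmetic is not the GPU plugin's kernel, so the exact numbers will differ from the log above.

```python
import numpy as np


def fp16_matmul_error(batch_size, seed=0):
    """Max abs difference between an FP32 matmul and an FP16-emulated one.

    Hypothetical helper for illustration; shapes mirror the reproducer.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch_size, 60)).astype(np.float32)
    w = rng.standard_normal((60, 3)).astype(np.float32)

    ref = x @ w  # FP32 reference
    # Round the inputs to FP16 and multiply in FP16 (CPU emulation only).
    half = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)
    return float(np.max(np.abs(ref - half)))


for batch_size in [1, 2, 4, 8, 16]:
    print(f"Batch size {batch_size} emulated FP16 diff: {fp16_matmul_error(batch_size)}")
```

Differences from this kind of rounding should stay far below 1 for standard-normal data of this size, so a difference above 5 points to a functional problem rather than precision loss.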



### Issue submission checklist

- [X] I report the issue. It's not a question
- [X] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found the solution
- [X] There is reproducer code and related data files such as images, videos, models, etc.

vurusovs commented 1 year ago

@vladimir-paramuzov @p-durandin please, take a look

sshlyapn commented 1 year ago

Hi @dkurt, could you please provide clinfo application logs to help us identify your OpenCL driver version and GPU model?

dkurt commented 1 year ago

@sshlyapn,

clinfo

```
Number of platforms  1
  Platform Name  Intel(R) OpenCL HD Graphics
  Platform Vendor  Intel(R) Corporation
  Platform Version  OpenCL 3.0
  Platform Profile  FULL_PROFILE
  Platform Extensions  cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_command_queue_families cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_intel_subgroups_char cl_intel_subgroups_long cl_khr_il_program cl_intel_mem_force_host_memory cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_clustered_reduce cl_intel_device_attribute_query cl_khr_suggested_local_work_size cl_intel_spirv_media_block_io cl_intel_spirv_subgroups cl_khr_spirv_no_integer_wrap_decoration cl_intel_unified_shared_memory_preview cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_intel_planar_yuv cl_intel_packed_yuv cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_3d_image_writes cl_intel_media_block_io cl_intel_va_api_media_sharing cl_intel_sharing_format_query cl_khr_pci_bus_info cl_intel_subgroup_local_block_io
  Platform Host timer resolution  1ns
  Platform Extensions function suffix  INTEL

  Platform Name  Intel(R) OpenCL HD Graphics
Number of devices  1
  Device Name  Intel(R) Graphics [0x9a49]
  Device Vendor  Intel(R) Corporation
  Device Vendor ID  0x8086
  Device Version  OpenCL 3.0 NEO
  Driver Version  21.48.21782
  Device OpenCL C Version  OpenCL C 1.2
  Device Type  GPU
  Device Profile  FULL_PROFILE
  Device Available  Yes
  Compiler Available  Yes
  Linker Available  Yes
  Max compute units  80
  Max clock frequency  1300MHz
  Device Partition  (core)
    Max number of sub-devices  0
    Supported partition types  None
    Supported affinity domains  (n/a)
  Max work item dimensions  3
  Max work item sizes  256x256x256
  Max work group size  256
  Preferred work group size multiple  64
  Max sub-groups per work group  32
  Sub-group sizes (Intel)  8, 16, 32
  Preferred / native vector sizes
    char  16 / 16
    short  8 / 8
    int  4 / 4
    long  1 / 1
    half  8 / 8 (cl_khr_fp16)
    float  1 / 1
    double  1 / 1 (n/a)
  Half-precision Floating-point support  (cl_khr_fp16)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
  Single-precision Floating-point support  (core)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support  (n/a)
  Address bits  64, Little-Endian
  Global memory size  6745325568 (6.282GiB)
  Error Correction support  No
  Max memory allocation  1073741824 (1024MiB)
  Unified memory for Host and Device  Yes
  Shared Virtual Memory (SVM) capabilities  (core)
    Coarse-grained buffer sharing  Yes
    Fine-grained buffer sharing  No
    Fine-grained system sharing  No
    Atomics  No
  Minimum alignment for any data type  128 bytes
  Alignment of base address  1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM  64 bytes
    Global  64 bytes
    Local  64 bytes
  Max size for global variable  65536 (64KiB)
  Preferred total size of global vars  1073741824 (1024MiB)
  Global Memory cache type  Read/Write
  Global Memory cache size  786432 (768KiB)
  Global Memory cache line size  64 bytes
  Image support  Yes
    Max number of samplers per kernel  16
    Max size for 1D images from buffer  67108864 pixels
    Max 1D or 2D image array size  2048 images
    Base address alignment for 2D image buffers  4 bytes
    Pitch alignment for 2D image buffers  4 pixels
    Max 2D image size  16384x16384 pixels
    Max planar YUV image size  16384x16352 pixels
    Max 3D image size  2048x2048x2048 pixels
    Max number of read image args  128
    Max number of write image args  128
    Max number of read/write image args  128
  Max number of pipe args  0
  Max active pipe reservations  0
  Max pipe packet size  0
  Local memory type  Local
  Local memory size  65536 (64KiB)
  Max number of constant args  8
  Max constant buffer size  1073741824 (1024MiB)
  Max size of kernel argument  2048 (2KiB)
  Queue properties (on host)
    Out-of-order execution  Yes
    Profiling  Yes
  Queue properties (on device)
    Out-of-order execution  No
    Profiling  No
    Preferred size  0
    Max size  0
  Max queues on device  0
  Max events on device  0
  Prefer user sync for interop  Yes
  Profiling timer resolution  52ns
  Execution capabilities
    Run OpenCL kernels  Yes
    Run native kernels  No
    Sub-group independent forward progress  No
    IL version  SPIR-V_1.2
    SPIR versions  1.2
  printf() buffer size  4194304 (4MiB)
  Built-in kernels  (n/a)
  Device Extensions  cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_command_queue_families cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_intel_subgroups_char cl_intel_subgroups_long cl_khr_il_program cl_intel_mem_force_host_memory cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_clustered_reduce cl_intel_device_attribute_query cl_khr_suggested_local_work_size cl_intel_spirv_media_block_io cl_intel_spirv_subgroups cl_khr_spirv_no_integer_wrap_decoration cl_intel_unified_shared_memory_preview cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_intel_planar_yuv cl_intel_packed_yuv cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_3d_image_writes cl_intel_media_block_io cl_intel_va_api_media_sharing cl_intel_sharing_format_query cl_khr_pci_bus_info cl_intel_subgroup_local_block_io

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel(R) OpenCL HD Graphics
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)  Success [INTEL]
  clCreateContext(NULL, ...) [default]  Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name  Intel(R) OpenCL HD Graphics
    Device Name  Intel(R) Graphics [0x9a49]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name  Intel(R) OpenCL HD Graphics
    Device Name  Intel(R) Graphics [0x9a49]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name  Intel(R) OpenCL HD Graphics
    Device Name  Intel(R) Graphics [0x9a49]

ICD loader properties
  ICD loader Name  OpenCL ICD Loader
  ICD loader Vendor  OCL Icd free software
  ICD loader Version  2.2.11
  ICD loader Profile  OpenCL 2.1
    NOTE: your OpenCL library only supports OpenCL 2.1,
          but some installed platforms support OpenCL 3.0.
          Programs using 3.0 features may crash
          or behave unexpectedly
```

sshlyapn commented 1 year ago

@dkurt thank you!

There was a bug in the FC (fully connected) kernel, so I prepared a fix for this issue.

Please note that by default the GPU plugin uses EXECUTION_MODE_HINT = PERFORMANCE, so all calculations for this model are performed in FP16 precision, which may also affect the final results. You can override this property with the ACCURACY value if needed.
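
A minimal sketch of that override with the Python API might look like the following. The helper names `accuracy_config` and `compile_for_accuracy` are hypothetical and only wrap the property mentioned above; the sketch assumes an OpenVINO release that supports `EXECUTION_MODE_HINT` and a `model.onnx` file like the one exported by the reproducer.

```python
def accuracy_config():
    """Plugin properties asking for accuracy over performance (hypothetical helper)."""
    return {"EXECUTION_MODE_HINT": "ACCURACY"}


def compile_for_accuracy(model_path, device="GPU"):
    """Compile a model with the ACCURACY execution mode (hypothetical helper)."""
    # Imported lazily so the config helper can be used without OpenVINO installed.
    from openvino.runtime import Core

    core = Core()
    return core.compile_model(model_path, device, accuracy_config())


# Usage, assuming model.onnx exists and a GPU device is available:
# compiled = compile_for_accuracy("model.onnx")
# req = compiled.create_infer_request()
```

With this setting the remaining difference against PyTorch should reflect the kernel's behavior rather than the FP16 default, which makes it easier to check whether the fix is in effect.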
