halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Using halide with intel opencl runtime driver throws exception "pure virtual method called" with new glibc version 2.38-7 #7885

Open vawale opened 11 months ago

vawale commented 11 months ago

Issue

Running Halide programs with the Intel OpenCL driver on systems that have newer glibc versions installed results in the following error:

pure virtual method called
terminate called without an active exception
Aborted

Reproduction steps

I will walk through the reproduction steps using the latest archlinux image from Docker Hub and this version of intel-opencl-runtime from the AUR.

  1. Get the latest archlinux image from dockerhub:

    docker run --rm -it archlinux:latest bash
  2. Upgrade system:

    pacman-key --init
    pacman --noconfirm --sync --refresh archlinux-keyring
    pacman --noconfirm --sync --refresh --refresh --sysupgrade --sysupgrade

    pacman --query glibc

I get glibc version `2.38-7`.

  3. Install compilers and intel-opencl-runtime driver dependencies:
```sh
pacman --noconfirm --sync --needed clang clang-tools-extra clinfo make lld git sudo numactl intel-tbb wget ocl-icd fakeroot gdb gcc
[[ -x /usr/lib/libtinfo.so.5 ]] || ln -s /usr/lib/libtinfo.so.{6,5}
```
  4. Install intel-opencl-runtime drivers:
    
    git clone https://aur.archlinux.org/intel-opencl-runtime.git
    chmod a+rw --recursive intel-opencl-runtime/
    pushd intel-opencl-runtime/
    sed --in-place "s/ 'ncurses5-compat-libs'//" PKGBUILD
    sudo --user nobody --preserve-env makepkg
    pacman --upgrade --noconfirm intel-opencl-runtime-1\:18.1.0.015-3-x86_64.pkg.tar.zst
    popd

clinfo

<details>
  <summary>My clinfo output:</summary>

```sh
Number of platforms                               1
  Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 2.1 LINUX
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint 
  Platform Extensions function suffix             INTEL
  Platform Host timer resolution                  1ns

  Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications
Number of devices                                 1
  Device Name                                     12th Gen Intel(R) Core(TM) i9-12900K
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 2.1 (Build 0)
  Driver Version                                  18.1.0.0920
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               24
  Max clock frequency                             0MHz
  Device Partition                                (core)
    Max number of sub-devices                     24
    Supported partition types                     by counts, equally, by names (Intel)
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
  Preferred work group size multiple (kernel)     128
  Max sub-groups per work group                   1
  Preferred / native vector sizes                 
    char                                                 1 / 32      
    short                                                1 / 16      
    int                                                  1 / 8       
    long                                                 1 / 4       
    half                                                 0 / 0        (n/a)
    float                                                1 / 8       
    double                                               1 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              33373507584 (31.08GiB)
  Error Correction support                        No
  Max memory allocation                           8343376896 (7.77GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   Yes
    Atomics                                       Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             65536 (64KiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1310720 (1.25MiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             480
    Max size for 1D images from buffer            521461056 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   64 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 480
    Max number of write image args                480
    Max number of read/write image args           480
  Max number of pipe args                         16
  Max active pipe reservations                    10922
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     480
  Max constant buffer size                        131072 (128KiB)
  Max size of kernel argument                     3840 (3.75KiB)
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Local thread execution (Intel)                Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                4294967295 (4GiB)
    Max size                                      4294967295 (4GiB)
  Max queues on device                            4294967295
  Max events on device                            4294967295
  Prefer user sync for interop                    No
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.0
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel(R) CPU Runtime for OpenCL(TM) Applications
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [INTEL]
  clCreateContext(NULL, ...) [default]            Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Intel(R) CPU Runtime for OpenCL(TM) Applications
    Device Name                                   12th Gen Intel(R) Core(TM) i9-12900K
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)
    Platform Name                                 Intel(R) CPU Runtime for OpenCL(TM) Applications
    Device Name                                   12th Gen Intel(R) Core(TM) i9-12900K
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel(R) CPU Runtime for OpenCL(TM) Applications
    Device Name                                   12th Gen Intel(R) Core(TM) i9-12900K

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.2
  ICD loader Profile                              OpenCL 3.0
```

</details>

  5. Download halide libraries and headers:

    wget https://github.com/halide/Halide/releases/download/v16.0.0/Halide-16.0.0-x86-64-linux-1e963ff817ef0968cc25d811a25a7350c8953ee6.tar.gz
    mkdir halide_repro
    tar -xvf Halide-16.0.0-x86-64-linux-1e963ff817ef0968cc25d811a25a7350c8953ee6.tar.gz -C halide_repro
  6. Download the attached files: halide_repro.zip

  7. Compile gradient.cpp to a binary. Executing this binary creates a static library that uses the OpenCL target for the function gradient. The implementation details of this function don't matter for the reproduction; it is needed only so that the actual program links against the right halide_opencl_device_interface symbols.

    gradient.cpp
#include <Halide.h>

int main() {
    Halide::Func gradient("gradient");
    Halide::Var x, y;
    gradient(x, y) = x + y;

    Halide::Target target = Halide::get_host_target();
    target.set_feature(Halide::Target::OpenCL);

    gradient.compile_to_static_library("libgradient",
                                       gradient.infer_arguments(),
                                       "gradient",
                                       target);
    return 0;
}

cd halide_repro
clang++ gradient.cpp -g -I ./Halide-16.0.0-x86-64-linux/include/ -L ./Halide-16.0.0-x86-64-linux/lib/ -lHalide -lpthread -ldl -o gradient -std=c++17
export LD_LIBRARY_PATH=./Halide-16.0.0-x86-64-linux/lib/
./gradient

This should create the libgradient.a static library, which defines the symbol halide_opencl_device_interface:

nm libgradient.a | grep "halide_opencl_device_interface"
  8. Compile and run the sample basic02.cpp, which allocates some memory using Halide::Runtime::Buffer::device_malloc.
    basic02.cpp
#include <Halide.h>
#include <HalideBuffer.h>
#include <HalideRuntimeOpenCL.h>

#include <vector>

int main() {
    std::vector<int> sizes{1,1,1};

    auto buffer = Halide::Runtime::Buffer<uint8_t>(nullptr, sizes);
    buffer.device_malloc(halide_opencl_device_interface());

    return 0;
}

clang++ basic02.cpp -g -I ./Halide-16.0.0-x86-64-linux/include/ -L ./Halide-16.0.0-x86-64-linux/lib/ -L . -lHalide -lpthread -ldl -lgradient -o basic02 -std=c++17
  9. Execute the binary basic02, which results in the error mentioned in the issue:
    ./basic02 
    pure virtual method called
    terminate called without an active exception
    Aborted

Some debugging notes

Get debug symbols for glibc-2.38-7:

pacman --upgrade --noconfirm https://geo.mirror.pkgbuild.com/core-debug/os/x86_64/glibc-debug-2.38-7-x86_64.pkg.tar.zst

Running the program with gdb shows the following stack trace on failure:

gdb ./basic02
run

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fefefbe58a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fefefb95668 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fefefb7d4b8 in __GI_abort () at abort.c:79
#4  0x00007fefefee7a6f in __gnu_cxx::__verbose_terminate_handler () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#5  0x00007fefefefb11c in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#6  0x00007fefefefb189 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#7  0x00007fefefefbec7 in __cxxabiv1::__cxa_pure_virtual () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/pure.cc:50
#8  0x00007fefef640870 in ?? () from /opt/intel/opencl-runtime/linux/compiler/lib/intel64_lin/libintelocl.so
#9  0x00007fefef655505 in clFinish () from /opt/intel/opencl-runtime/linux/compiler/lib/intel64_lin/libintelocl.so
#10 0x00007fefefb10f75 in clFinish () from /usr/lib/libOpenCL.so
#11 0x0000562543b7271b in halide_opencl_device_release ()
#12 0x00007feff8cf20e2 in _dl_call_fini (closure_map=closure_map@entry=0x7feff8d252d0) at dl-call_fini.c:43
#13 0x00007feff8cf5d9c in _dl_fini () at dl-fini.c:78
#14 0x00007fefefb97cc6 in __run_exit_handlers (status=0, listp=0x7fefefd2f680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#15 0x00007fefefb97e10 in __GI_exit (status=<optimized out>) at exit.c:141
#16 0x00007fefefb7ecd7 in __libc_start_call_main (main=main@entry=0x562543b63630 <main()>, argc=argc@entry=1, argv=argv@entry=0x7ffd60882758) at ../sysdeps/nptl/libc_start_call_main.h:74
#17 0x00007fefefb7ed8a in __libc_start_main_impl (main=0x562543b63630 <main()>, argc=1, argv=0x7ffd60882758, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd60882748) at ../csu/libc-start.c:360
#18 0x0000562543b63525 in _start ()

Based on the stack trace, the issue probably surfaces only with newer glibc versions because of the changes made in https://github.com/bminor/glibc/commit/6985865bc3ad5b23147ee73466583dd7fdf65892

abadams commented 11 months ago

This sounds like a change in global destructor order has triggered a bug inside the intel opencl driver. I can't think of anything that we do that could cause it to call a pure virtual function on one of its internal data structures. If we are doing something wrong with the OpenCL API, it's supposed to return an error code, not crash.
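
For readers unfamiliar with the error message: frame #7 of the trace is libstdc++'s `__cxa_pure_virtual`, which is what runs when a pure virtual slot is invoked on an object whose derived part no longer exists (or does not exist yet). The driver's internals are not visible in the trace, so the following is only a generic C++ sketch of that mechanism, not a diagnosis of libintelocl.so. The names `Device`, `CpuDevice`, `flush`, and `destroy_and_log` are hypothetical. To keep the example safe to run, `Device::flush` is an ordinary virtual function rather than a pure one; the log shows that the base-class version is selected once the derived destructor has finished.

```cpp
#include <string>
#include <vector>

// Records which function ran, in order. Used only by this illustration.
static std::vector<std::string> g_log;

// Hypothetical base class. If flush() were declared pure (= 0) and were
// reached indirectly during destruction, libstdc++ would end up in
// __cxa_pure_virtual, print "pure virtual method called", and terminate.
struct Device {
    virtual ~Device() {
        // While ~Device runs, the CpuDevice part is already destroyed, so
        // virtual dispatch resolves to Device::flush, not CpuDevice::flush.
        flush();
    }
    virtual void flush() { g_log.push_back("Device::flush"); }
};

// Hypothetical derived class that overrides flush().
struct CpuDevice : Device {
    ~CpuDevice() override { g_log.push_back("~CpuDevice"); }
    void flush() override { g_log.push_back("CpuDevice::flush"); }
};

// Creates a CpuDevice, uses it once, destroys it, and returns the log.
std::vector<std::string> destroy_and_log() {
    g_log.clear();
    {
        CpuDevice dev;
        dev.flush();  // object fully alive: the derived override runs
    }                 // ~CpuDevice runs first, then ~Device
    return g_log;
}
```

Calling `destroy_and_log()` yields three entries: `CpuDevice::flush`, `~CpuDevice`, `Device::flush`. A change in the order in which libraries are torn down at exit can put a caller in the same position as the last entry: it reaches an object after the part that implemented the virtual function is gone.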

vawale commented 11 months ago

Yes, I think you are right. I do not get this issue if I use the intel-compute-runtime drivers, but those support only the GPU device type. I also do not get this issue with the pocl OpenCL implementation, which supports the CPU device type.

I will report this bug to the maintainers of https://www.intel.com/content/www/us/en/developer/articles/tool/opencl-drivers.html. Thanks for your help :)

vawale commented 11 months ago

Btw, from the stack trace it looks like the halide_opencl_device_release function is called after main exits. Why is the call to clFinish made after main exits? Is it called for any objects with static storage duration?

abadams commented 11 months ago

Yes, it's releasing any compiled shader programs, the command queue, and the context. Maybe we shouldn't do that.
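
One way an application could avoid depending on exit-time ordering is to release device resources while main is still running, so that any later exit-time hook finds nothing left to do. The sketch below models that pattern in plain C++. `FakeRuntime`, `ReleaseOnScopeExit`, and `run_with_early_release` are hypothetical names and do not reflect Halide's actual implementation. Halide's runtime header does declare `halide_device_release` for explicit release, but whether calling it before main returns avoids this particular driver crash has not been verified in this thread.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a runtime that owns driver resources
// (context, command queue, compiled programs). Not Halide's code.
class FakeRuntime {
public:
    explicit FakeRuntime(std::vector<std::string> &log) : log_(log) {}

    void acquire() {
        if (!held_) {
            held_ = true;
            log_.push_back("acquire");
        }
    }

    // Idempotent: a second call (for example from an exit-time hook)
    // finds nothing to release and never touches the driver.
    void release() {
        if (held_) {
            held_ = false;
            log_.push_back("release");
        }
    }

    bool held() const { return held_; }

private:
    std::vector<std::string> &log_;
    bool held_ = false;
};

// Scope guard: releases the runtime when the enclosing scope ends,
// which for a guard declared in main is before any exit handlers run.
class ReleaseOnScopeExit {
public:
    explicit ReleaseOnScopeExit(FakeRuntime &rt) : rt_(rt) {}
    ~ReleaseOnScopeExit() { rt_.release(); }
    ReleaseOnScopeExit(const ReleaseOnScopeExit &) = delete;
    ReleaseOnScopeExit &operator=(const ReleaseOnScopeExit &) = delete;

private:
    FakeRuntime &rt_;
};

// Simulates "body of main, then exit-time hook" and returns the log.
std::vector<std::string> run_with_early_release() {
    std::vector<std::string> log;
    FakeRuntime rt(log);
    {
        ReleaseOnScopeExit guard(rt);  // plays the role of main's scope
        rt.acquire();
    }              // guard releases here, while everything is still alive
    rt.release();  // plays the role of the exit-time hook: a no-op now
    return log;
}
```

`run_with_early_release()` returns a log with exactly one `acquire` and one `release`: the guard performs the release, and the simulated exit-time call does nothing.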