intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platforms
Apache License 2.0

IPEX DPC++ extension ImportError: libtbb.so.12: cannot open shared object file #337

Open troy818 opened 1 year ago

troy818 commented 1 year ago

Describe the bug

Hi, I'm trying to run a PyTorch program with the IPEX DPC++ extension, and I hit this error when JIT-compiling extensions:

*/intel_extension_for_pytorch/xpu/cpp_extension.py", line 848, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
ImportError: libtbb.so.12: cannot open shared object file: No such file or directory
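As a quick diagnostic (not part of the original report), one can check whether the dynamic loader can resolve a shared library before importing the JIT-built module; `libtbb.so.12` is the library the ImportError names:

```python
import ctypes

def can_load(libname):
    """Return True if the dynamic loader can resolve libname.

    This mirrors what happens when Python imports the JIT-built
    extension: dlopen() must find every shared library it links against.
    """
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# On the reporter's machine this would be expected to print False
# until the oneAPI TBB environment (which extends LD_LIBRARY_PATH)
# is sourced.
print(can_load("libtbb.so.12"))
```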

Then I sourced the oneAPI TBB environment (version 2023.0) to resolve this error, but got another one:

RuntimeError: No kernel named _ZTSZZ11unpack_infoN2at6TensorEiENKUlRN4sycl3_V17handlerEE_clES4_EUlNS2_7nd_itemILi1EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)

My SYCL code is below:

torch::Tensor unpack_info(const torch::Tensor packed_info, const int n_samples)
{
    DEVICE_GUARD(packed_info);
    CHECK_INPUT(packed_info);

    const int n_rays = packed_info.size(0);
    const int threads = 256;
    const int blocks = CUDA_N_BLOCKS_NEEDED(n_rays, threads);

    // int n_samples = packed_info[n_rays - 1].sum(0).item<int>();
    torch::Tensor ray_indices = torch::empty(
        {n_samples}, packed_info.options().dtype(torch::kLong));

    // submit kernel
    auto device_type = c10::DeviceType::XPU;
    c10::impl::VirtualGuardImpl impl(device_type);
    c10::Stream dpcpp_stream = impl.getStream(c10::Device(device_type));
    auto queue = xpu::get_queue_from_stream(dpcpp_stream);

    const int* packed_info_data = packed_info.data_ptr<int>();
    int64_t* ray_indices_data = ray_indices.data_ptr<int64_t>();

    queue.submit([&](sycl::handler &cgh){
        cgh.parallel_for(
            sycl::nd_range<1>(sycl::range<1>(blocks * threads), sycl::range<1>(threads)),
            [=](sycl::nd_item<1> item_ct1) {
                unpack_info_kernel(
                    n_rays,
                    packed_info_data,
                    ray_indices_data,
                    item_ct1);
            }
        );
    });

    return ray_indices;
}

void unpack_info_kernel(
    // input
    const int n_rays,
    const int *packed_info,
    // output
    int64_t *ray_indices,
    sycl::nd_item<1> item_ct1)
{
    CUDA_GET_THREAD_ID(i, n_rays);

    // locate
    const int base = packed_info[i * 2 + 0];  // point idx start.
    const int steps = packed_info[i * 2 + 1]; // point idx shift.
    if (steps == 0)
        return;

    ray_indices += base;

    for (int j = 0; j < steps; ++j)
    {
        ray_indices[j] = i;
    }
}

Reference:
https://github.com/intel/llvm/issues/6421
https://github.com/intel/intel-extension-for-pytorch/issues/330

Versions

The versions are:

Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
IGC version: 2023.0.0 (2023.0.0.20221201)
CMake version: version 3.23.2
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 22 2022, 20:18:18)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-1047-oem-x86_64-with-glibc2.29
Is XPU available: True
DPCPP runtime version: 2023.0.0
MKL version: 2023.0.0
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Graphics [0x5690]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
[1] _DeviceProperties(name='Intel(R) Graphics [0x4626]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=25348MB, max_compute_units=96)
Intel OpenCL ICD version: 22.24.23453+i392~u20.04
Level Zero version: 1.3.23453+i392~u20.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          20
On-line CPU(s) list:             0-19
Thread(s) per core:              1
Core(s) per socket:              14
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           154
Model name:                      12th Gen Intel(R) Core(TM) i7-12700H
Stepping:                        3
CPU MHz:                         2700.000
CPU max MHz:                     4700.0000
CPU min MHz:                     400.0000
BogoMIPS:                        5376.00
Virtualization:                  VT-x
L1d cache:                       336 KiB
L1i cache:                       224 KiB
L2 cache:                        8.8 MiB
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] msgpack-numpy==0.4.8
[pip3] numpy==1.24.2
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] mkl                       2022.2.0             intel_8748    file:///opt/intel/oneapi/conda_channel
[conda] mkl-dpcpp                 2022.2.0             intel_8748    file:///opt/intel/oneapi/conda_channel
[conda] mkl-service               2.4.0           py39h7634626_12    file:///opt/intel/oneapi/conda_channel
[conda] mkl_fft                   1.3.1           py39h1909d4f_16    file:///opt/intel/oneapi/conda_channel
[conda] mkl_random                1.2.2           py39h94ca54a_16    file:///opt/intel/oneapi/conda_channel
[conda] mkl_umath                 0.1.1           py39h0348192_26    file:///opt/intel/oneapi/conda_channel
[conda] numpy                     1.21.4          py39h8dc10e9_16    file:///opt/intel/oneapi/conda_channel
[conda] numpy-base                1.21.4          py39h97bc315_16    file:///opt/intel/oneapi/conda_channel

Thanks!

gujinghui commented 1 year ago

@jingxu10 @xuhancn

Please take a look.

  1. Why is TBB needed here?
  2. Why is the kernel missing?
jingxu10 commented 1 year ago

When you compiled your SYCL code, did you source the full oneAPI package environment, or just the DPC++ compiler and MKL environments?

troy818 commented 1 year ago

Hi @jingxu10, at first I sourced only dpcpp and mkl, and got the "libtbb.so.12" error. After sourcing the full oneAPI package environment when compiling the SYCL code, I got the "No kernel named" error instead.

jingxu10 commented 1 year ago

Is there a reproducer that we can test on our side?

troy818 commented 1 year ago

Hi @jingxu10, here is my code: https://github.com/troy818/nerfacc_dpcpp/tree/ipex-xpu. It uses setup.py to build and install, and the DPC++ kernels are in csrc. For the test, you can use packet_build_install.sh.

huiyan2021 commented 1 year ago

Hi @troy818, could you try changing the two occurrences of "icpx" to "dpcpp" in your local intel_extension_for_pytorch/xpu/cpp_extension.py and run packet_build_install.sh again?

troy818 commented 1 year ago

Hi @huiyan2021, I'm using IPEX 1.13.10+xpu, and my local intel_extension_for_pytorch/xpu/cpp_extension.py already uses "dpcpp". I also tried this with IPEX 1.13.120+xpu. The issue is the same whether "dpcpp" or "icpx" is used.

huiyan2021 commented 1 year ago

Change this line: https://github.com/troy818/nerfacc_dpcpp/blob/ipex-xpu/setup.py#L45 to:

cmdclass={"build_ext": DpcppBuildExtension.with_options(use_ninja=False)} if not BUILD_NO_DPCPP else {},

troy818 commented 1 year ago

Hi @huiyan2021, I've tried changing the cmdclass, but the issue is still the same.

huiyan2021 commented 1 year ago

@troy818 Did you change https://github.com/troy818/nerfacc_dpcpp/blob/ipex-xpu/nerfacc/pack.py#L10?

# import nerfacc.cuda as _C
import nerfacc.csrc as _C

Also, please remove the build folder and run packet_build_install.sh again.

This is the output on my side: [screenshot]

troy818 commented 1 year ago

Hi @huiyan2021, the key is use_ninja=False, right? With or without the change from import nerfacc.cuda to import nerfacc.csrc, the missing-kernel error happens unless use_ninja is set to False. (I forgot to remove the build folder after the code change.)

huiyan2021 commented 1 year ago

Two workarounds are needed for the time being:

  1. use_ninja=False
  2. use 'dpcpp' instead of 'icpx'
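For reference, workaround 1 can be sketched in a setup.py along the lines of the reporter's project. This is only a sketch: DPCPPExtension and DpcppBuildExtension are assumed to come from IPEX's xpu.cpp_extension module (DpcppBuildExtension.with_options is used verbatim in the thread), and the package and source names below are placeholders, not the actual project layout. Workaround 2 ("dpcpp" instead of "icpx") is an edit inside intel_extension_for_pytorch/xpu/cpp_extension.py itself, not in setup.py.

```python
# Sketch of workaround 1 (use_ninja=False) applied in a setup.py;
# names marked "placeholder" are illustrative, not from the project.
from setuptools import setup
from intel_extension_for_pytorch.xpu.cpp_extension import (
    DPCPPExtension,
    DpcppBuildExtension,
)

setup(
    name="nerfacc",  # placeholder
    ext_modules=[
        DPCPPExtension(
            name="nerfacc.csrc",
            sources=["csrc/pack.cpp"],  # placeholder source list
        )
    ],
    cmdclass={
        # Workaround 1: disable ninja so the compiled kernel is not
        # lost at build time (the missing-kernel error above).
        "build_ext": DpcppBuildExtension.with_options(use_ninja=False)
    },
)
```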
troy818 commented 1 year ago

Hi @huiyan2021, agreed, and thanks for your help.