No visible XPU devices when running intel-extension-for-tensorflow/tree/main/examples/train_maskrcnn on PVC

ch0801 commented 4 months ago

Failed to run intel-extension-for-tensorflow/tree/main/examples/train_maskrcnn on PVC (GPU Max 1550). The errors below show "Can not found any devices." and "Failed precondition: No visible XPU devices". However, I could run intel-extension-for-tensorflow/tree/main/examples/infer_resnet50 successfully on PVC. I also verified with "sycl-ls" that GPUs 0-7 exist.

2024-07-15 14:24:41.385979: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-15 14:24:41.387096: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-15 14:24:41.387116: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-15 14:24:41.404971: E external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:178] Can not found any devices.
2024-07-15 14:24:41.405019: E itex/core/kernels/xpu_kernel.cc:60] Failed precondition: No visible XPU devices. To check runtime environment on your host, please run itex/tools/python/env_check.py.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-15 14:24:41.450735: E itex/core/devices/gpu/itex_gpu_runtime.cc:174] Can not found any devices. To check runtime environment on your host, please run itex/tools/python/env_check.py.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-15 14:24:41,895 I dllogger PARAMETER image_size: (832, 1344), augment_input_data: True, gt_mask_size: 112, num_classes: 91, skip_crowd_during_training: True, use_category: True, rpn_positive_overlap: 0.7, rpn_negative_overlap: 0.3, rpn_batch_size_per_im: 256, rpn_fg_fraction: 0.5, rpn_min_size: 0.0, batch_size_per_im: 512, fg_fraction: 0.25, fg_thresh: 0.5, bg_thresh_hi: 0.5, bg_thresh_lo: 0.0, fast_rcnn_mlp_head_dim: 1024, bbox_reg_weights: (10.0, 10.0, 5.0, 5.0), include_mask: True, mrcnn_resolution: 28, train_rpn_pre_nms_topn: 2000, train_rpn_post_nms_topn: 1000, train_rpn_nms_threshold: 0.7, test_detections_per_image: 100, test_nms: 0.5, test_rpn_pre_nms_topn: 1000, test_rpn_post_nms_topn: 1000, test_rpn_nms_thresh: 0.7, min_level: 2, max_level: 6, num_scales: 1, aspect_ratios: [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)], anchor_scale: 8.0, rpn_box_loss_weight: 1.0, fast_rcnn_box_loss_weight: 1.0, mrcnn_weight_loss_mask: 1.0, checkpoint_name_format: nvidia_mrcnn_tf2.ckpt, mode: train, mpi_num: 0, data_dir: /nfs/site/home/chinglan/github/int-ext-tf-public/examples/train_maskrcnn/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN/dataset/data, model_dir: /nfs/site/home/chinglan/github/int-ext-tf-public/examples/train_maskrcnn/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN/output, backbone_checkpoint: None, eval_file: /data/annotations/instances_val2017.json, epochs: 12, steps_per_epoch: None, eval_samples: None, train_batch_size: 4, eval_batch_size: 8, seed: None, l2_weight_decay: 0.0001, init_learning_rate: 0.0, learning_rate_values: [0.01, 0.001, 0.0001], learning_rate_boundaries: [0.3, 8.0, 10.0], momentum: 0.9, finetune_bn: False, use_synthetic_data: False, xla: False, amp: False, log_file: mrcnn-dlll.json, log_every: 100, log_warmup_steps: 100, log_graph: False, log_tensorboard: None, verbose: False, eagerly: False
2024:07:15-14:24:42:(40388) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2024:07:15-14:24:42:(40388) |CCL_WARN| CCL_CONFIGURATION_PATH= is unknown to and unused by oneCCL code but is present in the environment, check if it is not mistyped.
2024:07:15-14:24:42:(40388) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
Traceback (most recent call last):
  File "/nfs/site/home/chinglan/github/int-ext-tf-public/examples/train_maskrcnn/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN/main.py", line 83, in <module>
    main()
  File "/nfs/site/home/chinglan/github/int-ext-tf-public/examples/train_maskrcnn/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN/main.py", line 68, in main
    tf.config.experimental.set_visible_devices(xpus[hvd.local_rank()], 'XPU')
IndexError: list index out of range
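
The IndexError at the end of the traceback looks like a consequence of the earlier errors rather than a separate bug: if no XPU devices are detected, the `xpus` list is presumably empty, so `xpus[hvd.local_rank()]` fails even for rank 0. Below is a minimal sketch of a guard that would report the underlying problem directly. `pick_local_device` is a hypothetical helper, not part of the example's `main.py`, and the usage comment assumes `xpus` is obtained from something like `tf.config.list_physical_devices('XPU')`.

```python
def pick_local_device(devices, local_rank):
    """Return the device for this rank, or raise a descriptive error.

    Hypothetical helper: `devices` stands in for the list that main.py
    indexes as xpus[hvd.local_rank()].
    """
    if not devices:
        raise RuntimeError(
            "No XPU devices are visible to TensorFlow; "
            "check the driver and oneAPI runtime before training."
        )
    if not 0 <= local_rank < len(devices):
        raise RuntimeError(
            f"Local rank {local_rank} has no matching device "
            f"(only {len(devices)} visible)."
        )
    return devices[local_rank]


# Assumed usage inside main.py (requires TensorFlow, ITEX and Horovod):
#   xpus = tf.config.list_physical_devices('XPU')
#   device = pick_local_device(xpus, hvd.local_rank())
#   tf.config.experimental.set_visible_devices(device, 'XPU')
```

The guard does not fix device detection; it only replaces the bare IndexError with a message that points at the missing devices.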

wangkl2 commented 4 months ago

@ch0801

E external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:178] Can not found any devices.

From the output, the XPU devices are not detected by the framework. Could you please run the following commands in your conda env to collect the environment info, and share the output with us? Thanks.

wget https://raw.githubusercontent.com/intel/intel-extension-for-tensorflow/main/tools/python/env_check.py
pip install wget
python env_check.py

ch0801 commented 3 months ago

I am using a Python virtual environment (Anaconda use is no longer allowed within Intel).

(rcnn2) chinglan@sdp716090:~/venv_test/rcnn2$ python env_check.py

Check Environment for Intel(R) Extension for TensorFlow*...

__file__:     /nfs/site/home/chinglan/venv_test/rcnn2/env_check.py
100% [................................................................................] 7091 / 7091
Check Python
         Python 3.9.16 is Supported.
Check Python Passed

Check OS
        OS ubuntu:22.04 is Supported
Check OS Passed

Check Tensorflow
2024-07-17 08:35:45.348873: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-17 08:35:45.351659: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-17 08:35:45.379673: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-17 08:35:45.379694: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-17 08:35:45.380729: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-17 08:35:45.385722: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-17 08:35:45.385868: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-17 08:35:46.823486: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-17 08:35:48.338086: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-07-17 08:35:48.338212: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-07-17 08:35:48.338993: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-07-17 08:35:48.339006: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-07-17 08:35:48.775551: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-17 08:35:48.776670: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-17 08:35:48.776689: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-17 08:35:48.794181: E external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:178] Can not found any devices.
2024-07-17 08:35:48.794224: E itex/core/kernels/xpu_kernel.cc:60] Failed precondition: No visible XPU devices. To check runtime environment on your host, please run itex/tools/python/env_check.py.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-17 08:35:48.837084: E itex/core/devices/gpu/itex_gpu_runtime.cc:174] Can not found any devices. To check runtime environment on your host, please run itex/tools/python/env_check.py.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
        Tensorflow 2.15.1 is installed.
Check Tensorflow Passed

Check Intel GPU Driver
Package: intel-level-zero-gpu
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 28728
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-compute-runtime
Version: 1.3.29138.29-881~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.18), libstdc++6 (>= 12), libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812), libnl-3-200, libnl-route-3-200
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
 Level Zero is the primary low-level interface for language and runtime
 libraries. Level Zero offers fine-grain control over accelerators
 capabilities, delivering a simplified and low-latency interface to
 hardware, and efficiently exposing hardware capabilities to applications.
Homepage: https://github.com/oneapi-src/level-zero
Original-Maintainer: Debian OpenCL Maintainers <pkg-opencl-devel@lists.alioth.debian.org>
Package: intel-opencl-icd
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 22971
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-compute-runtime
Version: 24.13.29138.29-881~22.04
Replaces: intel-opencl
Provides: opencl-icd
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.18), libstdc++6 (>= 12), ocl-icd-libopencl1, libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812)
Recommends: intel-igc-cm (>= 1.0.100)
Breaks: intel-opencl
Conffiles:
 /etc/OpenCL/vendors/intel.icd d0a34d0b4f75385c56ee357bb1b8e2d0
Description: Intel graphics compute runtime for OpenCL
 The Intel(R) Graphics Compute Runtime for OpenCL(TM) is a open source
 project to converge Intel's development efforts on OpenCL(TM) compute
 stacks supporting the GEN graphics hardware architecture.
 .
 Supported platforms:
 - Intel Core Processors with Gen8 GPU (Broadwell) - OpenCL 2.1
 - Intel Core Processors with Gen9 GPU (Skylake, Kaby Lake, Coffee Lake) - OpenCL 2.1
 - Intel Atom Processors with Gen9 GPU (Apollo Lake, Gemini Lake) - OpenCL 1.2
 - Intel Core Processors with Gen11 GPU (Ice Lake) - OpenCL 2.1
 - Intel Core Processors with Gen12 graphics devices (formerly Tiger Lake) - OpenCL 2.1
Homepage: https://github.com/intel/compute-runtime
Original-Maintainer: Debian OpenCL Maintainers <pkg-opencl-devel@lists.alioth.debian.org>
Package: level-zero
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 1514
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: level-zero-loader
Version: 1.16.15-881~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 11)
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
 Level Zero is the primary low-level interface for language and runtime
 libraries. Level Zero offers fine-grain control over accelerators
 capabilities, delivering a simplified and low-latency interface to
 hardware, and efficiently exposing hardware capabilities to applications.
 .
 This package provides the loader for oneAPI Level Zero compute runtimes.
Homepage: https://github.com/oneapi-src/level-zero
Package: libigc1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 88209
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.16510.19-881~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 12), zlib1g (>= 1:1.2.2)
Description: Intel graphics compiler for OpenCL -- core libs
 The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
 for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
 .
 This package includes the core libraries.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <pkg-opencl-devel@lists.alioth.debian.org>
Package: libigdfcl1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 116119
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.16510.19-881~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 11), zlib1g (>= 1:1.2.0), libz3-4 (>= 4.7.1)
Description: Intel graphics compiler for OpenCL -- OpenCL library
 The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
 for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
 .
 This package includes the library for OpenCL.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <pkg-opencl-devel@lists.alioth.debian.org>
Package: libigdgmm12
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 648
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Multi-Arch: same
Source: intel-gmmlib
Version: 22.3.18-857~22.04
Replaces: libigdgmm11
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 4.1.1)
Description: Intel Graphics Memory Management Library -- shared library
 The Intel Graphics Memory Management Library provides device specific
 and buffer management for the Intel Graphics Compute Runtime for
 OpenCL and the Intel Media Driver for VAAPI.
 .
 This library is only useful for Broadwell and newer CPUs.
 .
 This package includes the shared library.
Homepage: https://github.com/intel/gmmlib
Original-Maintainer: Debian Multimedia Maintainers <debian-multimedia@lists.debian.org>
Check Intel GPU Driver Passsed

Check OneAPI
   3091371:     find library=libsycl.so.7 [0]; searching
   3091371:       trying file=/opt/intel/oneapi/vpl/2023.0.0/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/mkl/2024.2/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/itac/2022.0/slib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/ippcp/2021.12/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/ipp/2021.12/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/dpl/2022.6/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/dnnl/2024.2/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/debugger/2024.2/opt/debugger/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/dal/2024.5/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/compiler/2024.2/opt/compiler/lib/libsycl.so.7
   3091371:       trying file=/opt/intel/oneapi/compiler/2024.2/lib/libsycl.so.7
   3091371:     calling init: /opt/intel/oneapi/compiler/2024.2/lib/libsycl.so.7
   3091371:     calling fini: /opt/intel/oneapi/compiler/2024.2/lib/libsycl.so.7 [0]
        Intel(R) OneAPI DPC++/C++ Compiler is Installed.
Recommended dpcpp version is 2024.1.0-963
   3091371:     find library=libmkl_sycl_blas.so.4 [0]; searching
   3091371:       trying file=/opt/intel/oneapi/vpl/2023.0.0/lib/libmkl_sycl_blas.so.4
   3091371:       trying file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libmkl_sycl_blas.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/libmkl_sycl_blas.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/lib/libmkl_sycl_blas.so.4
   3091371:       trying file=/opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_blas.so.4
   3091371:     calling init: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_blas.so.4
   3091371:     calling fini: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_blas.so.4 [0]
   3091371:     find library=libmkl_sycl_lapack.so.4 [0]; searching
   3091371:       trying file=/opt/intel/oneapi/vpl/2023.0.0/lib/libmkl_sycl_lapack.so.4
   3091371:       trying file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libmkl_sycl_lapack.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/libmkl_sycl_lapack.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/lib/libmkl_sycl_lapack.so.4
   3091371:       trying file=/opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_lapack.so.4
   3091371:     calling init: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_lapack.so.4
   3091371:     calling fini: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_lapack.so.4 [0]
   3091371:     find library=libmkl_sycl_dft.so.4 [0]; searching
   3091371:       trying file=/opt/intel/oneapi/vpl/2023.0.0/lib/libmkl_sycl_dft.so.4
   3091371:       trying file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libmkl_sycl_dft.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/opt/mpi/libfabric/lib/libmkl_sycl_dft.so.4
   3091371:       trying file=/opt/intel/oneapi/mpi/2021.13/lib/libmkl_sycl_dft.so.4
   3091371:       trying file=/opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_dft.so.4
   3091371:     calling init: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_dft.so.4
   3091371:     calling fini: /opt/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_dft.so.4 [0]
        Intel(R) OneAPI Math Kernel Library is Installed.
Recommended onemkl version is 2024.1.0-691
Check OneAPI Passed

Check Tensorflow Requirements

Check Intel(R) Extension for TensorFlow* Requirements Passed
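
Every check in the output above reports "Passed", yet the TensorFlow import log still contains the device-detection errors, so the summary lines alone do not show whether devices were found. Here is a minimal sketch of scanning captured output for those messages. The function name and the idea of holding the output in a string are assumptions for illustration; the marker strings are copied from the log in this thread.

```python
# Marker strings copied verbatim from the log output in this thread.
DEVICE_ERROR_MARKERS = (
    "Can not found any devices",
    "No visible XPU devices",
)


def find_device_errors(log_text):
    """Return the log lines that contain a known device-detection error."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(marker in line for marker in DEVICE_ERROR_MARKERS)
    ]


if __name__ == "__main__":
    # Shortened sample built from lines in the env_check.py output.
    sample = (
        "I itex/core/wrapper/itex_gpu_wrapper.cc:38] "
        "Intel Extension for Tensorflow* GPU backend is loaded.\n"
        "E itex/core/kernels/xpu_kernel.cc:60] "
        "Failed precondition: No visible XPU devices.\n"
        "Check Tensorflow Passed\n"
    )
    for hit in find_device_errors(sample):
        print(hit)
```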

wangkl2 commented 3 months ago

@ch0801 According to your env check output, the XPU devices are still not detected by the framework at runtime. ITEX v2.15.0.0 typically works with oneAPI Base Toolkit 2024.1, while you have oneAPI 2024.2 installed. However, either the 2024.1 or the 2024.2 dpcpp/mkl works on my side, with PVC devices detected. Let me ping you internally to look into this issue.
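
The version difference mentioned here can be read off the env_check output: the loader resolves `libsycl.so.7` under `compiler/2024.2`, while the tool recommends dpcpp 2024.1.0-963. Below is a rough sketch of that comparison. The helper names and the regular expressions are illustrative assumptions, not part of env_check.py. Since both versions worked on the maintainer's side, a mismatch found this way is a diagnostic hint rather than a confirmed cause.

```python
import re


def major_minor(version):
    """Extract (major, minor) from strings such as '2024.1.0-963' or '2024.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_component_version(library_path, component):
    """Pull the version directory out of a oneAPI path of the form
    /opt/intel/oneapi/<component>/<version>/lib/<library>."""
    match = re.search(rf"/oneapi/{re.escape(component)}/([^/]+)/", library_path)
    return match.group(1) if match else None


# Values taken from the env_check.py output in this thread.
loaded = "/opt/intel/oneapi/compiler/2024.2/lib/libsycl.so.7"
recommended = "2024.1.0-963"

installed = installed_component_version(loaded, "compiler")
if major_minor(installed) != major_minor(recommended):
    print(f"Installed compiler {installed} differs from recommended {recommended}")
```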

Millionarc commented 1 month ago

Has there been a fix? I am running into the same issue.

wangkl2 commented 1 month ago

@Millionarc Please provide your output for running the env_check.py.

wangkl2 commented 1 month ago

I worked with @ch0801's team and confirmed that the issue is now resolved. The XPU devices can be detected and the workloads run on the GPUs. Closing this issue.

intel/intel-extension-for-tensorflow, issue #75