Open raevillena opened 5 months ago
Update: OK, so I was able to solve it.
After reading all the issues and documents I could find, here is what I did, starting from a reboot of WSL.
In your terminal, do this without activating the virtual environment:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
if you have fp64 issues do this too
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
then forcefully source your vars (still in the main environment)
source /opt/intel/oneapi/setvars.sh --force
You may now activate the conda environment and set all of the variables again:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
you may check using
printenv
this will list the variables in the conda environment
Then you may start using TF. In my case:
jupyter notebook
all this happened without reinstallation of my system.
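The workaround above depends on three variables being present in the shell that launches TensorFlow. A minimal Python sketch for checking that from inside the conda environment follows; the helper name `missing_itex_vars` is made up for illustration, while the variable names and values are the ones from this comment:

```python
import os

# Variables from the workaround above; the values are the ones used in this thread.
EXPECTED_VARS = {
    "ITEX_LIMIT_MEMORY_SIZE_IN_MB": "1024",
    "OverrideDefaultFP64Settings": "1",
    "IGC_EnableDPEmulation": "1",
}


def missing_itex_vars(environ=None):
    """Return the names of expected variables that are unset or set differently."""
    env = os.environ if environ is None else environ
    return [name for name, value in EXPECTED_VARS.items() if env.get(name) != value]


if __name__ == "__main__":
    missing = missing_itex_vars()
    if missing:
        print("Not set as in the workaround:", ", ".join(missing))
    else:
        print("All workaround variables are set.")
```

This only inspects the environment; it does not replace sourcing setvars.sh, which has to happen in the shell before Python starts.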
@raevillena Can you help to check if our latest weekly release still has this issue? thanks.
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
I just tried right now without exporting any of the env variables I mentioned above, but it still gives me
NotFoundError: libsycl.so.7: cannot open shared object file: No such file or directory
This can be solved using source /opt/intel/oneapi/setvars.sh --force
I then tried solving it with just setvars, without setting the memory limit, but no: the memory bug is still there.
The fp64 emulation, however, now works without setting the env variables.
Can you help to share the result of
pip list | grep intel_extension_for_tensorflow
Hi here it is,
(itex) rae@DESKTOP-URAMFL5:~$ pip list | grep intel_extension_for_tensorflow
intel_extension_for_tensorflow 2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
intel_extension_for_tensorflow_lib_weekly 2.15.0.1.2.dev20240603
intel_extension_for_tensorflow_weekly 2.15.0.1.dev2024060
Is there another step needed so that the newer library gets used by default, or was that it?
please help to remove the "intel_extension_for_tensorflow" and "intel_extension_for_tensorflow_lib"
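The pip list output above shows the stable and weekly ITEX packages installed side by side, which is what the request to remove the stable ones addresses. A small sketch of how that situation could be detected with the standard library; the function name `conflicting_itex_packages` and the "ends with _weekly" rule are assumptions based only on the package names shown in this thread:

```python
from importlib import metadata

PREFIX = "intel_extension_for_tensorflow"


def normalize(name):
    """Normalize a distribution name the way pip list displays these packages."""
    return name.lower().replace("-", "_")


def conflicting_itex_packages(names):
    """Return stable ITEX package names when weekly builds are installed too."""
    itex = sorted({normalize(n) for n in names if normalize(n).startswith(PREFIX)})
    weekly = [n for n in itex if n.endswith("_weekly")]
    stable = [n for n in itex if not n.endswith("_weekly")]
    return stable if weekly else []


if __name__ == "__main__":
    installed = [d.metadata["Name"] for d in metadata.distributions()]
    to_remove = conflicting_itex_packages([n for n in installed if n])
    if to_remove:
        print("Stable packages installed next to weekly builds:", to_remove)
    else:
        print("No stable/weekly mix found.")
```

With the four names from the listing above, the function returns the two stable packages that were asked to be removed.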
Hi, can I test that after doing some modelling first? It works for now (and sometimes it doesn't).
I can already tell that after the update the GPU memory is used, but the processing happens on the CPU. CPU usage went up to 100% with 0% on the GPU, whereas the original build used the GPU as XPU. But let me restart WSL to confirm everything. My models went from 5 sec of training per epoch to 130 sec, which is not what I expect.
The update was no longer using the GPU, though.
this line was no longer in the logs
[2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
so it was purely using cpu now
I removed all ITEX packages and installed just the weekly build. The GPU gets mounted again, but all the errors came back with it too. Back to zero.
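The check described above (whether the "GPU backend is loaded" line appears in the startup log) can be automated on captured log text. This is only a sketch: `gpu_backend_loaded` is a hypothetical helper, and it matches the exact wording of the log lines quoted in this thread, which may differ in other ITEX versions:

```python
def gpu_backend_loaded(log_text):
    """Return True if an ITEX GPU wrapper line reports that the GPU backend loaded."""
    return any(
        "itex_gpu_wrapper" in line and "GPU backend is loaded" in line
        for line in log_text.splitlines()
    )


if __name__ == "__main__":
    sample = (
        "[2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] "
        "Intel Extension for Tensorflow* GPU backend is loaded."
    )
    print(gpu_backend_loaded(sample))
```

Inside a running session, listing devices with tf.config.list_physical_devices("XPU") is another way to see whether the plugin device was registered.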
Hi @raevillena How can I reproduce your issue ?
Hi @feng-intel this is the summary
Hardware setup:
Ubuntu 22.04 on WSL2
Host: Windows 11 enterprise
32Gb ram 3600ddr4
AMD 5700x CPU
Intel Arc A750 8GB
wsl2:
ubuntu22.04 official distro
(this runs on the special Microsoft kernel)
running uname -r
5.15.153.1-microsoft-standard-WSL2
from fresh installation: following steps here: https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key |
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update
then
sudo apt-get install \
intel-igc-cm \
intel-level-zero-gpu \
intel-opencl-icd \
level-zero \
libigc1 \
libigdfcl1 \
libigdgmm12
I needed to install the whole oneAPI Base Toolkit because I needed to source setvars.sh
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh
then
source /opt/intel/oneapi/setvars.sh
setting up my conda environment: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_gpu_conda.html
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update conda
conda create -n itex -c intel intelpython3_full python=3.9
#I removed the version orig: conda create -n itex -c intel intelpython3_full==2023.2.0 python=3.9
activated my conda
conda activate itex
proceeded as documented
pip install --upgrade pip
pip install tensorflow==2.15.0
pip install intel-extension-for-tensorflow[xpu]
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
export path_to_site_packages=`python -c "import site; print(site.getsitepackages()[0])"`
bash ${path_to_site_packages}/intel_extension_for_tensorflow/tools/env_check.sh
but the output says there is no such file or directory for env_check.sh, because it isn't included in the latest version
then install the jupyter using these from here: https://www.intel.com/content/www/us/en/developer/articles/technical/running-tensorflow-stable-diffusion-on-intel-arc.html
pip install notebook
pip install keras tensorflow-datasets matplotlib ipywidgets
jupyter notebook
here is the sample model
import tensorflow as tf
base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)
model_5.compile(loss='categorical_crossentropy',
optimizer=tf.keras.optimizers.Adam(),
metrics=["accuracy"])
history5 = model_5.fit(train_data_50_test,
epochs=10,
steps_per_epoch=len(train_data_50_test),
validation_data=val_data_50_test,
validation_steps=int(0.5 * len(val_data_50_test)))
Maybe you have data on your side; I cannot provide my own.
Is there something I didn't mention, other than the exact logs? I don't want to redo the setup; in the meantime I have switched to using the CPU while waiting for progress on this.
@raevillena ,
thank you a lot for the details, if your environment is still on, could you please download the env_check.py using:
wget https://raw.githubusercontent.com/intel/intel-extension-for-tensorflow/v2.15.0.0/tools/python/env_check.py
and run it and let us know the output? python env_check.py
Thanks
Hi @yinghu5
here is the result
(itex) rae@DESKTOP-URAMFL5:~$ python env_check.py
Check Environment for Intel(R) Extension for TensorFlow*...
Check Python
Python 3.9.19 is Supported.
Check Python Passed
Check OS
OS ubuntu:22.04 is Supported
Check OS Passed
Check Tensorflow
Tensorflow 2.15.0 is installed.
Check Tensorflow Passed
Check Intel GPU Driver
Package: intel-level-zero-gpu
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 28239
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-compute-runtime
Version: 1.3.27642.52-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.15), libstdc++6 (>= 12), libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812), libnl-3-200, libnl-route-3-200
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
Level Zero is the primary low-level interface for language and runtime
libraries. Level Zero offers fine-grain control over accelerators
capabilities, delivering a simplified and low-latency interface to
hardware, and efficiently exposing hardware capabilities to applications.
Homepage: https://github.com/oneapi-src/level-zero
Original-Maintainer: Debian OpenCL Maintainers <pkg-opencl-devel@lists.alioth.debian.org>
Package: intel-opencl-icd
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 23865
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-compute-runtime
Version: 23.43.27642.52-803~22.04
Replaces: intel-opencl
Provides: opencl-icd
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libigdgmm12 (>= 22.3.15), libstdc++6 (>= 12), ocl-icd-libopencl1, libigc1 (>= 1.0.12812), libigdfcl1 (>= 1.0.12812)
Recommends: intel-igc-cm (>= 1.0.100)
Breaks: intel-opencl
Conffiles:
/etc/OpenCL/vendors/intel.icd d0a34d0b4f75385c56ee357bb1b8e2d0
Description: Intel graphics compute runtime for OpenCL
The Intel(R) Graphics Compute Runtime for OpenCL(TM) is a open source
project to converge Intel's development efforts on OpenCL(TM) compute
stacks supporting the GEN graphics hardware architecture.
.
Supported platforms:
- Intel Core Processors with Gen8 GPU (Broadwell) - OpenCL 2.1
- Intel Core Processors with Gen9 GPU (Skylake, Kaby Lake, Coffee Lake) - OpenCL 2.1
- Intel Atom Processors with Gen9 GPU (Apollo Lake, Gemini Lake) - OpenCL 1.2
- Intel Core Processors with Gen11 GPU (Ice Lake) - OpenCL 2.1
- Intel Core Processors with Gen12 graphics devices (formerly Tiger Lake) - OpenCL 2.1
Homepage: https://github.com/intel/compute-runtime
Original-Maintainer: Debian OpenCL Maintainers <pkg-opencl-devel@lists.alioth.debian.org>
Package: level-zero
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 1049
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: level-zero-loader
Version: 1.14.0-744~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 11)
Description: Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
Level Zero is the primary low-level interface for language and runtime
libraries. Level Zero offers fine-grain control over accelerators
capabilities, delivering a simplified and low-latency interface to
hardware, and efficiently exposing hardware capabilities to applications.
.
This package provides the loader for oneAPI Level Zero compute runtimes.
Homepage: https://github.com/oneapi-src/level-zero
Package: libigc1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 86364
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.15468.29-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 12), zlib1g (>= 1:1.2.2)
Description: Intel graphics compiler for OpenCL -- core libs
The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
.
This package includes the core libraries.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <pkg-opencl-devel@lists.alioth.debian.org>
Package: libigdfcl1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 116046
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Source: intel-graphics-compiler
Version: 1.0.15468.29-803~22.04
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.4), libstdc++6 (>= 11), zlib1g (>= 1:1.2.0), libz3-4 (>= 4.7.1)
Description: Intel graphics compiler for OpenCL -- OpenCL library
The Intel(R) Graphics Compiler for OpenCL(TM) is an llvm based compiler
for OpenCL(TM) targeting Intel Gen graphics hardware architecture.
.
This package includes the library for OpenCL.
Homepage: https://github.com/intel/intel-graphics-compiler
Original-Maintainer: Debian OpenCL team <pkg-opencl-devel@lists.alioth.debian.org>
Package: libigdgmm12
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 648
Maintainer: Intel Graphics Team <linux-graphics@intel.com>
Architecture: amd64
Multi-Arch: same
Source: intel-gmmlib
Version: 22.3.15-803~22.04
Replaces: libigdgmm11
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.3.1), libstdc++6 (>= 4.1.1)
Description: Intel Graphics Memory Management Library -- shared library
The Intel Graphics Memory Management Library provides device specific
and buffer management for the Intel Graphics Compute Runtime for
OpenCL and the Intel Media Driver for VAAPI.
.
This library is only useful for Broadwell and newer CPUs.
.
This package includes the shared library.
Homepage: https://github.com/intel/gmmlib
Original-Maintainer: Debian Multimedia Maintainers <debian-multimedia@lists.debian.org>
Check Intel GPU Driver Passsed
Check OneAPI
Can't find dpcpp
Check OneAPI Failed
at the same time:
(itex) rae@DESKTOP-URAMFL5:~$ sudo apt install intel-oneapi-runtime-dpcpp-cpp
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
intel-oneapi-runtime-dpcpp-cpp is already the newest version (2024.2.0-981).
0 upgraded, 0 newly installed, 0 to remove and 47 not upgraded.
Please enlighten me also: it works as long as I export these environment variables every time I open my WSL instance
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
source /opt/intel/oneapi/setvars.sh --force
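Since these four lines have to be repeated in every new WSL session, one option is to bundle them into a single launch command. The sketch below only builds the command string; `build_launch_command` is a hypothetical helper, and it assumes bash and the default oneAPI install path used in this thread:

```python
import shlex

# Path and variables taken from the commands above.
SETVARS = "/opt/intel/oneapi/setvars.sh"
WORKAROUND_VARS = {
    "OverrideDefaultFP64Settings": "1",
    "IGC_EnableDPEmulation": "1",
    "ITEX_LIMIT_MEMORY_SIZE_IN_MB": "1024",
}


def build_launch_command(argv):
    """Build one bash command line: export the variables, source setvars, run argv."""
    exports = [f"export {name}={shlex.quote(value)}" for name, value in WORKAROUND_VARS.items()]
    steps = exports + [
        f"source {shlex.quote(SETVARS)} --force",
        "exec " + shlex.join(argv),
    ]
    return " && ".join(steps)


if __name__ == "__main__":
    # Printing only; passing this string to `bash -c` would actually run it.
    print(build_launch_command(["jupyter", "notebook"]))
```

A common shell-only alternative is to append the same four lines to ~/.bashrc so that every new console picks them up.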
Hi @raevillena
thank you!
From the result, it seems you have two versions of oneAPI in the environment: dpcpp 2024.1 and the newest dpcpp-cpp runtime (2024.2.0-981).
Check OneAPI
Can't find dpcpp
Check OneAPI Failed
I recall, you had installed it
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh
The current ITEX 2.15.0 was tested with oneAPI 2024.1. Could you please remove the intel-oneapi-runtime-dpcpp-cpp package that was installed with apt, then run $ source /opt/intel/oneapi/setvars.sh --force, $ icx -V and $ sycl-ls and show the output? Then run env_check again and see whether the OneAPI error is gone.
Second, i saw Breaks: intel-opencl Conffiles: /etc/OpenCL/vendors/intel.icd d0a34d0b4f75385c56ee357bb1b8e2d0 Description: Intel graphics compute runtime for OpenCL The Intel(R) Graphics Compute Runtime for OpenCL(TM) is a open source project to converge Intel's development efforts on OpenCL(TM) compute stacks supporting the GEN graphics hardware architecture. . Supported platforms:
third, about new ITEX version etc, like Guizi mentioned, try the next release.
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
and could you please try the simple hello-world program and show the result?
$ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/TensorFlow_HelloWorld.py
$ python TensorFlow_HelloWorld.py
Hello @yinghu5
a) Yes, I installed a newer dpcpp, but I had encountered the same results even before installing it; I installed it to see whether that was the only reason.
b)
(itex) rae@DESKTOP-URAMFL5:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 7 5700X 8-Core Processor OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x56a1] OpenCL 3.0 NEO [23.43.27642.52]
[opencl:cpu:3] Intel(R) OpenCL, AMD Ryzen 7 5700X 8-Core Processor OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x56a1] 1.3 [1.3.27642]
(itex) rae@DESKTOP-URAMFL5:~$ icx -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
c) Is the AMD platform not included in your tested devices?
d)
$ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Getting-Started-Samples/IntelTensorFlow_GettingStarted/TensorFlow_HelloWorld.py
$ python TensorFlow_HelloWorld.py
As I said in the previous replies, running simple commands does not result in this error. I could even run a simple 10-epoch transfer learning of the EfficientNetB0 model (this one is lighter than VGG16).
Hi @raevillena,
a) How about after $ source /opt/intel/oneapi/setvars.sh --force, does the error still persist?
b) It seems the oneAPI environment works fine on your machine.
c) Right, it is not included in the validated devices.
d) Do you have the output? What is the output, and does it include any GPU log information?
Thanks
hello @yinghu5
a) how about after $ source /opt/intel/oneapi/setvars.sh --force, does the error still persist?
Nope, that is the fix that works for now.
Using this on each run solves most of the problem:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
source /opt/intel/oneapi/setvars.sh --force
b) it seems oneAPI environment works fine in your machine
I think so too. It works after forcing setvars, but it doesn't persist after a restart of the instance or console.
c) Right, it is not included in the validated devices.
cannot complain about that
d) do you have output? what is output, does the it include some gpu log information?
Nope. After sourcing the vars, the only logs it echoes are the ones from initialization. After that it works as intended.
Got it, thanks for understanding :).
About d), do you see a log like the one below?
Here is the log when the code runs on the CPU and cannot use the GPU:
(itex214) ~$ python hello_tf.py
2024-07-08 16:53:09.912092: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
2024-07-08 16:53:09.949268: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:53:10.101459: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-08 16:53:10.101523: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-08 16:53:10.102364: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-08 16:53:10.198395: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:53:10.199198: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-08 16:53:10.841688: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-08 16:53:11.435196: W itex/core/wrapper/itex_gpu_wrapper.cc:32] Could not load dynamic library: libimf.so: cannot open shared object file: No such file or directory
2024-07-08 16:53:11.544927: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2024-07-08 16:53:11.576732: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.577016: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
WARNING:tensorflow:From /home/yhu5/miniconda3/envs/itex214/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-07-08 16:53:11.725437: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.725575: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-07-08 16:53:11.725890: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-07-08 16:53:11.728693: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
0 0.43929783
1 0.36791593
2 0.34823328
3 0.33959246
4 0.33490422
Tensorflow HelloWorld Done!
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]
And if I run
source /opt/intel/oneapi/mkl/2024.1/env/vars.sh
source /opt/intel/oneapi/compiler/2024.1/env/vars.sh
then the same code runs on the GPU:
(itex214) yhu5@arc770-tce:~$ python hello_tf.py
2024-07-08 16:56:20.460059: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
2024-07-08 16:56:20.461009: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:56:20.476518: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-08 16:56:20.476533: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-08 16:56:20.476546: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-08 16:56:20.479603: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-08 16:56:20.479713: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-08 16:56:20.833677: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-08 16:56:21.834356: I *itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow GPU backend is loaded.*
2024-07-08 16:56:21.856422: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow AVX2 CPU backend is loaded.
2024-07-08 16:56:21.982129: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-07-08 16:56:21.982435: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-07-08 16:56:21.982447: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
WARNING:tensorflow:From /home/yhu5/miniconda3/envs/itex214/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2024-07-08 16:56:22.085860: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.085881: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-08 16:56:22.085890: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id:
Hi @raevillena
is there any update?
Back to the original question: the GPU memory limit is ~7.4G and the CPU memory limit is about 14.9G. I investigated further, and it seems the memory bug is still caused by OOM of the GPU memory.
Below is my test code. Could you please try it (without setting any environment variables), changing datasize to 100 or 1000, and show the output?
On my machine with 8G of GPU memory, the code below runs with datasize=100 but fails with datasize=1000. It even fails very early, in preprocess_image_input at the line return tf.image.resize(output_ims, [224, 224]).
Log info (ran out of memory on XPU):
2024-07-09 11:16:07.554374: W external/tsl/tsl/framework/bfc_allocator.cc:500] Allocator (XPU_0_bfc) ran out of memory trying to allocate 573.64MiB (rounded to 601509888)requested by op ....
2024-07-09 11:10:44.057305: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982310912 memorylimit: 6039281664 available bytes: 5056970752 curr_region_allocationbytes: 12078563328
Thanks
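The numbers in that log can be cross-checked with a little arithmetic. The sketch below assumes the failing allocation is the float32 output of tf.image.resize for the 999 images selected by x_train[1:1000]; the helper name is made up for illustration:

```python
def float32_image_batch_bytes(n_images, height=224, width=224, channels=3):
    """Bytes needed for a batch of float32 images (4 bytes per value)."""
    return n_images * height * width * channels * 4


# x_train[1:datasize] with datasize=1000 selects 999 images.
n_images = len(range(1, 1000))
requested = float32_image_batch_bytes(n_images)
print(requested)                      # 601509888, the "rounded to" value in the log
print(round(requested / 2**20, 2))    # 573.64 (MiB), as in the log

# Pool figures copied from the bfc_allocator log line above.
memory_limit = 6039281664
total_in_pool = 982310912
print(memory_limit - total_in_pool)   # 5056970752, the logged "available bytes"
```

So the request matches a single resized batch and is much smaller than the bytes the log reports as available, which is consistent with the failure coming from how the pool is extended rather than from the raw size of the tensor.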
python vgg16.py (without any environment variable)
import tensorflow as tf
from keras.datasets import cifar10
from keras.utils import to_categorical
(x_train, y_train), (x_test, y_test) = cifar10.load_data() # x_train - training data(images), y_train - labels(digits)
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
def preprocess_image_input(input_images):
#input_images = input_images.astype('float32')
output_ims = tf.keras.applications.vgg16.preprocess_input(input_images)
print('output_ims:', output_ims.shape)
return tf.image.resize(output_ims, [224, 224])
nb_classes = 10
datasize=100
y_train = to_categorical(y_train[1:datasize], nb_classes)
y_test = to_categorical(y_test[1:datasize], nb_classes)
print ("Train shape", x_train.shape, y_train.shape)
train_data_50_test=preprocess_image_input(x_train[1:datasize])
val_data_50_test = preprocess_image_input(x_test[1:datasize])
print ("================model training=============")
base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(nb_classes, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)
model_5.compile(loss='categorical_crossentropy',
optimizer=tf.keras.optimizers.Adam(),
metrics=["accuracy"])
history5 = model_5.fit(train_data_50_test,y_train,
epochs=2,verbose=1,
batch_size=10,
# steps_per_epoch=len(train_data_50_test),
validation_data=(val_data_50_test, y_test),
# validation_steps=int(0.5 * len(val_data_50_test))
)
#result = model.evaluate(val_data_50_test)
datasize = 100 output
2024-07-09 11:03:48.833394: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-09 11:03:48.834617: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-09 11:03:48.834660: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-09 11:03:49.216499: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) Level-Zero
2024-07-09 11:03:49.216817: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2024-07-09 11:03:49.220242: I external/xla/xla/service/service.cc:168] XLA service 0x5579164cd630 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2024-07-09 11:03:49.220280: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Graphics [0x9a49], <undefined>
2024-07-09 11:03:49.221860: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2024-07-09 11:03:49.222135: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2024-07-09 11:03:49.225126: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2024-07-09 11:03:49.225174: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 6039281664 bytes on device 0 for BFCAllocator.
2024-07-09 11:03:49.227397: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Train shape (50000, 32, 32, 3) (99, 10)
output_ims: (99, 32, 32, 3)
2024-07-09 11:03:50.833135: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
output_ims: (99, 32, 32, 3)
2024-07-09 11:04:14.986917: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled.
10/10 [==============================] - 124s 1s/step - loss: 4.2215 - accuracy: 0.1818 - val_loss: 4.3642 - val_accuracy: 0.1616
datasize=1000
`2024-07-09 11:10:33.292887: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-07-09 11:10:33.293279: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-07-09 11:10:33.293315: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-07-09 11:10:33.311334: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) Level-Zero
2024-07-09 11:10:33.311711: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2024-07-09 11:10:33.319263: I external/xla/xla/service/service.cc:168] XLA service 0x560623f3a810 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2024-07-09 11:10:33.319314: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Graphics [0x9a49],
2024-07-09 11:10:44.057461: W external/tsl/tsl/framework/bfcallocator.cc:512] ***____ Segmentation fault (itex) yhu5@rajeshch-desk89:~$`
hi @yinghu5
the hello runs fine
2024-07-09 14:12:47.836324: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
2024-07-09 14:12:47.836850: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
0 0.40498763
1 0.3569235
2 0.34184435
3 0.33502558
4 0.33131737
Tensorflow HelloWorld Done!
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]
unfortunately I cannot run the code you did. it throws all errors on my side and I just gave up after the 10th error
Sorry, I changed the file format in my last comment; please try again. The format is like
hi @yinghu5 sorry, maybe I can't run it because I have upgraded my Keras version to v3. It has been giving me incompatibilities in the data preparation stage, with shape mismatches. I can't troubleshoot it or replicate your Linux instance with the same specs for now, since I am doing some academic experiments. I'll do that after some time.
It looks like an OOM error.
what is consistent is that it tries to allocate curr region allocation bytes: 14975071232; that value was very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?
Let me explain why it always tries to allocate such a consistent byte count, 14975071232. ITEX has a memory allocator that creates the runtime memory pool. The pool is:
- Initialized as (total_memory_size - reserved_memory_size) * 0.75. You can see the original memory_limit_: 7487535513 in the early log; that's it. NOTE: the reserved_memory_size is for some HW internal data, and the ratio 0.75 is inherited from the public community to ensure this process won't exhaust all HW resources.
- Extended to current_size * 2 if any allocation fails. You can see that the extended size curr_region_allocation_bytes_ is always ~2x the original size memory_limit_; that's the reason.
Based on the above logic, it's easy to fail when the extending operation is triggered. The question is, why is the extension triggered even though the memory pool still has free space? Maybe the allocation needs more space, or the pool is fragmented.
We will have a deeper look and give more info later, thanks!
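The arithmetic described above can be sketched in a few lines. This is only an illustration of the explanation in this comment, not ITEX's actual code: the function names are hypothetical, and the 256-byte rounding is an assumption (it is not stated in this thread) that happens to reproduce the logged value exactly.

```python
# Sketch of the pool sizing logic described above. Function names are
# hypothetical; the 256-byte rounding is an assumption, not from the thread.
ALIGNMENT = 256  # assumed allocation granularity in bytes


def initial_pool_limit(total_memory_size, reserved_memory_size, ratio=0.75):
    """Initial pool size: (total - reserved) * 0.75, as described above."""
    return int((total_memory_size - reserved_memory_size) * ratio)


def round_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


def extended_region_bytes(memory_limit):
    """Size requested when the pool tries to extend: 2x the rounded limit."""
    return round_up(memory_limit) * 2


memory_limit = 7487535513  # memory_limit_ from the early log
print(extended_region_bytes(memory_limit))  # 14975071232
```

Under these assumptions the doubled request matches the curr region allocation bytes value from the log, which is about twice what the device pool was initialized with, so the extension cannot succeed.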
Thanks for that in-depth information! I'll fully cooperate on this later too. Thanks!
Right, FYI, I'm using intel_extension_for_tensorflow 2.15.0.0, intel_extension_for_tensorflow_lib 2.15.0.0.2, and keras 2.15.0.
thanks
I am having a memory issue when running training. Everything works, except that training on bigger data crashes the Jupyter notebook kernel.
System Desktop
Setup: miniconda3 on itex environment
Running model fit with the training data results in the following (especially with VGG; ResNet works fine):
It crashes no matter what I do when it tries to allocate that 14 GB in curr_region_allocation.
Global mem shows:
btw my version of ITEX didn't come with check_env.sh, so I can't run that. I just know it works because it does, and then it doesn't.
In Jupyter the device is recognized as this
Also, in the other setups I've read about with BFC allocator issues, the allocator is the one that ships with TensorFlow, while mine comes from the ITEX build files.
I can see that the repo is available for rebuilding, and there might be a chance to find out what is happening there, but I don't have the time or ability to do so.
I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but was unable to expand it.
I also tried exporting this in the conda environment, with no effect:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096
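One way to check whether the export actually reached the Jupyter kernel process is to read the environment from inside the notebook. A minimal sketch, assuming it runs in a notebook cell; the helper name report_env is hypothetical, and the variable names are the ones mentioned in this thread.

```python
import os

# Variables mentioned in this thread; extend the tuple as needed.
NAMES = (
    "ITEX_LIMIT_MEMORY_SIZE_IN_MB",
    "OverrideDefaultFP64Settings",
    "IGC_EnableDPEmulation",
)


def report_env(names=NAMES, environ=None):
    """Return {name: value or None} for each variable (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


for name, value in report_env().items():
    print(f"{name}={value if value is not None else '<not set>'}")
```

If a variable prints as not set here, the kernel was started from a shell that never had it exported, so the setting cannot have any effect on the training run.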
I said earlier that it works: yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.
What is consistent is that it tries to allocate curr region allocation bytes: 14975071232; that value was very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?