Open raevillena opened 4 days ago
Update: I was able to solve it.
After reading all the issues and documents I could find, here is what I did, starting from a reboot of WSL.
In your terminal, without activating the virtual environment, run:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
if you have FP64 issues, also run:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
then forcefully source the oneAPI vars (still in the base environment):
source /opt/intel/oneapi/setvars.sh --force
You may now activate the conda environment and set all of the variables again:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
You can check with:
printenv
which lists the environment variables visible in the conda environment.
Then you can start using TF. In my case:
jupyter notebook
All of this worked without reinstalling my system.
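Putting the steps above together, the whole workaround can be sketched as one shell session (the paths and the 1024 MB limit come from my setup; adjust them for yours):

```shell
# Run from a fresh WSL shell, BEFORE activating the conda environment.
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024    # cap the ITEX allocator
export OverrideDefaultFP64Settings=1        # only needed for FP64 issues
export IGC_EnableDPEmulation=1              # enable FP64 emulation on Arc

# Re-source the oneAPI environment even if it was sourced before.
source /opt/intel/oneapi/setvars.sh --force

conda activate itex

# Re-export inside the conda environment and verify they are visible.
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024
printenv | grep -E 'ITEX_LIMIT_MEMORY_SIZE_IN_MB|OverrideDefaultFP64Settings|IGC_EnableDPEmulation'
```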
@raevillena Can you help to check if our latest weekly release still has this issue? thanks.
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
I just tried right now without exporting any of the env variables I mentioned above, but it still gives me:
NotFoundError: libsycl.so.7: cannot open shared object file: No such file or directory
This can be solved with source /opt/intel/oneapi/setvars.sh --force
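For anyone hitting the same NotFoundError, a quick diagnostic sketch to see whether the SYCL runtime is actually reachable (the /opt/intel/oneapi prefix is an assumption from this setup):

```shell
# Look for the SYCL runtime shipped with the oneAPI compiler.
find /opt/intel/oneapi -name 'libsycl.so*' 2>/dev/null

# Check whether the dynamic loader can resolve it after sourcing setvars.sh.
ldconfig -p | grep libsycl || echo "libsycl not in loader cache; check LD_LIBRARY_PATH"

# Inspect which oneAPI directories setvars.sh added to the library path.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i oneapi
```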
I then tried solving it with just setvars, without setting the memory limit, but no: the memory bug is still there.
The FP64 emulation, however, now works without setting the env variables.
Can you help to share the result of
pip list | grep intel_extension_for_tensorflow
Hi here it is,
(itex) rae@DESKTOP-URAMFL5:~$ pip list | grep intel_extension_for_tensorflow
intel_extension_for_tensorflow 2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
intel_extension_for_tensorflow_lib_weekly 2.15.0.1.2.dev20240603
intel_extension_for_tensorflow_weekly 2.15.0.1.dev2024060
Is there another step needed so the newer library gets used by default, or was that it?
please help to remove the "intel_extension_for_tensorflow" and "intel_extension_for_tensorflow_lib"
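Acting on that suggestion looks like this (sketch; package names are taken from the pip list output above):

```shell
# Remove the stable packages so only the weekly build remains.
pip uninstall -y intel_extension_for_tensorflow intel_extension_for_tensorflow_lib

# Confirm that only the weekly packages are left installed.
pip list | grep intel_extension_for_tensorflow
```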
Hi, can I test that after doing some modelling first? It works (consistently, not just sometimes) for now.
I can already tell the update made the GPU allocate memory but used the CPU for compute: the CPU went to 100% with 0% on the GPU, whereas the original build used the GPU as an XPU device. Let me restart WSL to confirm everything. My models went from 5 seconds per training epoch to 130 seconds, which is not what I expected.
The update was no longer using the GPU, though.
This line was no longer in the logs:
[2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
So it is now running purely on the CPU.
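A quick way to confirm which backend TensorFlow is actually using is to list the visible devices; with a working ITEX GPU backend an XPU entry should appear (sketch; the exact output depends on the install):

```python
import tensorflow as tf

# With a working ITEX GPU backend, an 'XPU' device appears in this list.
devices = tf.config.list_physical_devices()
print(devices)

xpus = tf.config.list_physical_devices('XPU')
print(f"XPU devices found: {len(xpus)}")  # 0 means training falls back to CPU
```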
I removed all ITEX packages and installed just the weekly build. The GPU gets mounted again, but all the errors came back with it too. Back to square one.
Hi @raevillena How can I reproduce your issue ?
Hi @feng-intel this is the summary
Hardware setup:
Ubuntu 22.04 on WSL2
Host: Windows 11 Enterprise
32 GB RAM (DDR4-3600)
AMD Ryzen 7 5700X CPU
Intel Arc A750 8 GB
WSL2:
official Ubuntu 22.04 distro
(this runs on Microsoft's special kernel)
running uname -r:
5.15.153.1-microsoft-standard-WSL2
From a fresh installation, following the steps here: https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update
then
sudo apt-get install \
intel-igc-cm \
intel-level-zero-gpu \
intel-opencl-icd \
level-zero \
libigc1 \
libigdfcl1 \
libigdgmm12
I needed to install the whole oneAPI Base Kit because I needed setvars.sh to source:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh
then
source /opt/intel/oneapi/setvars.sh
setting up my conda environment: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_gpu_conda.html
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update conda
conda create -n itex -c intel intelpython3_full python=3.9
# I removed the version pin; original: conda create -n itex -c intel intelpython3_full==2023.2.0 python=3.9
activated my conda
conda activate itex
proceeded as documented
pip install --upgrade pip
pip install tensorflow==2.15.0
pip install intel-extension-for-tensorflow[xpu]
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
export path_to_site_packages=`python -c "import site; print(site.getsitepackages()[0])"`
bash ${path_to_site_packages}/intel_extension_for_tensorflow/tools/env_check.sh
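Since newer wheels may not ship that script, it is safer to probe for it before running it (sketch of the same lookup with a guard):

```shell
# Locate the active site-packages directory of the current environment.
path_to_site_packages=$(python -c "import site; print(site.getsitepackages()[0])")
check_script="${path_to_site_packages}/intel_extension_for_tensorflow/tools/env_check.sh"

# Only run the check script if this ITEX version actually ships it.
if [ -f "$check_script" ]; then
    bash "$check_script"
else
    echo "env_check.sh not shipped with this version: $check_script"
fi
```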
but the output said there is no such file or directory for env_check.sh, because it doesn't exist in the latest version
then install Jupyter following these steps from here: https://www.intel.com/content/www/us/en/developer/articles/technical/running-tensorflow-stable-diffusion-on-intel-arc.html
pip install notebook
pip install keras tensorflow-datasets matplotlib ipywidgets
jupyter notebook
here is the sample model
import tensorflow as tf

# Frozen VGG16 backbone used as a feature extractor.
base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False

inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(x)  # feed the rescaled tensor, not the raw inputs
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)

model_5.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# train_data_50_test / val_data_50_test are my own tf.data datasets
history5 = model_5.fit(train_data_50_test,
                       epochs=10,
                       steps_per_epoch=len(train_data_50_test),
                       validation_data=val_data_50_test,
                       validation_steps=int(0.5 * len(val_data_50_test)))
Maybe you have data there; I cannot provide my own.
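If it helps with reproduction, a synthetic dataset with the same shapes can stand in for my data (sketch; the sample count and batch size are arbitrary placeholders):

```python
import tensorflow as tf

def make_fake_dataset(num_samples=64, batch_size=8, num_classes=3):
    """Random 224x224x3 images with one-hot labels, matching the model above."""
    images = tf.random.uniform((num_samples, 224, 224, 3))
    label_ids = tf.random.uniform((num_samples,), maxval=num_classes, dtype=tf.int32)
    labels = tf.one_hot(label_ids, depth=num_classes)
    return tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)

# Stand-ins for my train_data_50_test / val_data_50_test datasets.
train_data_50_test = make_fake_dataset()
val_data_50_test = make_fake_dataset(num_samples=16)
```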
Is there anything I didn't mention apart from the exact logs? I don't want to redo the setup; in the meantime I've switched to the CPU while waiting for progress on this.
I am having a memory issue. Everything works except that training on bigger data crashes the Jupyter notebook kernel.
System: desktop
Setup: miniconda3, itex environment
Running model.fit with the training data results in the following (especially with VGG; ResNet works fine):
It crashes no matter what I do when it tries to allocate that 14 GB in curr_region_allocation.
Global mem shows:
By the way, my version of ITEX didn't come with
check_env.sh
so I can't run that. I just know it works because sometimes it does and sometimes it doesn't. In Jupyter the device is recognized as this
Also, in the other setups I've read about, the BFC allocator issues involve the allocator that ships with TensorFlow, while mine comes from the ITEX build files.
I can see the repo is available for rebuilding and there might be a chance to find out what is happening there, but I don't have the time or ability to do so.
I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but was unable to expand it.
I also tried exporting this in the conda environment, with no effect:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096
I said earlier that it works: yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.
What is consistent is that it tries to allocate that curr_region_allocation bytes: 14975071232; that value is very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?
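One thing worth ruling out: allocator settings like ITEX_LIMIT_MEMORY_SIZE_IN_MB are presumably read when the backend initializes, so setting the variable from inside an already-running Jupyter kernel may be too late. A sketch of setting it programmatically before the first tensorflow import (the variable name is from the ITEX docs; whether this fixes this particular allocation is an assumption):

```python
import os

# Must be set before `import tensorflow`; once ITEX has initialized its
# allocator, changing the variable has no effect on the running process.
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "4096"

import tensorflow as tf  # noqa: E402 - imported after the env var on purpose
```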