intel / intel-extension-for-tensorflow

Intel® Extension for TensorFlow*

Memory Bug #71

Open raevillena opened 4 days ago

raevillena commented 4 days ago

I am running into a memory issue. Everything works, except that training on larger data crashes the Jupyter notebook kernel.

System Desktop

Ubuntu 22.04 on WSL2
Host: Windows 11
32 GB RAM
AMD 5700x CPU
Intel Arc A750 8GB

Setup: miniconda3 on itex environment

# pip list |grep tensorflow
intel_extension_for_tensorflow     2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
tensorflow                         2.15.0
tensorflow-datasets                4.9.3
tensorflow-estimator               2.15.0
tensorflow-io-gcs-filesystem       0.37.0
tensorflow-metadata                1.15.0

Running model.fit on the training data produces the following (especially with VGG; ResNet works fine):

2024-06-27 16:10:04.010287: I external/tsl/tsl/framework/bfc_allocator.cc:1122] Sum Total of in-use chunks: 513.70MiB
2024-06-27 16:10:04.010290: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982550528 memory_limit_: 7487535513 available bytes: 6504984985 curr_region_allocation_bytes_: 14975071232
2024-06-27 16:10:04.010295: I external/tsl/tsl/framework/bfc_allocator.cc:1129] Stats:
Limit:                      7487535513
InUse:                       538648576
MaxInUse:                    956967680
NumAllocs:                         297
MaxAllocSize:                485714176
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

It crashes no matter what I do when it tries to allocate the roughly 14 GB shown in curr_region_allocation_bytes_.

Global mem shows:

#clinfo | grep "Global memory size"
Global memory size                              16723046400 (15.57GiB)
Global memory size                              8319483904 (7.748GiB)

By the way, my version of ITEX didn't come with env_check.sh, so I can't run it. I only know the installation works because training sometimes runs and sometimes doesn't.

In Jupyter the device is recognized as:

1 Physical GPUs, [LogicalDevice(name='/device:XPU:0', device_type='XPU')]

Also, in the other BFC allocator issues I could find, the setups use the allocator that ships with TensorFlow, while mine comes from the ITEX build files.

I can see that the repo is available for rebuilding, and there might be a chance to find out what is happening there, but I don't have the time or the ability to do so.

I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but is unable to expand it.

I also tried exporting this in the conda environment, with no effect: export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096

I said earlier that it works. Yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.

What is consistent is the value of curr_region_allocation_bytes_: it is 14975071232 on every run, and I don't know why. It makes sense that the OOM happens with that size, but why allocate 14 GB when TF doesn't even need that much for the current workload?
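A quick arithmetic check on the numbers already quoted in this issue may help narrow this down. The sketch below uses only values from the log and the clinfo output above. The reading that the limit is 90% of device memory, and that the region size is the limit rounded up to 256 bytes and then doubled, is an inference from those numbers and has not been confirmed by the ITEX maintainers.

```python
# Values copied from the log and clinfo output quoted above.
GLOBAL_MEM_BYTES = 8319483904      # clinfo "Global memory size" for the Arc A750
MEMORY_LIMIT = 7487535513          # bfc_allocator memory_limit_
CURR_REGION_BYTES = 14975071232    # bfc_allocator curr_region_allocation_bytes_
ALIGNMENT = 256                    # assumed allocator rounding granularity


def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return (n + multiple - 1) // multiple * multiple


# Hypothesis 1: the limit is 90% of the device's global memory.
limit_from_device = GLOBAL_MEM_BYTES * 9 // 10

# Hypothesis 2: the region size is the rounded limit, doubled once.
doubled_rounded_limit = 2 * round_up(MEMORY_LIMIT, ALIGNMENT)

print(limit_from_device == MEMORY_LIMIT)
print(doubled_rounded_limit == CURR_REGION_BYTES)
print(f"{CURR_REGION_BYTES / 2**30:.2f} GiB")
```

If both checks print True, the 14 GB figure is consistent with being a size derived from the memory limit (the next region the allocator would request) rather than something the model itself needs, which would also explain why it is identical on every run.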

raevillena commented 4 days ago

Update: OK, so I was able to solve it.

After reading all the issues and documents I could find, here is what I did, starting from a reboot of WSL.

In your terminal, do this without activating the virtual environment:

export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024

If you have FP64 issues, do this too:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1

Then force-source the oneAPI variables (still in the base environment): source /opt/intel/oneapi/setvars.sh --force

You may now activate the conda environment and set all of the variables again:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=1024

You can check with printenv, which lists the variables set in the conda environment.

Then you can start using TF. In my case: jupyter notebook

All of this was done without reinstalling my system.
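For notebook users, the same variables can also be set from Python before TensorFlow is imported. This is a minimal sketch: the variable names and values are the ones exported above, while the helper name apply_itex_env is made up for illustration. It assumes ITEX reads these variables when its backend is first loaded, and it does not replace sourcing setvars.sh, since library search paths have to be in place before the Python process starts.

```python
import os

# Variables named in this thread; the values mirror the exports above.
ITEX_ENV = {
    "ITEX_LIMIT_MEMORY_SIZE_IN_MB": "1024",
    "OverrideDefaultFP64Settings": "1",
    "IGC_EnableDPEmulation": "1",
}


def apply_itex_env(settings: dict) -> dict:
    """Set each variable in os.environ and return what is now visible."""
    os.environ.update(settings)
    return {name: os.environ.get(name) for name in settings}


visible = apply_itex_env(ITEX_ENV)
for name, value in visible.items():
    print(f"{name}={value}")

# Import TensorFlow only after the variables are set (assumption: ITEX reads
# them when the backend is first loaded).
# import tensorflow as tf
```

Run this in the first notebook cell, before any cell that imports tensorflow. A kernel restart is needed if TensorFlow was already imported.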

guizili0 commented 3 days ago

@raevillena Can you check whether our latest weekly release still has this issue? Thanks.

pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly

raevillena commented 3 days ago

I just tried it without exporting any of the environment variables I mentioned above, but it still gives me:

NotFoundError: libsycl.so.7: cannot open shared object file: No such file or directory

This can be solved by running source /opt/intel/oneapi/setvars.sh --force

I then tried with just setvars, without setting the memory limit, but no: the memory bug is still there.

However, the FP64 emulation now works without setting any environment variables.

guizili0 commented 3 days ago

Can you share the result of pip list | grep intel_extension_for_tensorflow?

raevillena commented 3 days ago

Hi here it is,

(itex) rae@DESKTOP-URAMFL5:~$ pip list | grep intel_extension_for_tensorflow
intel_extension_for_tensorflow            2.15.0.0
intel_extension_for_tensorflow_lib        2.15.0.0.2
intel_extension_for_tensorflow_lib_weekly 2.15.0.1.2.dev20240603
intel_extension_for_tensorflow_weekly     2.15.0.1.dev2024060

Is there another step needed for the newer library to be used by default, or is that it?

guizili0 commented 3 days ago

Please remove the "intel_extension_for_tensorflow" and "intel_extension_for_tensorflow_lib" packages.
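To see from inside the environment whether the stable and weekly packages are installed side by side, something like the following can be used. It is a sketch built on the standard-library importlib.metadata module; the helper names normalize and itex_packages are made up for illustration, and the name normalization only approximates what pip does.

```python
from importlib import metadata


def normalize(name: str) -> str:
    """Lowercase and treat '-' and '_' alike (an approximation of pip's rule)."""
    return name.lower().replace("-", "_")


def itex_packages(names_and_versions):
    """Keep only (name, version) pairs that belong to ITEX."""
    prefix = "intel_extension_for_tensorflow"
    return sorted(
        (name, version)
        for name, version in names_and_versions
        if name and normalize(name).startswith(prefix)
    )


installed = [(d.metadata["Name"], d.version) for d in metadata.distributions()]
found = itex_packages(installed)
for name, version in found:
    print(f"{name:45s} {version}")

stable = [n for n, _ in found if not normalize(n).endswith("weekly")]
weekly = [n for n, _ in found if normalize(n).endswith("weekly")]
if stable and weekly:
    print("Both stable and weekly ITEX packages are installed.")
```

If the last message appears, the two builds are installed side by side, which is the situation shown in the pip list output above.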

raevillena commented 3 days ago

Hi, can I test that after doing some modelling first? It works for now (though sometimes it doesn't).

raevillena commented 3 days ago

I can already tell that after the update the GPU memory is used but the processing runs on the CPU. CPU usage went up to 100% with 0% on the GPU, whereas the original build used the GPU as the XPU. Let me restart WSL to confirm everything. My models went from 5 seconds of training per epoch to 130 seconds, which is not what I expected.

raevillena commented 3 days ago

The update is no longer using the GPU, and this line no longer appears in the logs:

2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.

So it is running purely on the CPU now.
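A small check like the one below can confirm this from a captured log or from TensorFlow itself. It is only a sketch: the helper names are made up for illustration, the marker string is taken from the log line quoted above, and tf.config.list_physical_devices("XPU") is expected to return an empty list when the backend did not load.

```python
BACKEND_MARKER = "GPU backend is loaded"  # substring of the log line quoted above


def gpu_backend_loaded(log_text: str) -> bool:
    """Return True if any log line reports the ITEX GPU backend as loaded."""
    return any(BACKEND_MARKER in line for line in log_text.splitlines())


def xpu_devices():
    """List XPU devices via TensorFlow, or return None if TF is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("XPU")


sample_log = (
    "2024-06-28 19:43:19.472249: I itex/core/wrapper/itex_gpu_wrapper.cc:38] "
    "Intel Extension for Tensorflow* GPU backend is loaded."
)
print(gpu_backend_loaded(sample_log))
print(gpu_backend_loaded("no backend message here"))
```

Calling xpu_devices() in the notebook and getting an empty list would match the CPU-only behaviour described here.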

raevillena commented 3 days ago

I removed all ITEX packages and installed only the weekly build. The GPU is picked up again, but all the errors came back with it too. Back to square one.

feng-intel commented 13 hours ago

Hi @raevillena, how can I reproduce your issue?

raevillena commented 12 hours ago

Hi @feng-intel, here is the summary.

Hardware setup:

Ubuntu 22.04 on WSL2
Host: Windows 11 Enterprise
32 GB RAM (DDR4-3600)
AMD 5700x CPU
Intel Arc A750 8GB

WSL2:

Ubuntu 22.04 official distro
(this runs on Microsoft's special kernel)

Running uname -r gives 5.15.153.1-microsoft-standard-WSL2

From a fresh installation, following the steps here: https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md

sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | 
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update

then

sudo apt-get install \
    intel-igc-cm \
    intel-level-zero-gpu \
    intel-opencl-icd \
    level-zero \
    libigc1 \
    libigdfcl1 \
    libigdgmm12

I needed to install the whole oneAPI Base Toolkit because I needed setvars.sh to source:

wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596.sh
sudo sh ./l_BaseKit_p_2024.1.0.596.sh

then

source /opt/intel/oneapi/setvars.sh

Setting up my conda environment, following: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_gpu_conda.html

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update conda
conda create -n itex -c intel intelpython3_full python=3.9
# I removed the version pin; the original command was: conda create -n itex -c intel intelpython3_full==2023.2.0 python=3.9

I activated my conda environment (conda activate itex) and proceeded as documented:

pip install --upgrade pip
pip install tensorflow==2.15.0
pip install intel-extension-for-tensorflow[xpu]
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh
export path_to_site_packages=`python -c "import site; print(site.getsitepackages()[0])"`
bash ${path_to_site_packages}/intel_extension_for_tensorflow/tools/env_check.sh

But the output says there is no such file or directory for env_check.sh, because it isn't included in the latest version.

Then I installed Jupyter using the commands from here: https://www.intel.com/content/www/us/en/developer/articles/technical/running-tensorflow-stable-diffusion-on-intel-arc.html

pip install notebook
pip install keras tensorflow-datasets matplotlib ipywidgets
jupyter notebook

Here is the sample model:

import tensorflow as tf

# Frozen VGG16 feature extractor without the classification head
base_model = tf.keras.applications.VGG16(include_top=False)
base_model.trainable = False

inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer")
x = tf.keras.layers.Rescaling(1./255)(inputs)
# Feed the rescaled tensor (not the raw inputs) into the base model
x = base_model(x)
x = tf.keras.layers.GlobalAveragePooling2D(name="global_average_pooling_layer")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax", name="output_layer")(x)
model_5 = tf.keras.Model(inputs, outputs)

model_5.compile(loss="categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# train_data_50_test and val_data_50_test are my own datasets (not included here)
history5 = model_5.fit(train_data_50_test,
                       epochs=10,
                       steps_per_epoch=len(train_data_50_test),
                       validation_data=val_data_50_test,
                       validation_steps=int(0.5 * len(val_data_50_test)))

Maybe you have a dataset on your side; I cannot provide my own.
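In the absence of the original data, a stand-in with the same shapes could be generated. The sketch below is an assumption about what would be close enough to reproduce the problem: random 224x224x3 images with one-hot labels for 3 classes, matching the Input and Dense layers of the sample model. The helper names, sample count, and batch size are made up for illustration and do not reflect the real dataset.

```python
import math

IMAGE_SHAPE = (224, 224, 3)   # matches the Input layer in the sample model
NUM_CLASSES = 3               # matches the Dense(3) output layer


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover num_samples once."""
    return math.ceil(num_samples / batch_size)


def make_synthetic_dataset(num_samples: int = 64, batch_size: int = 32):
    """Build a batched tf.data.Dataset of random images and one-hot labels."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    images = tf.random.uniform((num_samples, *IMAGE_SHAPE), maxval=255.0)
    labels = tf.one_hot(
        tf.random.uniform((num_samples,), maxval=NUM_CLASSES, dtype=tf.int32),
        depth=NUM_CLASSES,
    )
    return tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)


print(steps_per_epoch(64, 32))
```

With this in place, train_data_50_test = make_synthetic_dataset() and val_data_50_test = make_synthetic_dataset() can stand in for the missing datasets in the fit call. Whether random data triggers the same allocation pattern as the real images is untested.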

Is there anything I didn't mention, apart from the exact logs? I don't want to redo the setup, so in the meantime I have switched to the CPU while waiting for progress on this.