intel / intel-extension-for-tensorflow

Intel® Extension for TensorFlow*

OOM when trying to use Arc 350M #34

Closed ra-health-ai closed 1 year ago

ra-health-ai commented 1 year ago

At the very least, I need some advice on how to optimize my current configuration. I have a 2022 Samsung Galaxy Book2 with an integrated Intel Iris Xe and a discrete Arc 350M. I am aware that the 350M is modest in performance compared to other Arc GPUs, but I still wanted TensorFlow and PyTorch to take advantage of it. Since this is the Intel Extension for TensorFlow forum, I will focus on it, though I will reference my experience with PyTorch because the discrepancies could be useful in determining whether I am hitting any defects. I found the setup information on the Intel website and the information contained in the issues on this page very useful.

Once everything was set up, I was running TensorFlow and Stable Diffusion (based on the web article) from a conda environment on WSL2 under Windows 11. The GPU drivers are the 05/10/2023 version (latest at the time of this post), 31.0.101.4369, and WSL runs Ubuntu 22.04.2 LTS (GNU/Linux 5.15.90.1-microsoft-standard-WSL2 x86_64). Intel Arc Control shows Resizable BAR enabled for the Arc 350M. TensorFlow detects both the integrated card and the 350M as XPU devices.

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))"
2023-05-21 12:49:43.056198: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-21 12:49:43.087444: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-21 12:49:43.635289: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-05-21 12:49:45.558317: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-05-21 12:49:45.558880: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2023-05-21 12:49:45.558989: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2023-05-21 12:49:45.613080: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU')]

The first issue I am running into is that when both cards are enabled, operations trying to access the XPU fail.
$ python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

2023-05-21 12:52:24.489903: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-05-21 12:52:24.489949: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-05-21 12:52:24.490034: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: )
2023-05-21 12:52:24.494335: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:1 with 0 MB memory) -> physical PluggableDevice (device: 1, name: XPU, pci bus id: )
Abort was called at 717 line in file: ./shared/source/os_interface/windows/wddm_memory_manager.cpp
Aborted

This may be a WSL limitation with two GPUs, since the integrated Intel Iris Xe is also exposed as an XPU device. I found an article mentioning something similar: https://medium.com/@tonymongkolsmai/debugging-wsl-and-multiple-gpu-issues-59f28a8cf5d
If I disable the Arc 350M in Device Manager, I can successfully run Stable Diffusion as described in this article: https://www.intel.com/content/www/us/en/developer/articles/technical/running-tensorflow-stable-diffusion-on-intel-arc.html

The integrated card is used as an XPU device and is 2-3 times faster than CPU only. Since I was curious about what the 350M can do in terms of performance, I disabled the Iris Xe (goodbye, large external monitor) and enabled the Arc 350M, looking forward to the performance improvement. The card was recognized and the initial vector operation (print(tf.reduce_sum(tf.random.normal([1000, 1000])))) worked:

2023-05-21 13:11:19.946828: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-05-21 13:11:19.946888: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: )
tf.Tensor(921.372, shape=(), dtype=float32)

Stable Diffusion, however, immediately throws a memory-related exception; at that point, looking at the GPU, memory utilization was 0%.
2023-05-13 17:10:55.926325: E itex/core/devices/bfc_allocator.cc:98] Allocator ran out of memory trying to allocate 117964800 Bytes (rounded to 117964800 Bytes) If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
ResourceExhaustedError: {{function_node __wrapped__AddV2_device_/job:localhost/replica:0/task:0/device:XPU:0}} OOM when allocating tensor with shape[3,3,2560,1280] and type float on /job:localhost/replica:0/task:0/device:XPU:0 by allocator Simple allocator [Op:AddV2]

Time to mention PyTorch. With the Arc 350M disabled, I was able to run Stable Diffusion taking advantage of the Intel Iris Xe as an XPU by following the article below. https://www.intel.com/content/www/us/en/developer/articles/technical/stable-diffusion-with-intel-arc-gpus.html

When I swapped (Iris Xe disabled and Arc 350M enabled), GPU memory usage grows for a few minutes to about 3 GB, but then an out-of-resources runtime error is thrown. The total shared GPU memory is 16 GB, and the Intel Iris Xe took advantage of it; the behavior for the Arc 350M is different.
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

The only reason for mentioning PyTorch is that, while the memory-error outcome is the same and also ends in an abort, on TensorFlow the OOM error is immediate. I do see that TensorFlow reports 0 MB of memory for the device: (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: ). I ran env_check.sh and everything checks out OK.
Check Environment for Intel(R) Extension for TensorFlow*...

======================== Check Python ========================

python3.9 is installed.

==================== Check Python Passed =====================

========================== Check OS ==========================

OS ubuntu:22.04 is Supported.

====================== Check OS Passed =======================

====================== Check Tensorflow ======================

tensorflow2.12 is installed.

================== Check Tensorflow Passed ===================

=================== Check Intel GPU Driver ===================

Intel(R) graphics runtime intel-level-zero-gpu-1.3.25593.18-601 is installed.
Intel(R) graphics runtime intel-opencl-icd-23.05.25593.18-601 is installed.
Intel(R) graphics runtime level-zero-1.9.4+i589 is installed.
Intel(R) graphics runtime libigc1-1.0.13230.8-600 is installed.
Intel(R) graphics runtime libigdfcl1-1.0.13230.8-600 is installed.
Intel(R) graphics runtime libigdgmm12-22.3.5-601 is installed.

=============== Check Intel GPU Driver Finshed ================

===================== Check Intel OneApi =====================

Intel(R) OneAPI DPC++/C++ Compiler is installed.
Intel(R) OneAPI Math Kernel Library is installed.

================= Check Intel OneApi Passed ==================

========================== Check Devices Availability ==========================

2023-05-21 13:16:03.441586: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-21 13:16:03.468314: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-21 13:16:04.126807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-05-21 13:16:05.147352: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-05-21 13:16:05.147632: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2023-05-21 13:16:05.147647: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2023-05-21 13:16:05.230183: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

====================== Check Devices Availability Passed =======================

The Intel Extension for TensorFlow version installed is 1.2.0.
python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"
1.2.0

Thank you

wangkl2 commented 1 year ago

Hi @ra-health-ai,

The OOM issue you hit on both TF and PyTorch seems to come from exhausting the GPU memory that can be allocated when using the default FP32 precision. The Stable Diffusion v1.4 model has about 1B parameters in total, which accounts for about 4 GB of GPU memory in FP32 mode, while the Arc 350M only has 4 GB of physical memory. When you run on the Intel Iris Xe GPU, it uses CPU host memory instead, which is sufficient in your case. To avoid OOM on the Arc 350M, FP16 precision is recommended to reduce memory consumption. You can use ITEX advanced AMP via "export ITEX_AUTO_MIXED_PRECISION=1", or optionally use Keras mixed precision by adding "keras.mixed_precision.set_global_policy('mixed_float16')" to the code. Could you please give it a try?
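As a minimal sketch of the two options (assuming the environment variable is read before TensorFlow/ITEX initializes; setting it in the shell before launching Python is the safer route):

import os
# Option 1: ITEX advanced auto mixed precision. The variable must be set
# before TensorFlow/ITEX is imported (equivalent to: export ITEX_AUTO_MIXED_PRECISION=1).
os.environ["ITEX_AUTO_MIXED_PRECISION"] = "1"

import tensorflow as tf
from tensorflow import keras

# Option 2: Keras mixed precision policy (FP16 compute, FP32 variables).
keras.mixed_precision.set_global_policy("mixed_float16")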

For a further speedup on Stable Diffusion, please also refer to our example: https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/stable_diffussion_inference. Please note that you need to build ITEX from source from the latest branch to use our latest customized ops such as itex.ops.GroupNormalization.

By the way, the output of "(/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: )" that you observed is expected. It is the behavior of stock TensorFlow*: for a pluggable device, i.e. anything other than an NVIDIA GPU, it does not print the actual PCIe bus ID or the GPU memory capacity. The memory capacity can be obtained through other tools such as clinfo or Intel xpu-manager.

ra-health-ai commented 1 year ago

Thank you for the prompt answer @wangkl2. I didn't realize that the default precision for the Stable Diffusion TensorFlow example was FP32; the PyTorch example actually mentions float16. Unfortunately, running SD with "keras.mixed_precision.set_global_policy('mixed_float16')", which was definitely active based on the output messages, still ran out of memory. My main goal is to make sure that the Arc 350M, while limited, is configured properly, since many of the models I will be dealing with will be more modest in size. I would still like your, or your group's, assistance with two more issues. Any hints are welcome.

1. The Arc 350M has 4 GB of dedicated RAM, and I understand that not all of it will be available. With the integrated Iris Xe disabled (otherwise I get crashes; that is the second issue below), I attempted to simply allocate memory out of the dedicated RAM of the Arc 350M.

import tensorflow as tf
tf.debugging.set_log_device_placement(True)
with tf.device('/XPU:0'):
    memalloc = tf.zeros(10241024603)

I have pasted above the highest multiplier (603) I was able to achieve, which I believe corresponds to a memory usage of around 2.35 GB. This looks low to me. Am I hitting a default setting that is more conservative than necessary, and can I tweak it? In PyTorch that number is closer to a bit over 3 GB.

2. As I stated in my original post, both the Intel Iris Xe and the Arc 350M are detected as XPU devices. However, when both are enabled I get the error below. It is the same line number regardless of what operation is being executed.
Abort was called at 717 line in file: ./shared/source/os_interface/windows/wddm_memory_manager.cpp
Aborted

I understand that this may not be related to the Intel extension. Is there anything I could try, though? Disabling the Iris Xe is annoying, since I then have to rely only on my laptop screen. Is there a way, via configuration, to prevent the Iris Xe from being detected as an XPU device without disabling it?

Thank you

wangkl2 commented 1 year ago

Hi @ra-health-ai,

For your 1st question: creating a tensor via "tf.zeros(10241024603)" seems to require 10,241,024,603 * 4 bytes = 41 GB of GPU memory with the default FP32 dtype. May I ask how, or with which tool, you arrived at the "2.35 GB" maximum allocation? Thanks.

For your 2nd question, you can try setting the environment variable "export ZE_AFFINITY_MASK=[value]" before running workloads to force the driver to report only the specified devices (values such as 0 or 1, based on the device index shown by clinfo -l or sycl-ls).
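For illustration, a rough sketch of how this could look from Python (the index "1" is only an example; check clinfo -l or sycl-ls for the index of the Arc 350M on your system, and note the variable must be set before the Level-Zero runtime initializes):

import os

# Expose only one Level-Zero device to the runtime. Setting it in the shell
# (export ZE_AFFINITY_MASK=1) before launching Python is equivalent.
os.environ["ZE_AFFINITY_MASK"] = "1"

import tensorflow as tf
print(tf.config.list_physical_devices("XPU"))  # should now list a single XPU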

ra-health-ai commented 1 year ago

Hello @wangkl2,

Sorry, I didn't realize that copy and paste dropped the '*' (multiplication) operators. I was going for 1024*1024*603 = 632,291,328 elements, which, further multiplied by 4 bytes, gets us into the 2.35 GB territory. That's the highest I could get; I was expecting something closer to 3 GB. I would welcome any additional suggestions.
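For reference, something along these lines can be used to find the limit (a rough sketch, not my exact code; it simply steps the multiplier until the first allocation failure):

import tensorflow as tf

# Grow the allocation until the first failure to find the largest FP32
# tensor the XPU allocator accepts (each step is 1024*1024*n float32
# elements, i.e. 4*n MiB).
with tf.device('/XPU:0'):
    for n in range(100, 1025, 25):
        try:
            t = tf.zeros(1024 * 1024 * n)
            print(f"allocated {4 * n} MiB")
            del t
        except tf.errors.ResourceExhaustedError:
            print(f"failed at {4 * n} MiB")
            break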

I will try the ZE_AFFINITY_MASK approach and report back later.

Thank you

ra-health-ai commented 1 year ago

It took a while to figure out the proper mask, but ZE_AFFINITY_MASK was exactly what I was after. I still have no idea why the crashes in wddm_memory_manager.cpp occur, but this approach avoids them and allows only one card to be active at a time without disabling it in Device Manager. When the Arc GPU is targeted, the integrated video card can still be used to drive the external monitor.

If you have additional information or suggestions to share about the dedicated GPU memory limits for the Arc 350M when TensorFlow is used, that would be great. Thank you

ra-health-ai commented 1 year ago

It looks like the issue I am experiencing has been encountered by other users and is not specific to Intel GPUs: https://github.com/tensorflow/tensorflow/issues/55818 and https://github.com/tensorflow/tensorflow/issues/22623#issuecomment-430482857. Unfortunately, the workaround of creating a logical device with a memory limit close to the total dedicated GPU/XPU memory is not supported by the Intel extension. I couldn't find any other workarounds; maybe you can provide more hints. PyTorch is more generous, with 3.1 GB available, but this is more than likely unrelated to the Intel extensions.
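For reference, the stock-TensorFlow workaround those issues describe looks roughly like the sketch below (shown for a CUDA GPU; as noted above, the equivalent call against an XPU device does not appear to be supported by ITEX 1.2.0, and the 3296 MiB value is only an example):

import tensorflow as tf

# Stock-TensorFlow workaround from the linked issues: explicitly set the
# memory handed to the BFC allocator through a logical device configuration.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=3296)])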

Thank you

guizili0 commented 1 year ago

@ra-health-ai Yes, you are right. The TF memory pool will try to use all GPU memory but leaves 800 MB for other applications. We follow TF's approach and also leave 800 MB for others; details are in https://github.com/intel/intel-extension-for-tensorflow/blob/main/itex/core/devices/bfc_allocator.cc#L25. You can try changing the code there and do a quick test. Thanks for your good suggestion about the memory_limit; we will talk with the team and try to find a proper way to let users utilize more memory.

ra-health-ai commented 1 year ago

Thank you @guizili0. It would be great if this could be configured. Just an empirical observation, since I haven't looked at the code or how the allocator class is used: 4 GB is 4096 MB, and the closest I got in terms of memory allocation is around 2412 MB. This number is actually close to 4096 - 2x800 = 2496. My question, since I do have two devices, is whether this memory reservation is subtracted twice. I did use ZE_AFFINITY_MASK, so only one device, the Arc 350M, is active. In any case, while this is TensorFlow, I assume some code is shared with the PyTorch extension. In PyTorch I get 3212 MB reported by the extension, which is indeed close to 4096 - 800 = 3296. So, one thing to check, and a suggestion for configuration. It looks like 800 MB is reserved twice in my case. Maybe I didn't understand correctly: are you saying that the extension reserves 800 MB and TensorFlow another 800 MB, or is the extension reserving what looks to be 800 MB twice? The suggestion is to make it possible to reduce this hardcoded number via an environment variable or another configuration option. Thank you

guizili0 commented 1 year ago

@ra-health-ai Please share the result of clinfo | grep "Global memory size"; usually our driver reserves some memory.

ra-health-ai commented 1 year ago

$ clinfo | grep "Global memory size"
Global memory size 16578273280 (15.44GiB)
Global memory size 16578273280 (15.44GiB)
Global memory size 16578273280 (15.44GiB)
Global memory size 3370541056 (3.139GiB)

guizili0 commented 1 year ago

@ra-health-ai Thanks. So IPEX and ITEX can only get 3.139 GiB from the driver.

ra-health-ai commented 1 year ago

Thank you @guizili0. I will reiterate my understanding of the situation and ask an additional set of questions. It looks like the reported 3,370,541,056 bytes is exactly what IPEX reports as the available memory for the Arc 350M card:
$ ipex.xpu.get_device_properties(0)
_DeviceProperties(name='Intel(R) Graphics [0x5694]', platform_name='Intel(R) Level-Zero', dev_type='gpu', support_fp64=0, total_memory=3214MB, max_compute_units=96)
If I subtract an additional 800 MB, which is the hardcoded value you pointed to in the bfc_allocator code, I get 2414 MB, and in my testing the highest value I could allocate via TensorFlow on the GPU was 2412 MB. So it all makes sense: both IPEX and ITEX start at 3214 MB, but ITEX reserves an additional 800 MB. As a suggestion, it would be great if this value could be made configurable for ITEX (with proper checks in place, of course). Even folks with higher-end Intel cards will probably welcome that change.

My question is related to the driver reserving an additional 882 MB. First, I want to be sure which driver you are referring to: is it the Intel GPU driver, or the oneAPI toolkit? Why do you think only 3214 MB is available to IPEX and ITEX, and can this be tweaked via configuration?

Thank you

wangkl2 commented 1 year ago

@ra-health-ai Thanks for your observation.

--> This refers to the Intel GPU driver, specifically the UMD (the OpenCL/Level-Zero compute runtime), which reserves some memory to keep resident allocations that are necessary for the system to function properly. Currently, on Windows, the available global memory is 80% of the physical memory (https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/windows/wddm_memory_manager.cpp#L830), while the fraction on Linux is 95%. I think this approximately aligns with the Global memory size of 3.139 GiB reported by clinfo in your case.
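As a quick sanity check of those fractions against the numbers above (assuming the full 4 GiB of physical memory on the Arc 350M):

# Windows exposes ~80% of physical memory, Linux ~95% (per the compute-runtime
# code linked above). Assuming 4 GiB of physical memory on the Arc 350M:
physical = 4 * 1024**3
print(physical * 0.80 / 2**30)   # ~3.20 GiB, close to the 3.139 GiB clinfo reports under WSL2
print(physical * 0.95 / 2**30)   # ~3.80 GiB that a native Linux install would expose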

--> Framework-level components such as ITEX/IPEX and other oneAPI components, which sit on top of the driver, can naturally only allocate memory up to the global memory size reported by the compute runtime. If it is possible, switching to a native Linux platform would give you more allocatable memory. We will also talk with the driver team about whether the permitted memory utilization can be increased on Windows.

ra-health-ai commented 1 year ago

Thank you for your patience in providing the details. I believe we can close this issue, though I would like to recap the findings and suggestions.

Thank you

wangkl2 commented 1 year ago

Hi @ra-health-ai,

For the driver part, the available GPU global memory on both Linux and Windows has recently been extended to 98%. Details can be found here: https://github.com/intel/compute-runtime/commit/076e0a0fa877c50cbf66ca6946a68b306ae9cb1c. A future driver version can be expected to include this commit.

For the ITEX part, please keep monitoring our repository.