Closed by rahulunair 1 year ago
Honestly, if I were to guess, it might be because the workload performs FP64 operations in the background. FP64 is not natively supported by Arc GPUs (they literally have no FP64 units in their microarchitecture), so every FP64 operation takes a very long time while the driver emulates it. My suspicion comes from my experience building ITEX both with and without FP64 support, as described here; @rahulunair, did you have to enable FP64 emulation per the instructions here for the SD TF port to run?
@tedliosu I guess not; in the ITEX official release binary, FP64 emulation should not work, as mentioned in another issue (`-cl-poison-unsupported-fp64-kernels` is used to remove FP64 kernels).
From this issue's description it seems CPU activity dominates, so we will try to reproduce it first. @rahulunair, do you see similar behavior (i.e. dominant CPU activity) when running this on other GPUs?
@yiqianglee I ran the same script rahulunair ran on my laptop's Intel i5-11400H iGPU (I couldn't get the script to work on my Nvidia 3050 Mobile no matter how I tweaked it, because I kept running out of VRAM :confused:), albeit with some minor tweaks, since my machine also has that Nvidia mobile GPU; here's the script I ran, FYI:
```python
# coding: utf-8
import os

# Banish Nvidia to the nether regions >:)
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import intel_extension_for_tensorflow as itex
import tensorflow as tf
from PIL import Image

from stable_diffusion_tf.stable_diffusion import StableDiffusion


def set_backend(backend="GPU"):
    # auto_mixed_precision_options = itex.AutoMixedPrecisionOptions()
    # auto_mixed_precision_options.data_type = itex.FLOAT16
    # graph_options = itex.GraphOptions(auto_mixed_precision_options=auto_mixed_precision_options)
    # graph_options.auto_mixed_precision = itex.ON
    # config = itex.ConfigProto(graph_options=graph_options)
    # itex.set_backend(backend, config)
    itex.set_backend(backend)


if __name__ == "__main__":
    set_backend()
    prompt = "Red air balloons in the blue sky evening golden rays from the sun paris"
    generator = StableDiffusion(
        img_height=512,
        img_width=512,
        jit_compile=False,
    )
    for _ in range(1):
        img = generator.generate(
            prompt,
            num_steps=50,
            unconditional_guidance_scale=7.5,
            temperature=1,
            batch_size=1,
        )
        Image.fromarray(img[0]).save("./sd_tf_fp32.png")
```
And here's a link to a video of the script running on my laptop's Intel iGPU, along with some (hopefully) useful stats displayed while it ran. :smile:
Btw, I installed ITEX from source because I was testing whether an issue I was facing was due to a lack of float16 support in both ITEX and oneDNN, and I had to build ITEX without `-cl-poison-unsupported-fp64-kernels` because some of my own TF scripts wouldn't run with that build option in place; thus I also had to enable FP64 emulation per here before running the script, as otherwise it would segfault. Hopefully that doesn't change things too much for diagnosing the root of this issue. :+1:
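For reference, the FP64 emulation switch lives in the Intel compute runtime/IGC rather than in ITEX itself; assuming the emulation flags documented for the compute runtime (double-check the exact names against your driver version), enabling it looks roughly like this, where the script name is just a placeholder:

```shell
# Turn on software FP64 emulation in the Intel compute runtime / IGC.
# Without these, kernels containing FP64 ops fail (or segfault) on Arc,
# since the hardware has no native FP64 units.
export OverrideDefaultFP64Settings=1   # allow overriding the default FP64 behavior
export IGC_EnableDPEmulation=1         # enable double-precision emulation in IGC

python sd_tf_fp32.py                   # placeholder script name
```

Expect a large slowdown on any FP64-heavy path, since each double-precision op is emulated in software.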
From the video, I think memory movement between CPU and GPU is unlikely to be the issue, because I see almost no activity on the blitter (the copy engine used for H2D/D2H copies), but the CPU is busy; normally, if this were a GPU-bound app, we would not see so much activity on the CPU side. Anyway, thanks for the information @tedliosu, we will have a look.
@tedliosu @yiqianglee so, it was an issue with the drivers on my end.
I had customized my config pretty heavily: mainline 6.0 kernel, self-built drivers, etc. I should have followed the [documentation](https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html) on how to set up Arc on Linux like a sane person :). So I finally went ahead, wiped my system, installed Ubuntu 22.04, and set up the drivers as per the docs; the results for Stable Diffusion on the Arc 770 in FP32 mode are below:
```
In [12]: for _ in range(3):
    ...:     img = generator.generate(
    ...:         prompt,
    ...:         num_steps=50,
    ...:         unconditional_guidance_scale=7.5,
    ...:         temperature=1,
    ...:         batch_size=1,
    ...:     )
    ...:
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.68it/s]
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.69it/s]
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.68it/s]

In [13]:
```
It takes around 29 seconds on average, not the 7 minutes as before :D. I am now looking at the profiles to see if this can be further improved to under 20 seconds (if there are any pointers on this, I am all ears).
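As a quick sanity check, the tqdm rate and the wall time are consistent; a tiny back-of-envelope calculation (plain Python, nothing ITEX-specific):

```python
# Cross-check the reported Stable Diffusion timing on the Arc 770.
steps = 50           # denoising steps per image
rate = 1.68          # it/s, from the tqdm output above

elapsed = steps / rate
print(f"estimated wall time: {elapsed:.1f} s")      # ~29.8 s, matching the [00:29] shown

# Throughput needed to hit a 20-second target:
target_s = 20.0
print(f"required rate: {steps / target_s:.2f} it/s")  # 2.50 it/s, i.e. ~1.5x faster
```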
If anyone wants the drivers, kernel version etc that worked, here you go:
```
05:11:35 rahul@rahul-a770m ~ → uname -r
5.17.0-1019-oem
```
User-mode drivers:
- intel-level-zero-gpu (1.3.23937+i449~u22.04)
- intel-opencl-icd (22.32.23937+i449~u22.04)
- level-zero (1.8.5+i449~u22.04)
- libigdgmm12 (22.1.7+i449~u22.04)
@rahulunair I'm happy to see the new result. :) You can try the ITEX profiler to see which op is the hotspot; internally we see some opportunities as well, work in progress.
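For anyone else landing here, enabling the profiler is a matter of exporting a few environment variables before launching the workload and then viewing the trace in TensorBoard; this is a sketch assuming the switches from the ITEX profiler guide (verify the exact names against your ITEX version; the script name and log directory are placeholders):

```shell
# Environment switches for the ITEX op-level profiler
# (names per the ITEX profiler guide; double-check for your release).
export ZE_ENABLE_TRACING_LAYER=1   # Level Zero tracing, needed for GPU timing
export UseCyclesPerSecondTimer=1
export ENABLE_TF_PROFILER=1        # emit TensorFlow-profiler-compatible traces

python sd_tf_fp32.py               # placeholder script name
tensorboard --logdir=./logdir      # hotspot ops show up under the Profile tab
```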
Thanks @yiqianglee, yup, trying out the profiler. I think this issue can be closed now, thanks for your support!
I have been trying out a few examples and was able to successfully run the Stable Diffusion [1] inference code on an Arc 770, but the execution is very slow; could you please help me debug why?
Most of the time is taken by the XPU offload, and the process gets stuck for more than 3 minutes after the XPU TensorFlow device is created:
After inference completes, there is again a delay of about 2 minutes where the process is busy but the XPU is not; I suspect it might be due to some data movement between the CPU and the XPU device..?
Code to replicate:
With `ITEX_VERBOSE=1`, the logs look like:

[1] https://github.com/divamgupta/stable-diffusion-tensorflow
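For anyone reproducing: the verbose logs came from simply prefixing the run with the environment variable, e.g. (the script name is a placeholder):

```shell
# ITEX_VERBOSE raises the logging verbosity of Intel Extension for TensorFlow,
# printing per-op placement/dispatch details useful for debugging slow offload.
ITEX_VERBOSE=1 python stable_diffusion_inference.py
```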