Closed by rahulunair 1 year ago
Honestly, if I were to guess, it might be because the workload performs FP64 operations in the background. FP64 is not natively supported by Arc GPUs (they literally have no FP64 units in their microarchitecture), so every FP64 operation takes a very long time while the driver emulates it. My suspicion comes from my experience building ITEX both with and without FP64 support, as described here; @rahulunair, did you have to enable FP64 emulation per the instructions here for the SD TF port to run?
@tedliosu I guess not; in the ITEX official release binary, FP64 emulation should not work, as mentioned in another issue (`-cl-poison-unsupported-fp64-kernels` is used to remove FP64 kernels).
From this issue's description it seems CPU activity dominates, so we will try to reproduce it first. @rahulunair, do you see similar behavior (i.e. dominant CPU activity) when running this on other GPUs?
@yiqianglee I ran the same script rahulunair ran on my laptop's Intel i5-11400H iGPU (I couldn't get the script to work on my Nvidia 3050 Mobile no matter how I tweaked it, because I kept running out of VRAM :confused:), albeit with some minor tweaks, since my machine also has that Nvidia mobile GPU; here's the script I ran, FYI:
```python
# coding: utf-8
import os

# Banish Nvidia to the nether regions >:)
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import intel_extension_for_tensorflow as itex
import tensorflow as tf
from PIL import Image

from stable_diffusion_tf.stable_diffusion import StableDiffusion


def set_backend(backend="GPU"):
    # auto_mixed_precision_options = itex.AutoMixedPrecisionOptions()
    # auto_mixed_precision_options.data_type = itex.FLOAT16
    # graph_options = itex.GraphOptions(auto_mixed_precision_options=auto_mixed_precision_options)
    # graph_options.auto_mixed_precision = itex.ON
    # config = itex.ConfigProto(graph_options=graph_options)
    # itex.set_backend(backend, config)
    itex.set_backend(backend)


if __name__ == "__main__":
    set_backend()
    prompt = "Red air balloons in the blue sky evening golden rays from the sun paris"
    generator = StableDiffusion(
        img_height=512,
        img_width=512,
        jit_compile=False,
    )
    for _ in range(1):
        img = generator.generate(
            prompt,
            num_steps=50,
            unconditional_guidance_scale=7.5,
            temperature=1,
            batch_size=1,
        )
        Image.fromarray(img[0]).save("./sd_tf_fp32.png")
```
And here's a link to a video of the script running on my laptop's Intel iGPU, along with some (hopefully) useful stats displayed while it ran. :smile:
Btw, I installed ITEX from source because I was testing whether an issue I was facing was due to a lack of float16 support in both ITEX and oneDNN, and I had to build ITEX without `-cl-poison-unsupported-fp64-kernels` because some of my own TF scripts wouldn't run with that build option in place; thus I also had to enable FP64 emulation per here before running the script, as otherwise it would segfault. Hopefully that doesn't change things too much for diagnosing the root of this issue. :+1:
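For reference, the FP64 emulation switch lives in the Intel compute runtime/IGC rather than in ITEX itself; assuming the emulation flags documented for the compute runtime (double-check the exact names against your driver version), enabling it looks roughly like this, where the script name is just a placeholder:

```shell
# Turn on software FP64 emulation in the Intel compute runtime / IGC.
# Without these, kernels containing FP64 ops fail (or segfault) on Arc,
# since the hardware has no native FP64 units.
export OverrideDefaultFP64Settings=1   # allow overriding the default FP64 behavior
export IGC_EnableDPEmulation=1         # enable double-precision emulation in IGC

python sd_tf_fp32.py                   # placeholder script name
```

Expect a large slowdown on any FP64-heavy path, since each double-precision op is emulated in software.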
From the video, I think memory movement between CPU and GPU is unlikely to be the issue, because I see almost no activity on the blitter (the copy engine used for H2D/D2H copies), but the CPU is busy; normally, if this were a GPU-bound app, we would not see so much activity on the CPU side. Anyway, thanks for the information @tedliosu, we will have a look.
@tedliosu @yiqianglee so, it was an issue with the drivers on my end.
I had customized my config pretty heavily: mainline 6.0 kernel, self-built drivers, etc. I should have followed the [documentation](https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-arc.html) on how to set up Arc on Linux like a sane person :). So I finally went ahead, wiped my system, installed Ubuntu 22.04, and set up the drivers as per the docs; the results for Stable Diffusion on the Arc 770 in FP32 mode are below:
```
In [12]: for _ in range(3):
    ...:     img = generator.generate(
    ...:         prompt,
    ...:         num_steps=50,
    ...:         unconditional_guidance_scale=7.5,
    ...:         temperature=1,
    ...:         batch_size=1,
    ...:     )
    ...:
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.68it/s]
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.69it/s]
0 1: 100%|███████████████████████████████████████████████████████| 50/50 [00:29<00:00, 1.68it/s]

In [13]:
```
It takes around 29 seconds on average, not the 7 minutes as before :D. I am now looking at the profiles to see if this can be further improved to under 20 seconds (if there are any pointers on this, I am all ears).
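As a quick sanity check, the tqdm rate and the wall time are consistent; a tiny back-of-envelope calculation (plain Python, nothing ITEX-specific):

```python
# Cross-check the reported Stable Diffusion timing on the Arc 770.
steps = 50           # denoising steps per image
rate = 1.68          # it/s, from the tqdm output above

elapsed = steps / rate
print(f"estimated wall time: {elapsed:.1f} s")      # ~29.8 s, matching the [00:29] shown

# Throughput needed to hit a 20-second target:
target_s = 20.0
print(f"required rate: {steps / target_s:.2f} it/s")  # 2.50 it/s, i.e. ~1.5x faster
```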
If anyone wants the drivers, kernel version etc that worked, here you go:
```
05:11:35 rahul@rahul-a770m ~ → uname -r
5.17.0-1019-oem
```
User-mode drivers:
- intel-level-zero-gpu (1.3.23937+i449~u22.04)
- intel-opencl-icd (22.32.23937+i449~u22.04)
- level-zero (1.8.5+i449~u22.04)
- libigdgmm12 (22.1.7+i449~u22.04)
@rahulunair I'm happy to see the new result. :) You can try the ITEX profiler to see which op is the hotspot; internally we see some opportunities as well, work in progress.
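For anyone else landing here, enabling the profiler is a matter of exporting a few environment variables before launching the workload and then viewing the trace in TensorBoard; this is a sketch assuming the switches from the ITEX profiler guide (verify the exact names against your ITEX version; the script name and log directory are placeholders):

```shell
# Environment switches for the ITEX op-level profiler
# (names per the ITEX profiler guide; double-check for your release).
export ZE_ENABLE_TRACING_LAYER=1   # Level Zero tracing, needed for GPU timing
export UseCyclesPerSecondTimer=1
export ENABLE_TF_PROFILER=1        # emit TensorFlow-profiler-compatible traces

python sd_tf_fp32.py               # placeholder script name
tensorboard --logdir=./logdir      # hotspot ops show up under the Profile tab
```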
Thanks @yiqianglee, yup, trying out the profiler. I think this issue can be closed now, thanks for your support!
I have been trying out a few examples and was able to successfully run the Stable Diffusion [1] inference code on an Arc 770, but the execution is very slow; could you please help me debug why?
Most of the time is taken by the XPU offload, and the process gets stuck for more than 3 minutes after the XPU TensorFlow device is created:
After inference completes, there is again a delay of about 2 minutes where the process is busy but the XPU is not; I suspect it might be due to some data movement between the CPU and the XPU device..?
Code to replicate:
With `ITEX_VERBOSE=1`, the logs look like:

[1] https://github.com/divamgupta/stable-diffusion-tensorflow
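For anyone reproducing: the verbose logs came from simply prefixing the run with the environment variable, e.g. (the script name is a placeholder):

```shell
# ITEX_VERBOSE raises the logging verbosity of Intel Extension for TensorFlow,
# printing per-op placement/dispatch details useful for debugging slow offload.
ITEX_VERBOSE=1 python stable_diffusion_inference.py
```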