intel / intel-extension-for-openxla

Build Challenges #35

Closed - coreyjadams closed this issue 3 months ago

coreyjadams commented 3 months ago

When trying to build intel_extension_for_openxla, I hit a couple of issues. Reporting them here since they seem easily fixable upstream.

First, the build config expects MKL and DPCPP to be installed in the same location, but that is not always the case - on Aurora the installation layout differs from a "default" install. I was able to resolve this with:

diff --git a/configure.py b/configure.py
index 73207cd..1e2c47e 100644
--- a/configure.py
+++ b/configure.py
@@ -668,7 +668,10 @@ def set_sycl_toolkit_path(environ_cp):
     """Check if a mkl toolkit path is valid."""
     home_path = toolkit_path.split("compiler")[0]
     version = toolkit_path.split("compiler")[1].split("/")[1]
-    mkl_path = os.path.join(home_path, 'mkl' + '/' + version + '/')
+    if "MKLROOT" in os.environ:
+      mkl_path = os.environ['MKLROOT']
+    else:
+      mkl_path = os.path.join(home_path, 'mkl' + '/' + version + '/')
     exists = (
         os.path.exists(os.path.join(mkl_path, 'include')) and
         os.path.exists(os.path.join(mkl_path, 'lib')))
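With that patch in place, pointing the build at a separately installed MKL is just a matter of exporting MKLROOT before configuring. A minimal sketch (the path is a placeholder, and I'm assuming configure.py is invoked directly):

    # Aurora-style layout: MKL does not live next to the DPC++ compiler install.
    export MKLROOT=/opt/intel/oneapi/mkl/2024.2    # placeholder path
    python configure.py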

Second, the build fails because it cannot find some include files - specifically, in this case, level_zero/ze_api.h. I am not a bazel expert and don't want to become one, but I found that if I manually add CPATH to the file .xla_extension_configure.bazelrc, based on the contents of the CPATH variable in my environment, the build completes (see the sketch below). I see that LD_LIBRARY_PATH is already passed through this way - could you consider adding CPATH here as well?
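For reference, what I did amounts to something like the lines below; the exact option spelling in the generated bazelrc may differ, and the include directory is a placeholder for wherever level_zero/ze_api.h lives on your system:

    # Placeholder: the directory that contains level_zero/ze_api.h.
    export CPATH=/usr/include:$CPATH
    # Hand the variable through to bazel actions, mirroring the LD_LIBRARY_PATH
    # entry that .xla_extension_configure.bazelrc already carries.
    echo "build --action_env=CPATH=${CPATH}" >> .xla_extension_configure.bazelrc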

When I make these changes, I can build successfully.

Thanks, Corey

coreyjadams commented 3 months ago

Another issue that comes up at runtime:

>>> jax.local_devices()
INFO: Intel Extension for OpenXLA version: 0.3.0, commit: fb17dd03
Jax plugin configuration error: Exception when calling jax_plugins.intel_extension_for_openxla.initialize()
Traceback (most recent call last):
  File "/soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 482, in discover_pjrt_plugins
    plugin_module.initialize()
  File "/soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jax_plugins/intel_extension_for_openxla/__init__.py", line 39, in initialize
    c_api = xb.register_plugin("xpu",
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 544, in register_plugin
    c_api = xla_client.load_pjrt_plugin_dynamically(plugin_name, library_path)  # type: ignore
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jaxlib/xla_client.py", line 155, in load_pjrt_plugin_dynamically
    return _xla.load_pjrt_plugin(plugin_name, library_path, c_api=None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to open /soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jax_plugins/intel_extension_for_openxla/pjrt_plugin_xpu.so: /soft/datascience/jax/0.4.25/miniconda3/lib/python3.11/site-packages/jax_plugins/intel_extension_for_openxla/pjrt_plugin_xpu.so: undefined symbol: _ZN4dnnl4impl5graph5utils4id_t7counterE
[CpuDevice(id=0)]

It looks like, in the end, it does not link against oneDNN (dnnl). Any suggestions for patching that into the build?

Zantares commented 3 months ago

Hi @coreyjadams, the OpenXLA extension downloads oneDNN and statically links it into pjrt_plugin_xpu.so by default. Your Aurora environment may interfere with the build, since the system may provide its own oneDNN and set related environment variables.

Could you please list your toolkit environment settings with env and the library dependencies with ldd pjrt_plugin_xpu.so? Here is a simple reference from my environment:

env
...
PKG_CONFIG_PATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/lib/pkgconfig:/mnt1/sdp/intel/oneapi/compiler/2024.2/lib/pkgconfig
DIAGUTIL_PATH=/mnt1/sdp/intel/oneapi/compiler/2024.2/etc/compiler/sys_check/sys_check.sh
PWD=/mnt1/sdp/miniconda3/envs/tenglu/lib/python3.9/site-packages
MANPATH=/mnt1/sdp/intel/oneapi/compiler/2024.2/share/man:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
TRANSFORMERS_CACHE=/data/transformers_cache/
CMAKE_PREFIX_PATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/lib/cmake:/mnt1/sdp/intel/oneapi/compiler/2024.2
CMPLR_ROOT=/mnt1/sdp/intel/oneapi/compiler/2024.2
LIBRARY_PATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/lib/:/mnt1/sdp/intel/oneapi/compiler/2024.2/lib
OCL_ICD_FILENAMES=/mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libintelocl.so
CONDA_PYTHON_EXE=/home1/sdp/mambaforge/bin/python
LD_LIBRARY_PATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/lib:/mnt1/sdp/intel/oneapi/compiler/2024.2/opt/compiler/lib:/mnt1/sdp/intel/oneapi/compiler/2024.2/lib
MKLROOT=/mnt1/sdp/intel/oneapi/mkl/2024.2
NLSPATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/share/locale/%l_%t/%N:/mnt1/sdp/intel/oneapi/compiler/2024.2/lib/compiler/locale/%l_%t/%N
PATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/bin/:/mnt1/sdp/intel/oneapi/compiler/2024.2/bin:/home/sdp/bin:/home/sdp/bin:/home/sdp/miniconda3/envs/tenglu/bin:/home1/sdp/mambaforge/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
CONDA_PREFIX_1=/home1/sdp/mambaforge
CPATH=/mnt1/sdp/intel/oneapi/mkl/2024.2/include:
ldd ./jax_plugins/intel_extension_for_openxla/pjrt_plugin_xpu.so

        linux-vdso.so.1 (0x00007ffda95fe000)
        /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007effc7cb7000)
        libmkl_intel_ilp64.so.2 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_intel_ilp64.so.2 (0x00007effc6f68000)
        libmkl_sequential.so.2 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sequential.so.2 (0x00007effc5a4f000)
        libmkl_core.so.2 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_core.so.2 (0x00007effc1ad9000)
        libmkl_sycl_blas.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_blas.so.4 (0x00007effbc040000)
        libmkl_sycl_lapack.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_lapack.so.4 (0x00007effb974b000)
        libmkl_sycl_sparse.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_sparse.so.4 (0x00007effb2f2c000)
        libmkl_sycl_dft.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_dft.so.4 (0x00007effaff87000)
        libmkl_sycl_vm.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_vm.so.4 (0x00007effa6f07000)
        libmkl_sycl_rng.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_rng.so.4 (0x00007effa16f0000)
        libmkl_sycl_stats.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_stats.so.4 (0x00007eff9f6bc000)
        libmkl_sycl_data_fitting.so.4 => /mnt1/sdp/intel/oneapi/mkl/2024.2/lib/libmkl_sycl_data_fitting.so.4 (0x00007eff9ede8000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007eff9edd6000)
        libimf.so => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libimf.so (0x00007eff9e9b6000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007eff9e8cf000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007eff9e8c8000)
        libsycl.so.7 => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libsycl.so.7 (0x00007eff9e51d000)
        libze_loader.so.1 => /lib/x86_64-linux-gnu/libze_loader.so.1 (0x00007eff9e46b000)
        libOpenCL.so.1 => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libOpenCL.so.1 (0x00007eff9e45c000)
        libsvml.so => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libsvml.so (0x00007eff9ce18000)
        libirng.so => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libirng.so (0x00007eff9cd1f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007eff9ccfd000)
        libintlc.so.5 => /mnt1/sdp/intel/oneapi/compiler/2024.2/lib/libintlc.so.5 (0x00007eff9cc9c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007eff9ca73000)
        /lib64/ld-linux-x86-64.so.2 (0x00007effd5d0c000)

BTW, you may try unsetting any oneDNN-related environment settings and rebuilding.
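As a generic sanity check (not an official step), you can also look at whether the dnnl symbols ended up inside the plugin or were left undefined:

    # "U" in the second column means the symbol is undefined and must come from an
    # external libdnnl; anything else means it was linked into the plugin itself.
    nm -D pjrt_plugin_xpu.so | grep -i dnnl | head
    # Also check whether a system libdnnl is being pulled in at load time.
    ldd pjrt_plugin_xpu.so | grep -i dnnl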

coreyjadams commented 3 months ago

Awesome, yes, this did it. I had already found with ldd that it wasn't linking against oneDNN, but I hadn't considered that you were doing a static build of oneDNN inside the extension.

In case you still want it, here's the list of linked libraries:

    linux-vdso.so.1 (0x00007ffc35bc1000)
    libmkl_intel_ilp64.so.2 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_intel_ilp64.so.2 (0x00007f2495640000)
    libmkl_sequential.so.2 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sequential.so.2 (0x00007f249422c000)
    libmkl_core.so.2 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_core.so.2 (0x00007f24900c8000)
    libmkl_sycl_blas.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_blas.so.4 (0x00007f248aa45000)
    libmkl_sycl_lapack.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_lapack.so.4 (0x00007f2488390000)
    libmkl_sycl_sparse.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_sparse.so.4 (0x00007f2481e0e000)
    libmkl_sycl_dft.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_dft.so.4 (0x00007f247ee65000)
    libmkl_sycl_vm.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_vm.so.4 (0x00007f24760c0000)
    libmkl_sycl_rng.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_rng.so.4 (0x00007f246e2ea000)
    libmkl_sycl_stats.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_stats.so.4 (0x00007f246c369000)
    libmkl_sycl_data_fitting.so.4 => /soft/compilers/oneapi/2024.04.15.001/oneapi/mkl/latest/lib/libmkl_sycl_data_fitting.so.4 (0x00007f246baf9000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f246badd000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f246b992000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f246b96e000)
    libstdc++.so.6 => /opt/aurora/23.275.2/spack/gcc/0.6.1/install/linux-sles15-x86_64/gcc-12.2.0/gcc-12.2.0-jf4ov3v3scg7dvd76qhsuugl3jp42gfn/lib64/libstdc++.so.6 (0x00007f246b74b000)
    libsycl.so.7 => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libsycl.so.7 (0x00007f246b3b6000)
    libze_loader.so.1 => /home/ftartagl/graphics-compute-runtime/agama-ci-devel-803.29/usr/lib64/libze_loader.so.1 (0x00007f246b361000)
    libOpenCL.so.1 => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libOpenCL.so.1 (0x00007f246b352000)
    libgcc_s.so.1 => /opt/aurora/23.275.2/spack/gcc/0.6.1/install/linux-sles15-x86_64/gcc-12.2.0/gcc-12.2.0-jf4ov3v3scg7dvd76qhsuugl3jp42gfn/lib64/libgcc_s.so.1 (0x00007f246b332000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f246b13d000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f24c21f0000)
    libsvml.so => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libsvml.so (0x00007f2469aeb000)
    libirng.so => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libirng.so (0x00007f24699f0000)
    libimf.so => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libimf.so (0x00007f24695d3000)
    libintlc.so.5 => /soft/compilers/oneapi/2024.04.15.001/oneapi/compiler/latest/lib/libintlc.so.5 (0x00007f2469571000)

But I can confirm on a Sunspot node that I see all 12 XPU devices:

❯ python
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.local_devices()
INFO: Intel Extension for OpenXLA version: 0.3.0, commit: fb17dd03

Platform 'xpu' is experimental and not all JAX functionality may be correctly supported!
[xpu(id=0), xpu(id=1), xpu(id=2), xpu(id=3), xpu(id=4), xpu(id=5), xpu(id=6), xpu(id=7), xpu(id=8), xpu(id=9), xpu(id=10), xpu(id=11)]
>>> 
coreyjadams commented 3 months ago

Just in case anyone wants this information in the future, I'll finish out this thread with one more comment. I've got JAX + mpi4jax running on 48 tiles spread over 4 nodes of Sunspot @ Argonne. I intend to continue scaling up and going to bigger problem sizes.

Some further notes:

  • I had to change the way I let JAX find the GPU tiles, or it would try to preallocate all devices onto at least tile 0.0, and maybe more. I couldn't actually debug it: it would hang so badly that my terminal connection would get killed :).

    • To solve this, I set ZE_AFFINITY_MASK to GPU.TILE based on the local rank (see the sketch below).
  • I had built JAX against compute runtime 803.23 but was hitting runtime issues with that version; 803.45 solved those. Maybe this is already known - what runtime version are you targeting currently?
  • MPI4JAX_USE_SYCL_MPI=1 appears to work, though I haven't profiled to confirm.
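
A rough sketch of the launch wrapper I use for the affinity mapping; the local-rank variable is an assumption for the PALS launcher (other MPI stacks expose e.g. MPI_LOCALRANKID), and the GPU.TILE arithmetic assumes 6 GPUs with 2 tiles each per node:

    #!/bin/bash
    # tile_affinity.sh (hypothetical name): pin each local rank to one GPU tile
    # before exec'ing the real command, assuming local ranks 0-11 per node.
    LOCAL_RANK=${PALS_LOCAL_RANKID:-0}    # assumed launcher-provided variable
    export ZE_AFFINITY_MASK=$(( LOCAL_RANK / 2 )).$(( LOCAL_RANK % 2 ))
    exec "$@"

Invoked as something like mpiexec -n 48 --ppn 12 ./tile_affinity.sh python script.py (exact launcher flags depend on your site).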

I get this warning once per rank; is there any way to turn it off?

WARNING:absl:Tensorflow library not found, tensorflow.io.gfile operations will use native shim calls. GCS paths (i.e. 'gs://...') cannot be accessed.

Anyway, things seem to be working for today, so this issue can be closed. Thank you for the help; hopefully dumping my findings and suggestions here will be useful to others in the future.

guizili0 commented 2 months ago

For the warning: it comes from a library that JAX uses. You can try the environment variables below to disable it: TF_CPP_MAX_VLOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=3
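For example, exporting them before launch (the script name is a placeholder):

    export TF_CPP_MAX_VLOG_LEVEL=0
    export TF_CPP_MIN_LOG_LEVEL=3
    python your_jax_script.py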

guizili0 commented 2 months ago

When trying to build intel_extension_for_openxla, I hit a couple of issues. [...]

@coreyjadams support for this build has been added in https://github.com/intel/intel-extension-for-openxla/commit/be86a4c87c5fe7e83c51d6af55cc4aa645797452; you can give it a try.