opendatahub-io / notebooks

Notebook images for ODH
Apache License 2.0

[WIP] feat(rocm): reduce the number of installed packages, focusing on pytorch, tensorflow dependencies and cli tools, excluding developer tools #633

Closed jiridanek closed 1 month ago

jiridanek commented 1 month ago

Description

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/package-manager-integration.html

This way, the images will be optimized to support pytorch and tensorflow use, and the image size will be considerably smaller due to excluding unneeded tools.

How Has This Been Tested?

to do

Merge criteria:

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from jiridanek. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/opendatahub-io/notebooks/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

jiridanek commented 1 month ago

Do we need

?

(see my further comments for answers; we do need some of the libs, otherwise tensorflow crashes)

It would be nice to get rid of rocm-llvm, but I worry that ROCm does some last-moment compilation to adapt code to the installed hardware, so it may not be possible. I will try this once GitHub builds a new set of images for me.

caponetto commented 1 month ago

Huge improvement! See:

jiridanek commented 1 month ago

Checking the tensorflow flavor, the reduced-size image failed:

podman run --user 0 --entrypoint /bin/bash --device=/dev/kfd --device=/dev/dri --ipc=host  --rm -it ghcr.io/jiridanek/notebooks/workbench-images:rocm-jupyter-tensorflow-ubi9-python-3.9-jd_small_rocm_cd39007cd356dd0291185018b230d41865495eef
>>> import tensorflow as tf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/__init__.py", line 48, in <module>
    from tensorflow._api.v2 import __internal__
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/_api/v2/__internal__/__init__.py", line 8, in <module>
    from tensorflow._api.v2.__internal__ import autograph
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/_api/v2/__internal__/autograph/__init__.py", line 8, in <module>
    from tensorflow.python.autograph.core.ag_ctx import control_status_ctx # line: 34
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/autograph/core/ag_ctx.py", line 21, in <module>
    from tensorflow.python.autograph.utils import ag_logging
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/autograph/utils/__init__.py", line 17, in <module>
    from tensorflow.python.autograph.utils.context_managers import control_dependency_on_returns
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/autograph/utils/context_managers.py", line 19, in <module>
    from tensorflow.python.framework import ops
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 40, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/pywrap_tensorflow.py", line 34, in <module>
    self_check.preload_check()
  File "/opt/app-root/lib64/python3.9/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory
(app-root) bash-5.1# dnf whatprovides '*/librccl.so.1'
...
rccl-2.18.6.60100-82.el9.x86_64 : ROCm Communication Collectives Library

Doing

(app-root) bash-5.1# ldd /opt/app-root/lib64/python3.9/site-packages/tensorflow/*.so.*

prints only one library that is not found, that same librccl.so.1:

/opt/app-root/lib64/python3.9/site-packages/tensorflow/libtensorflow_cc.so.2:
        linux-vdso.so.1 (0x00007ffe8f1f3000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fd2ad056000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fd2acf7b000)
        libtensorflow_framework.so.2 => /opt/app-root/lib64/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2 (0x00007fd2aa212000)
        librccl.so.1 => not found
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd2aa20b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd2aa206000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fd2aa201000)
        libhsa-runtime64.so.1 => /opt/rocm-6.1.0/lib/libhsa-runtime64.so.1 (0x00007fd2a9f19000)
        libamdhip64.so.6 => /opt/rocm-6.1.0/lib/libamdhip64.so.6 (0x00007fd2a8791000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd2a8774000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd2a856b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fd310f19000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007fd2a854f000)
        librocprofiler-register.so.0 => /opt/rocm-6.1.0/lib/librocprofiler-register.so.0 (0x00007fd2a84a9000)
        libdrm.so.2 => /opt/amdgpu/lib64/libdrm.so.2 (0x00007fd2a8492000)
        libdrm_amdgpu.so.1 => /opt/amdgpu/lib64/libdrm_amdgpu.so.1 (0x00007fd2a8483000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fd2a8475000)
        libamd_comgr.so.2 => /opt/rocm-6.1.0/lib/libamd_comgr.so.2 (0x00007fd29fe5c000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fd29fe42000)
        libzstd.so.1 => /lib64/libzstd.so.1 (0x00007fd29fd6b000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007fd29fd39000)
/opt/app-root/lib64/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2:
        linux-vdso.so.1 (0x00007ffc7e0bd000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f7f1e493000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f7f1e3b8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f7f1e3b3000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7f1e3ae000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f7f1e3a9000)
        libhsa-runtime64.so.1 => /opt/rocm-6.1.0/lib/libhsa-runtime64.so.1 (0x00007f7f1e0bf000)
        libamdhip64.so.6 => /opt/rocm-6.1.0/lib/libamdhip64.so.6 (0x00007f7f1c937000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7f1c91c000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f7f1c713000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f7f2142d000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007f7f1c6f7000)
        librocprofiler-register.so.0 => /opt/rocm-6.1.0/lib/librocprofiler-register.so.0 (0x00007f7f1c651000)
        libdrm.so.2 => /opt/amdgpu/lib64/libdrm.so.2 (0x00007f7f1c638000)
        libdrm_amdgpu.so.1 => /opt/amdgpu/lib64/libdrm_amdgpu.so.1 (0x00007f7f1c62b000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f7f1c61d000)
        libamd_comgr.so.2 => /opt/rocm-6.1.0/lib/libamd_comgr.so.2 (0x00007f7f14004000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f7f13fea000)
        libzstd.so.1 => /lib64/libzstd.so.1 (0x00007f7f13f13000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f7f13ee1000)
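With many shared objects under site-packages, scanning `ldd` output by eye gets tedious. The following is a minimal sketch, not part of the PR, that filters pasted `ldd` output down to the unresolved entries. The function name `missing_libraries` and the trimmed `SAMPLE` text are illustrative assumptions; the parsing relies only on the `name => not found` and `path:` line shapes visible in the output above.

```python
import re

# Matches ldd lines such as "        librccl.so.1 => not found".
_NOT_FOUND = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def missing_libraries(ldd_output: str) -> dict:
    """Map each inspected file to the shared libraries ldd could not resolve."""
    missing = {}
    current = "<unknown>"
    for line in ldd_output.splitlines():
        # When ldd is given several files it prints an unindented "path:" header per file.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            continue
        match = _NOT_FOUND.match(line)
        if match:
            missing.setdefault(current, []).append(match.group(1))
    return missing


# Trimmed excerpt of the ldd output shown above.
SAMPLE = """\
/opt/app-root/lib64/python3.9/site-packages/tensorflow/libtensorflow_cc.so.2:
        libm.so.6 => /lib64/libm.so.6 (0x00007fd2acf7b000)
        librccl.so.1 => not found
/opt/app-root/lib64/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2:
        libm.so.6 => /lib64/libm.so.6 (0x00007f7f1e3b8000)
"""

if __name__ == "__main__":
    for path, libs in missing_libraries(SAMPLE).items():
        print(path, "->", ", ".join(libs))
```

Note that `ldd` only reports link-time dependencies; libraries that a program loads later with `dlopen` do not appear in its output.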
jiridanek commented 1 month ago

Next up,

tf.config.list_physical_devices('GPU')

This does not find any GPUs in the minimal image after I fix the missing library above, but it works fine in the full image. How can I find out what's missing?

jiridanek commented 1 month ago
(app-root) bash-5.1# dnf remove hipblas-devel hipfort-devel rocprim-devel roctracer-devel composablekernel-devel hipify-clang hipsparse-devel hsa-rocr-devel migraphx-devel openmp-extras-devel rocblas-devel rocsolver-devel valgrind hipblaslt-devel hsakmt-roct-devel rocprofiler-devel valgrind-devel hipcub-devel hiprand-devel hipsparselt-devel miopen-hip-devel rccl-devel rocfft-devel rocm-opencl-devel rocprofiler-plugins rocsparse-devel hip-samples libdrm-amdgpu-devel rocm-opencl-sdk hipfft-devel hipsolver-devel hiptensor-devel libpciaccess-devel rocalution-devel rocm-hip-runtime-devel rocm-openmp-sdk rocrand-devel systemd-devel
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Dependencies resolved.
=================================================================================================
 Package                  Arch     Version                         Repository               Size
=================================================================================================
Removing:
 composablekernel-devel   x86_64   1.1.0.60100-82.el9              @ROCm                   4.9 G
 hip-samples              x86_64   6.1.40091.60100-82.el9          @ROCm                   423 k
 hipblas-devel            x86_64   2.1.0.60100-82.el9              @ROCm                   2.1 M
 hipblaslt-devel          x86_64   0.7.0.60100-82.el9              @ROCm                   5.5 G
 hipcub-devel             x86_64   3.1.0.60100-82.el9              @ROCm                   831 k
 hipfft-devel             x86_64   1.0.14.60100-82.el9             @ROCm                    61 k
 hipfort-devel            x86_64   0.4.0.60100-82.el9              @ROCm                    86 M
 hipify-clang             x86_64   17.0.0.60100-82.el9             @ROCm                    64 M
 hiprand-devel            x86_64   2.10.16.60100-82.el9            @ROCm                   171 k
 hipsolver-devel          x86_64   2.1.0.60100-82.el9              @ROCm                   401 k
 hipsparse-devel          x86_64   3.0.1.60100-82.el9              @ROCm                   627 k
 hipsparselt-devel        x86_64   0.1.0.60100-82.el9              @ROCm                    66 k
 hiptensor-devel          x86_64   1.2.0.60100-82.el9              @ROCm                    66 k
 hsa-rocr-devel           x86_64   1.13.0.60100-82.el9             @ROCm                   530 k
 hsakmt-roct-devel        x86_64   20240125.3.30.60100-82.el9      @ROCm                   372 k
 libdrm-amdgpu-devel      x86_64   1:2.4.120.60100-1756574.el9     @amdgpu                 688 k
 libpciaccess-devel       x86_64   0.16-6.el9                      @ubi-9-appstream-rpms    15 k
 migraphx-devel           x86_64   2.9.0.60100-82.el9              @ROCm                   1.4 M
 miopen-hip-devel         x86_64   3.1.0.60100-82.el9              @ROCm                   305 k
 openmp-extras-devel      x86_64   17.60.0.60100-82.el9            @ROCm                   179 M
 rccl-devel               x86_64   2.18.6.60100-82.el9             @ROCm                    70 k
 rocalution-devel         x86_64   3.1.1.60100-82.el9              @ROCm                   362 k
 rocblas-devel            x86_64   4.1.0.60100-82.el9              @ROCm                   1.9 M
 rocfft-devel             x86_64   1.0.27.60100-82.el9             @ROCm                    41 k
 rocm-hip-runtime-devel   x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-opencl-devel        x86_64   2.0.0.60100-82.el9              @ROCm                   837 k
 rocm-opencl-sdk          x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-openmp-sdk          x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocprim-devel            x86_64   3.1.0.60100-82.el9              @ROCm                   3.2 M
 rocprofiler-devel        x86_64   2.0.60100.60100-82.el9          @ROCm                   109 k
 rocprofiler-plugins      x86_64   2.0.60100.60100-82.el9          @ROCm                   3.8 M
 rocrand-devel            x86_64   3.0.1.60100-82.el9              @ROCm                   3.3 M
 rocsolver-devel          x86_64   3.25.0.60100-82.el9             @ROCm                   1.4 M
 rocsparse-devel          x86_64   3.1.2.60100-82.el9              @ROCm                   1.8 M
 roctracer-devel          x86_64   4.1.60100.60100-82.el9          @ROCm                   1.3 M
 systemd-devel            x86_64   252-32.el9_4                    @ubi-9-appstream-rpms   482 k
 valgrind                 x86_64   1:3.22.0-2.el9                  @ubi-9-appstream-rpms    29 M
 valgrind-devel           x86_64   1:3.22.0-2.el9                  @ubi-9-appstream-rpms   502 k
Removing dependent packages:
 rocm                     x86_64   6.1.0.60100-82.el9              @ROCm                     9  
Removing unused dependencies:
 amd-smi-lib              x86_64   24.4.1.60100-82.el9             @ROCm                   4.0 M
 half                     x86_64   1.12.0.60100-82.el9             @ROCm                   147 k
 hip-doc                  x86_64   6.1.40091.60100-82.el9          @ROCm                   215 k
 hipblas                  x86_64   2.1.0.60100-82.el9              @ROCm                   978 k
 hipblaslt                x86_64   0.7.0.60100-82.el9              @ROCm                   9.1 M
 hipfft                   x86_64   1.0.14.60100-82.el9             @ROCm                    62 k
 hiprand                  x86_64   2.10.16.60100-82.el9            @ROCm                    20 k
 hipsolver                x86_64   2.1.0.60100-82.el9              @ROCm                   487 k
 hipsparse                x86_64   3.0.1.60100-82.el9              @ROCm                   306 k
 hipsparselt              x86_64   0.1.0.60100-82.el9              @ROCm                   238 M
 hiptensor                x86_64   1.2.0.60100-82.el9              @ROCm                   327 M
 hsa-amd-aqlprofile       x86_64   1.0.0.60100.60100-82.el9        @ROCm                   892 k
 libatomic                x86_64   11.4.1-3.el9                    @ubi-9-baseos-rpms       29 k
 migraphx                 x86_64   2.9.0.60100-82.el9              @ROCm                   334 M
 miopen-hip               x86_64   3.1.0.60100-82.el9              @ROCm                   1.7 G
 mivisionx                x86_64   2.5.0.60100-82                  @ROCm                    54 M
 openblas                 x86_64   0.3.21-2.el9                    @ubi-9-appstream-rpms    82 k
 openblas-serial          x86_64   0.3.21-2.el9                    @ubi-9-appstream-rpms    38 M
 python3-pyyaml           x86_64   5.4.1-6.el9                     @ubi-9-baseos-rpms      673 k
 rocalution               x86_64   3.1.1.60100-82.el9              @ROCm                    83 M
 rocblas                  x86_64   4.1.0.60100-82.el9              @ROCm                   4.4 G
 rocfft                   x86_64   1.0.27.60100-82.el9             @ROCm                   2.0 G
 rocm-dbgapi              x86_64   0.71.0.60100-82.el9             @ROCm                   4.4 M
 rocm-debug-agent         x86_64   2.0.3.60100-82.el9              @ROCm                   134 k
 rocm-developer-tools     x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-device-libs         x86_64   1.0.0.60100-82.el9              @ROCm                   3.2 M
 rocm-gdb                 x86_64   14.1.60100-82.el9               @ROCm                   193 M
 rocm-hip-libraries       x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-hip-sdk             x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-ml-libraries        x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocm-ml-sdk              x86_64   6.1.0.60100-82.el9              @ROCm                     9  
 rocprofiler              x86_64   2.0.60100.60100-82.el9          @ROCm                   3.5 M
 rocrand                  x86_64   3.0.1.60100-82.el9              @ROCm                    60 M
 rocsolver                x86_64   3.25.0.60100-82.el9             @ROCm                   1.3 G
 rocsparse                x86_64   3.1.2.60100-82.el9              @ROCm                   1.3 G
 rocthrust-devel          x86_64   3.0.1.60100-82.el9              @ROCm                   5.4 M
 roctracer                x86_64   4.1.60100.60100-82.el9          @ROCm                   1.2 M
 rocwmma-devel            x86_64   1.4.0.60100-82.el9              @ROCm                   648 k
 rpp                      x86_64   1.5.0.60100-82.el9              @ROCm                    91 M
 sudo                     x86_64   1.9.5p2-10.el9_3                @ubi-9-baseos-rpms      4.0 M
 suitesparse              x86_64   5.4.0-10.el9                    @ubi-9-appstream-rpms   3.5 M
 tbb                      x86_64   2020.3-8.el9                    @ubi-9-appstream-rpms   489 k

Transaction Summary
=================================================================================================
Remove  81 Packages

Freed space: 23 G
Is this ok [y/N]: y

and this no longer works ;(

# python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
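The `dnf remove` transaction earlier in this comment is dominated by a few multi-gigabyte packages (for example hipblaslt-devel at 5.5 G and composablekernel-devel at 4.9 G). A rough sketch for ranking the rows of such a listing by size follows. The helper names `parse_row` and `largest_packages` are made up for illustration, and treating `k`/`M`/`G` as 1024-based units is an assumption about how dnf formats the Size column.

```python
# Size suffixes as printed in dnf's "Size" column; 1024-based units are an assumption.
UNITS = {"": 1, "k": 1024, "M": 1024**2, "G": 1024**3}


def parse_row(line):
    """Return (package, size_in_bytes) for a dnf transaction row, else None."""
    fields = line.split()
    # Expected columns: name, arch, version, @repo, size, and an optional unit.
    if len(fields) not in (5, 6) or not fields[3].startswith("@"):
        return None
    unit = fields[5] if len(fields) == 6 else ""
    if unit not in UNITS:
        return None
    try:
        size = float(fields[4])
    except ValueError:
        return None
    return fields[0], int(size * UNITS[unit])


def largest_packages(dnf_output, count=5):
    """Return the `count` biggest packages found in pasted dnf output."""
    rows = [row for row in map(parse_row, dnf_output.splitlines()) if row]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]


# A few rows copied from the transaction listing above.
SAMPLE = """\
 composablekernel-devel   x86_64   1.1.0.60100-82.el9              @ROCm                   4.9 G
 hipblaslt-devel          x86_64   0.7.0.60100-82.el9              @ROCm                   5.5 G
 hipfft-devel             x86_64   1.0.14.60100-82.el9             @ROCm                    61 k
 rocblas                  x86_64   4.1.0.60100-82.el9              @ROCm                   4.4 G
"""

if __name__ == "__main__":
    for name, size in largest_packages(SAMPLE, 3):
        print(f"{name}: {size / 1024**3:.1f} GiB")
```

Ranking the rows this way makes it easier to see which individual packages are worth trying to keep out of the image.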
jiridanek commented 1 month ago
# python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
2024-07-23 16:02:09.830796: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2024-07-23 16:02:09.894891: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 16:02:13.457500: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

but strace gives nothing useful

(app-root) bash-5.1# strace -f python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))' |& grep dlopen

this is not useful either

find /opt/app-root/lib64/python3.9/site-packages/ -name '*.so.*' -exec echo {} \; -exec ldd {} \;

but this IS useful

# TF_CPP_MAX_VLOG_LEVEL=3 python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
2024-07-23 16:10:22.694251: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2024-07-23 16:10:22.694361: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-07-23 16:10:22.694375: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:855] GCS cache max size = 0 ; block size = 67108864 ; max staleness = 0
2024-07-23 16:10:22.694383: I external/local_tsl/tsl/platform/cloud/ram_file_block_cache.h:64] GCS file block cache is disabled
2024-07-23 16:10:22.694391: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:895] GCS DNS cache is disabled, because GCS_RESOLVE_REFRESH_SECS = 0 (or is not set)
2024-07-23 16:10:22.694398: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:925] GCS additional header DISABLED. No environment variable set.
2024-07-23 16:10:22.694405: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-07-23 16:10:22.750730: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-07-23 16:10:22.750774: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:855] GCS cache max size = 0 ; block size = 67108864 ; max staleness = 0
2024-07-23 16:10:22.750782: I external/local_tsl/tsl/platform/cloud/ram_file_block_cache.h:64] GCS file block cache is disabled
2024-07-23 16:10:22.750792: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:895] GCS DNS cache is disabled, because GCS_RESOLVE_REFRESH_SECS = 0 (or is not set)
2024-07-23 16:10:22.750799: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:925] GCS additional header DISABLED. No environment variable set.
2024-07-23 16:10:22.750806: I external/local_tsl/tsl/platform/cloud/gcs_file_system.cc:306] GCS RetryConfig: init_delay_time_us = 1000000 ; max_delay_time_us = 32000000 ; max_retries = 10
2024-07-23 16:10:22.755794: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 16:10:26.009353: I external/local_xla/xla/parse_flags_from_env.cc:196] For env var TF_XLA_FLAGS found arguments:
2024-07-23 16:10:26.009467: I external/local_xla/xla/parse_flags_from_env.cc:198]   argv[0] = <argv[0]>
2024-07-23 16:10:26.009507: I external/local_xla/xla/parse_flags_from_env.cc:196] For env var TF_JITRT_FLAGS found arguments:
2024-07-23 16:10:26.009520: I external/local_xla/xla/parse_flags_from_env.cc:198]   argv[0] = <argv[0]>
2024-07-23 16:10:26.009555: I tensorflow/compiler/jit/xla_cpu_device.cc:46] Not creating XLA devices, tf_xla_enable_xla_devices not set and XLA device creation not requested
2024-07-23 16:10:26.009974: I external/local_tsl/tsl/platform/default/dso_loader.cc:57] Successfully opened dynamic library libamdhip64.so
2024-07-23 16:10:26.374922: I external/local_xla/xla/stream_executor/rocm/rocm_gpu_executor.cc:757] trying to read NUMA node for device ordinal: 0
2024-07-23 16:10:26.375026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2238] Found device 0 with properties: 
pciBusID: 0000:b3:00.0 name: AMD Instinct MI210     ROCm AMDGPU Arch: gfx90a:sramecc+:xnack-
coreClock: 1.7GHz coreCount: 104 deviceMemorySize: 62.98GiB deviceMemoryBandwidth: 1.49TiB/s
2024-07-23 16:10:26.375220: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] Could not load dynamic library 'librocblas.so'; dlerror: librocblas.so: cannot open shared object file: No such file or directory
2024-07-23 16:10:26.375286: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] Could not load dynamic library 'libMIOpen.so'; dlerror: libMIOpen.so: cannot open shared object file: No such file or directory
2024-07-23 16:10:26.375351: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] Could not load dynamic library 'libhipfft.so'; dlerror: libhipfft.so: cannot open shared object file: No such file or directory
2024-07-23 16:10:26.375414: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] Could not load dynamic library 'librocrand.so'; dlerror: librocrand.so: cannot open shared object file: No such file or directory
2024-07-23 16:10:26.375427: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2024-07-23 16:10:26.375466: I tensorflow/compiler/jit/xla_gpu_device.cc:49] Not creating XLA devices, tf_xla_enable_xla_devices not set and XLA devices creation not required
[]
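The verbose log buries the four relevant lines among a lot of unrelated output. As a small sketch, the library names can be pulled out of a captured log with a regular expression. The function name `unloadable_libraries` is illustrative, and the pattern assumes the exact `Could not load dynamic library '...'` wording seen in the log above.

```python
import re

# Wording taken from the dso_loader.cc lines in the log above.
_PATTERN = re.compile(r"Could not load dynamic library '([^']+)'")


def unloadable_libraries(log_text):
    """Return the library names TensorFlow failed to dlopen, in first-seen order."""
    seen = []
    for name in _PATTERN.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two lines copied from the log above.
SAMPLE = (
    "2024-07-23 16:10:26.375220: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] "
    "Could not load dynamic library 'librocblas.so'; dlerror: librocblas.so: "
    "cannot open shared object file: No such file or directory\n"
    "2024-07-23 16:10:26.375286: I external/local_tsl/tsl/platform/default/dso_loader.cc:63] "
    "Could not load dynamic library 'libMIOpen.so'; dlerror: libMIOpen.so: "
    "cannot open shared object file: No such file or directory\n"
)

if __name__ == "__main__":
    print(unloadable_libraries(SAMPLE))
```

The resulting names can then be fed to `dnf whatprovides '*/NAME'`, as was done for librccl.so.1 earlier in the thread.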
jiridanek commented 1 month ago
# dnf install rocblas-devel rocrand-devel hipfft-devel miopen-hip-devel

takes care of the problem. It has to be the -devel versions, because they create the unversioned .so symlinks that tensorflow tries to load.
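Whether those unversioned names resolve can be checked without starting TensorFlow. The sketch below tries to load each name through Python's `ctypes`, which goes through the system dynamic loader. The helper name `can_dlopen` is hypothetical, and the check only approximates what TensorFlow does, since search paths embedded in TensorFlow's own binaries are not taken into account here.

```python
import ctypes

# Unversioned names taken from the TensorFlow log earlier in the thread.
NAMES = ["librocblas.so", "libMIOpen.so", "libhipfft.so", "librocrand.so"]


def can_dlopen(name):
    """Return True if the dynamic loader can find and load `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    for name in NAMES:
        print(f"{name}: {'ok' if can_dlopen(name) else 'NOT loadable'}")
```

Be aware that a successful check really loads the library into the Python process, so this is best run as a throwaway script.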

jiridanek commented 1 month ago

aaand, failed again

>>> prediction = model(x_train[:1])
2024-07-23 16:23:12.402430: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-07-23 16:23:12.404076: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-07-23 16:23:12.408049: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-07-23 16:23:12.409286: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-07-23 16:23:12.410028: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
Segmentation fault (core dumped)

https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/tensorflow-install.html

https://github.com/mpeschel10/test-tensorflow-rocm/blob/main/test_tensorflow.py

strace -f helps,

[pid  8435] openat(AT_FDCWD, "/opt/rocm-6.1.0/lib/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] openat(AT_FDCWD, "/lib64/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] openat(AT_FDCWD, "/usr/lib64/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] munmap(0x7fe297eb9000, 28163) = 0
[pid  8435] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x58} ---

ergo

(app-root) bash-5.1# dnf install hipblaslt-devel

and now

2024-07-23 16:38:03.666986: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-07-23 16:38:03.686775: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:237] bitcode module is required by this HLO module but was not found at /opt/rocm-6.1.0/amdgcn/bitcode/opencl.bc
error: Failure when generating HSACO
2024-07-23 16:38:03.687214: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:207] INTERNAL: Generating device code failed.
2024-07-23 16:38:03.687842: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
  File "/opt/app-root/src/test_tensorflow.py", line 40, in <module>
    prediction = model(x_train[:1])
  File "/opt/app-root/lib64/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/app-root/lib64/python3.9/site-packages/keras/src/backend.py", line 5395, in relu
    x = tf.nn.relu(x)
tensorflow.python.framework.errors_impl.UnknownError: Exception encountered when calling layer 'dense' (type Dense).

{{function_node __wrapped__Relu_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Relu] name: 

Call arguments received by layer 'dense' (type Dense):
  • inputs=tf.Tensor(shape=(1, 784), dtype=float32)
# dnf whatprovides '*/opencl.bc'
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 1:34:49 ago on Tue 23 Jul 2024 03:05:30 PM UTC.
rocm-device-libs-1.0.0.60100-82.el9.x86_64 : Radeon Open Compute - device libraries
Repo        : ROCm
Matched from:
Filename    : /opt/rocm-6.1.0/lib/llvm/lib/clang/17/lib/amdgcn/bitcode/opencl.bc

...

And with that, the sample compute task succeeded!
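`dlopen` is a library function rather than a system call, so it never appears by name in strace output; only the underlying `openat` attempts do, as in the strace excerpt earlier in this comment. A hedged sketch for summarizing a captured trace follows: it reports shared objects for which every `openat` attempt failed. The function name `never_found` is illustrative, and the regular expression assumes the `openat(...) = result` line shape shown in that excerpt.

```python
import os
import re

# Captures the path argument and the numeric return value of each openat call.
_OPENAT = re.compile(r'openat\([^,]+,\s*"([^"]+)"[^)]*\)\s*=\s*(-?\d+)')


def never_found(strace_text):
    """Map each .so basename that was never opened successfully to the paths tried."""
    failed = {}
    found = set()
    for path, result in _OPENAT.findall(strace_text):
        base = os.path.basename(path)
        if ".so" not in base:
            continue
        if int(result) >= 0:
            found.add(base)
        else:
            failed.setdefault(base, []).append(path)
    # A library that failed in one directory but opened in another is not missing.
    return {base: paths for base, paths in failed.items() if base not in found}


# Lines copied from the strace excerpt earlier in this comment.
SAMPLE = """\
[pid  8435] openat(AT_FDCWD, "/opt/rocm-6.1.0/lib/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] openat(AT_FDCWD, "/lib64/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] openat(AT_FDCWD, "/usr/lib64/libhipblaslt.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  8435] munmap(0x7fe297eb9000, 28163) = 0
"""

if __name__ == "__main__":
    for base, paths in never_found(SAMPLE).items():
        print(base, "tried in:", ", ".join(os.path.dirname(p) for p in paths))
```

Because the loader normally probes several directories before it succeeds, filtering out names that were eventually opened keeps the report limited to libraries that are really absent.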

jiridanek commented 1 month ago
# dnf repoquery --whatrequires rocm-device-libs
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 1:38:17 ago on Tue 23 Jul 2024 03:05:30 PM UTC.
openmp-extras-devel-0:17.60.0.60100-82.el9.x86_64
rocm-dev-0:6.1.0.60100-82.el9.x86_64
rocm-hip-runtime-devel-0:6.1.0.60100-82.el9.x86_64
rocm-openmp-sdk-0:6.1.0.60100-82.el9.x86_64

There does not seem to be a good top-level dependency to install in order to get rocm-device-libs.

jiridanek commented 1 month ago

Torch seems to be working out of the box,

>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7fb1b29a7e80>
>>> torch.cuda.get_device_name(0)
'AMD Instinct MI210'
>>> torch.cuda.is_available()
True
>>> torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
tensor([5, 5, 5], device='cuda:0')
jiridanek commented 1 month ago
(app-root) bash-5.1# python3 -c 'import torch; print(torch.version.hip)'
6.0.32830-d62f6a171
# pip3 install torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
openshift-ci[bot] commented 1 month ago

@jiridanek: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/rocm-notebooks-e2e-tests | 093ff53b4ecbbd638395c07861932fce03e71b27 | link | true | /test rocm-notebooks-e2e-tests |
| ci/prow/images | 093ff53b4ecbbd638395c07861932fce03e71b27 | link | true | /test images |
| ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror | 093ff53b4ecbbd638395c07861932fce03e71b27 | link | true | /test notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror |
| ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror | 093ff53b4ecbbd638395c07861932fce03e71b27 | link | true | /test notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror |


Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
jiridanek commented 1 month ago

This did not go well. By adding the libs that were needed for TensorFlow, I ended up bloating the image back up to its original size ;( There are fewer things installed, but it does not matter. Trivy is still unhappy scanning the result.

caponetto commented 1 month ago

=/ In this case, I will disable the image scan for rocm-jupyter-pytorch-ubi9-python-3.9. Instead, we can at least run the scan for the lock file, which I believe is faster and doesn't require dealing with the image.