spack / spack

A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
https://spack.io

Installation issue: Bazel 3.7.2 - 4.2.2 #35281

Open serikkehva opened 1 year ago

serikkehva commented 1 year ago

Steps to reproduce the issue

$ spack install -v py-horovod frameworks=tensorflow,keras bazel@4.1.0
or
$ spack install -v py-horovod frameworks=tensorflow,keras bazel@3.7.2
or
$ spack install -v py-horovod frameworks=tensorflow,keras
Error message

ERROR: /mnt/storage_2/scratch/grant_531/tempfiles/spack-stage/spack-stage-bazel-4.2.2-whsmwrnuqai5dzflmvdhpo7gpz37xlu3/spack-src/tools/jdk/BUILD:346:14: Action tools/jdk/platformclasspath.jar failed: (Exit 1): java failed: error executing command 
  (cd /tmp/bazel_M4P7cLZ2/out/execroot/io_bazel && \
  exec env - \
  external/local_jdk/bin/java -XX:+IgnoreUnrecognizedVMOptions '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.platform=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' -cp bazel-out/k8-opt-exec-EDC14992/bin/tools/jdk/platformclasspath_classes:external/local_jdk/lib/tools.jar DumpPlatformClassPath bazel-out/k8-opt-exec-EDC14992/bin/tools/jdk/platformclasspath.jar external/local_jdk)
Execution platform: //:default_host_platform
[1,163 / 1,177] checking cached actions

ERROR: I/O error while writing action log: No space left on device

[1,163 / 1,177] checking cached actions

Target //src:bazel_nojdk failed to build
[1,163 / 1,177] checking cached actions

ERROR: Error while writing profile file: No space left on device
[1,163 / 1,177] checking cached actions

INFO: Elapsed time: 1078.218s, Critical Path: 58.86s
[1,163 / 1,177] checking cached actions

INFO: 1163 processes: 50 internal, 1104 local, 9 worker.
[1,163 / 1,177] checking cached actions

FAILED: Build did NOT complete successfully

FAILED: Build did NOT complete successfully

ERROR: Could not build Bazel
==> Error: ProcessError: Command exited with status 1:
    '/usr/bin/bash' './compile.sh'

7 errors found in build log:
     19       
     20       
     21       Loading: 0 packages loaded
     22           Fetching @bazel_toolchains; fetching
     23       
     24       
  >> 25       DEBUG: /tmp/bazel_M4P7cLZ2/out/external/bazel_toolchains/rules/rb
              e_repo/version_check.bzl:59:14:
     26       Current running Bazel is not a release version and one was not de
              fined explicitly in rbe_autoconfig target. Falling back to '3.1.0
              '
     27       Loading: 0 packages loaded
     28       
  >> 29       DEBUG: /tmp/bazel_M4P7cLZ2/out/external/bazel_toolchains/rules/rb
              e_repo/checked_in.bzl:103:14: rbe_ubuntu1804_java11 not using che
              cked in configs as detect_java_home was set to True
     30       Loading: 0 packages loaded
     31       
  >> 32       DEBUG: /tmp/bazel_M4P7cLZ2/out/external/bazel_toolchains/rules/rb
              e_repo/version_check.bzl:59:14:
     33       Current running Bazel is not a release version and one was not de
              fined explicitly in rbe_autoconfig target. Falling back to '3.1.0
              '
     34       Loading: 0 packages loaded
     35       
  >> 36       DEBUG: /tmp/bazel_M4P7cLZ2/out/external/bazel_toolchains/rules/rb
              e_repo/checked_in.bzl:103:14: rbe_ubuntu1604_java8 not using chec
              ked in configs as detect_java_home was set to True
     37       Loading: 0 packages loaded
     38       
     39       Loading: 0 packages loaded
     40           Fetching @io_bazel_skydoc; fetching
     41           Fetching ...tar.gz; Checking SHA-256 of /mnt/storage_2/scratc
              h/grant_531/t\
     42       empfiles/spack-stage/spack-stage-bazel-4.2.2-whsmwrnuqai5dzflmvdh
              po7gpz37xlu3/\

     ...

     91715    [1,162 / 1,177] [Prepa] Action tools/jdk/platformclasspath.jar
     91716    
     91717    [1,162 / 1,177] Action tools/jdk/platformclasspath.jar; 0s local
     91718    
     91719    [1,162 / 1,177] Action tools/jdk/platformclasspath.jar; 1s local
     91720    
  >> 91721    ERROR: /mnt/storage_2/scratch/grant_531/tempfiles/spack-stage/spa
              ck-stage-bazel-4.2.2-whsmwrnuqai5dzflmvdhpo7gpz37xlu3/spack-src/t
              ools/jdk/BUILD:346:14: Action tools/jdk/platformclasspath.jar fai
              led: (Exit 1): java failed: error executing command
     91722      (cd /tmp/bazel_M4P7cLZ2/out/execroot/io_bazel && \
     91723      exec env - \
     91724      external/local_jdk/bin/java -XX:+IgnoreUnrecognizedVMOptions '-
              -add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '-
              -add-exports=jdk.compiler/com.sun.tools.javac.platform=ALL-UNNAME
              D' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAM
              ED' -cp bazel-out/k8-opt-exec-EDC14992/bin/tools/jdk/platformclas
              spath_classes:external/local_jdk/lib/tools.jar DumpPlatformClassP
              ath bazel-out/k8-opt-exec-EDC14992/bin/tools/jdk/platformclasspat
              h.jar external/local_jdk)
     91725    Execution platform: //:default_host_platform
     91726    [1,163 / 1,177] checking cached actions
     91727    

     ...

     91737    INFO: Elapsed time: 1078.218s, Critical Path: 58.86s
     91738    [1,163 / 1,177] checking cached actions
     91739    
     91740    INFO: 1163 processes: 50 internal, 1104 local, 9 worker.
     91741    [1,163 / 1,177] checking cached actions
     91742    
  >> 91743    FAILED: Build did NOT complete successfully
     91744    
  >> 91745    FAILED: Build did NOT complete successfully
     91746    
     91747    ERROR: Could not build Bazel

Information on your system

Additional information

spack-build-out.txt. Detailed (verbose) error output is in the error message above. I have tried building this with both of the following Java versions:

$ java -version
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)

and

$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Temurin-11.0.17+8 (build 11.0.17+8)
OpenJDK 64-Bit Server VM Temurin-11.0.17+8 (build 11.0.17+8, mixed mode)

I have also tried several Bazel versions, all resulting in a similar error. @adamjstewart @aweits

General information

adamjstewart commented 1 year ago

Can you upload the build log (spack-build-out.txt)? The error message should end with the path to this file.

serikkehva commented 1 year ago

I have added spack-build-out.txt. I had to run the installation one more time, and the error occurred at a different stage of the installation; nevertheless, it looks very similar to the previous one.

adamjstewart commented 1 year ago

Do newer versions of bazel work for you? Is there any reason you're trying to use these older versions?

serikkehva commented 1 year ago

To be honest, I haven't tried them. Bazel 4.2.2 is preferred for the py-horovod package. After running into problems with 4.2.2, I switched to versions that appear more often in the issues to see whether the errors match. I will give newer versions a try.

adamjstewart commented 1 year ago

py-horovod doesn't use bazel...

These versions are more popular in the issues because they have more issues. I would suggest using the newest version you can for all packages. They usually include important bug fixes and are better tested.

serikkehva commented 1 year ago

I don't understand the first sentence. Bazel is in the dependency list of py-horovod and needs to be installed if you specify the tensorflow and keras frameworks.

I have tried installing bazel 5.2.0 with both Java 1.8.0_352 and 11.0.17: spack-build-out_bazel5_2_0.txt, spack-build-out_bazel5_2_0_java11.txt

adamjstewart commented 1 year ago

What I'm saying is that bazel is not a direct dependency of horovod; it's TF and Keras that need it. And TF builds fine with bazel 5.1.1, so I don't understand why you would want to build bazel 3 or 4.

From the error logs:

ERROR: An error occurred during the fetch of repository 'remotejdk11_linux':
   Traceback (most recent call last):
        File "/tmp/bazel_kw0Daf3Q/out/external/bazel_tools/tools/build_defs/repo/http.bzl", line 100, column 45, in _http_archive_impl
                download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: No space left on device

This is probably your issue. Can you try changing your build_stage to a filesystem with more storage? Also note that Bazel will crash on NFS, so be careful which filesystem you choose. Not sure if this will help with TensorFlow where we build in tempfile.mkdtemp(). But it's a start at least.
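As a rough way to compare candidate locations before moving the build stage, the sketch below reports free space per directory using Python's standard library. The helper names `free_gib` and `report` are made up for illustration and are not part of Spack or Bazel. It only measures free space; it does not detect whether a path is on NFS.

```python
import os
import shutil
import tempfile


def free_gib(path):
    """Return the free space on the filesystem that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def report(paths):
    """Map each existing directory in `paths` to its free space in GiB."""
    return {p: round(free_gib(p), 1) for p in paths if os.path.isdir(p)}


if __name__ == "__main__":
    # tempfile.gettempdir() is the default parent directory used by
    # tempfile.mkdtemp(); the current directory is just a second example.
    candidates = [tempfile.gettempdir(), os.getcwd()]
    for path, gib in report(candidates).items():
        print(f"{path}: {gib} GiB free")
```

Replace the example candidates with the directories you are considering for the build stage and for temporary files.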

serikkehva commented 1 year ago

Unfortunately, changing build_stage didn't help, but setting TMPDIR did. I managed to install bazel 5.2.0 by setting both build_stage and TMPDIR to custom directories.
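For the Python side of this (the `tempfile.mkdtemp()` call mentioned earlier), the standard `tempfile` module consults the `TMPDIR` environment variable when it picks its default directory. The sketch below shows that with a child process. The helper name `child_tempdir` and the scratch directory are illustrative only, and it says nothing about how Bazel itself chooses its output location.

```python
import os
import subprocess
import sys
import tempfile


def child_tempdir(tmpdir):
    """Run a child Python with TMPDIR set and return the temp dir it selects."""
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a roomy, local (non-NFS) scratch directory.
    scratch = tempfile.mkdtemp(prefix="scratch-")
    print(child_tempdir(scratch))
```

A child process is used because `tempfile` caches its choice after the first lookup, so changing `TMPDIR` inside an already running interpreter may have no effect.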

adamjstewart commented 1 year ago

Glad you got it working! Not sure if there's a better way to choose the default TMPDIR location. We could set it to the build stage for Spack, but the NFS issue makes it tricky to choose a default.