openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

Mac M1 build failure with `--config=monolithic`: no such target '@local_config_rocm//rocm:hipfft' #2360

Closed joelberkeley closed 1 year ago

joelberkeley commented 1 year ago

I'm running the Docker build on a Mac M1. This isn't documented, but I got some of the way there by adding --platform linux/x86_64/v8 to my docker run command. I've used the default configuration, but with the option --config=monolithic, in

docker exec xla bazel build --test_output=all --spawn_strategy=sandboxed --nocheck_visibility --config=monolithic //xla/...

I'm seeing the error

ERROR: /xla/xla/stream_executor/rocm/BUILD:205:11: no such target '@local_config_rocm//rocm:hipfft': target 'hipfft' not declared in package 'rocm' defined by /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/local_config_rocm/rocm/BUILD and referenced by '//xla/stream_executor/rocm:hipfft_if_static'
ERROR: Analysis of target '//xla/stream_executor/rocm:hipfft_if_static' failed; build aborted: Analysis failed

which is surprising: I used the default config, so I wasn't expecting rocm to be involved.

The full logs are

Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /xla/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /etc/bazel.bazelrc:
  'build' options: --action_env=DOCKER_CACHEBUSTER=1680717592238941475 --host_action_env=DOCKER_HOST_CACHEBUSTER=1680717592321081121
INFO: Reading rc options for 'build' from /xla/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Reading rc options for 'build' from /xla/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env PYTHON_LIB_PATH=/usr/lib/python3/dist-packages --python_path=/usr/bin/python3 --config=nonccl --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-gpu,-oss_serial --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-gpu
INFO: Reading rc options for 'build' from /xla/.bazelrc:
  'build' options: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /xla/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /xla/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:nonccl in file /xla/.bazelrc: --define=no_nccl_support=true
INFO: Found applicable config definition build:monolithic in file /xla/.bazelrc: --define framework_shared_object=false --define tsl_protobuf_header_only=false --experimental_link_static_libraries_once=false
INFO: Found applicable config definition build:linux in file /xla/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /xla/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Loading: 
Loading: 0 packages loaded
DEBUG: /xla/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /xla/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'llvm-raw' because it already exists.
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/0aaa6e679847a4eeb407136e7b0bcef93ec652e6.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/99fc6ec34cc1b023a837830d266fbbd523a509c3.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 5 packages loaded
    currently loading: xla/python/tpu_driver/client ... (14 packages)
WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/rules_cc/archive/081771d4a0e9d7d3aa0eed2ef389fa4700dfb23e.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Analyzing: 2545 targets (101 packages loaded, 0 targets configured)
Analyzing: 2545 targets (120 packages loaded, 1588 targets configured)
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/openxla/stablehlo/archive/a2c36eb790c5e70109cf3c2b55f43dcdc779727e.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
Analyzing: 2545 targets (156 packages loaded, 1896 targets configured)
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/pybind/pybind11_abseil/archive/2c4932ed6f6204f1656e245838f4f5eae69d2e29.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
Analyzing: 2545 targets (226 packages loaded, 2989 targets configured)
ERROR: /xla/xla/stream_executor/rocm/BUILD:205:11: no such target '@local_config_rocm//rocm:hipfft': target 'hipfft' not declared in package 'rocm' defined by /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/local_config_rocm/rocm/BUILD and referenced by '//xla/stream_executor/rocm:hipfft_if_static'
INFO: Repository com_google_ortools instantiated at:
  /xla/WORKSPACE:19:15: in <toplevel>
  /xla/workspace2.bzl:84:21: in workspace
  /xla/workspace2.bzl:48:20: in _tf_repositories
  /xla/third_party/repo.bzl:136:21: in tf_http_archive
Repository rule _tf_http_archive defined at:
  /xla/third_party/repo.bzl:89:35: in <toplevel>
INFO: Repository com_google_benchmark instantiated at:
  /xla/WORKSPACE:19:15: in <toplevel>
  /xla/workspace2.bzl:70:19: in workspace
  /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/tsl/workspace2.bzl:613:28: in workspace
  /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/tsl/workspace2.bzl:43:14: in _initialize_third_party
  /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/tsl/third_party/benchmark/workspace.bzl:9:20: in repo
  /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/tsl/third_party/repo.bzl:136:21: in tf_http_archive
Repository rule _tf_http_archive defined at:
  /root/.cache/bazel/_bazel_root/e4ab50d61a21943a819d1e092972a817/external/tsl/third_party/repo.bzl:89:35: in <toplevel>
ERROR: Analysis of target '//xla/stream_executor/rocm:hipfft_if_static' failed; build aborted: Analysis failed
INFO: Elapsed time: 139.788s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (228 packages loaded, 9941 targets configured)
FAILED: Build did NOT complete successfully (228 packages loaded, 9941 targets configured)

tpopp commented 1 year ago

Hello,

There are two things:

  1. Are you using the docker containers and configure script as described at https://github.com/openxla/xla/blob/main/docs/developer_guide.md? If you are, then the missing target does seem like a problem.
  2. You did specify rocm, just not in the way you are thinking. You tried to build all targets with //xla/..., and a subset of those targets are rocm targets.

joelberkeley commented 1 year ago

@tpopp

  1. yes
  2. I don't understand. //xla/... is in the developer guide. I'm just confused as to why it's failing on something to do with rocm when the default config doesn't, from what I understand, specify a rocm build

joelberkeley commented 1 year ago

btw I tried without --config=monolithic and it gets a lot further, though I stopped the build after an hour

tpopp commented 1 year ago

Without --config=monolithic might very well succeed because then you are not trying to statically link everything, including the rocm libraries that you don't have. Alternatively, test instead of build might also work for you and be closer to what you are expecting because then it will only build targets necessary for tests and tested binaries that you are trying to run instead of all targets including the rocm ones.

You're expecting that //xla/... will somehow filter based on other configurations, but that is not quite right. //xla/... is going to build every single target. So a configuration might configure a target to not take a dependency on rocm, but that doesn't matter because you are also explicitly requesting that the rocm target be built, regardless of if it is used or not, and even requesting that all dependencies be required by trying to statically link everything.

Why are you using --config=monolithic?
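
A minimal sketch of narrowing the wildcard, assuming the goal is to keep //xla/... but leave out the ROCm package named in the error above. Bazel target patterns can be subtracted with a leading "-", placed after "--" so they are not parsed as flags. Whether excluding only this package is enough is an assumption; other requested targets may still depend on it.

```shell
# Build everything under //xla except the ROCm stream executor package.
# The excluded package is taken from the error message; treat it as a guess.
targets="//xla/... -//xla/stream_executor/rocm/..."
cmd="bazel build --config=monolithic -- $targets"

# Print the command for inspection; run it inside the container once it looks right.
echo "$cmd"
```

Printing the command first keeps the sketch harmless on machines without Bazel; inside the container, the same string can be passed to docker exec as in the original command.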

tpopp commented 1 year ago

It does seem like we should probably add if_rocm_configured guards to the dependencies of the static targets here which would hopefully fix this: https://github.com/openxla/xla/blob/main/xla/stream_executor/rocm/BUILD

They seem to have been added after previous work to make all targets compile regardless of the configuration. I don't think there is a reason to treat the static targets differently in this regard.

joelberkeley commented 1 year ago

> Without --config=monolithic might very well succeed because then you are not trying to statically link everything, including the rocm libraries that you don't have. Alternatively, test instead of build might also work for you and be closer to what you are expecting because then it will only build targets necessary for tests and tested binaries that you are trying to run instead of all targets including the rocm ones.

I'm using XLA in my own project, rather than working on XLA itself, so I don't need to test XLA

> You're expecting that //xla/... will somehow filter based on other configurations, but that is not quite right. //xla/... is going to build every single target. So a configuration might configure a target to not take a dependency on rocm, but that doesn't matter because you are also explicitly requesting that the rocm target be built, regardless of if it is used or not, and even requesting that all dependencies be required by trying to statically link everything.

I'm not expecting anything that specific. I'm trying to figure out the build process and I've not seen it suggested anywhere that //xla/... might not be appropriate. Should that be modified too?

> Why are you using --config=monolithic?

I ideally want a single static library that I can link into my own XLA wrapper to build a single dynamic library. --config=monolithic looked like a possible candidate for that

tpopp commented 1 year ago

I apologize if I came off rude by the way.

So, //xla/... refers to every target under //xla in Bazel, and it is listed in the guide just to show an example command and to try building everything to ensure everything works. The use of --config=monolithic also seems like a good choice.

For your use case, other specific libraries could be targeted, like //xla:c_srcs. I don't think there is a single top-level library for XLA that contains everything, so you would have to figure out which targets/libraries you want for your use case. I might be wrong though and will ask more appropriate people tomorrow.

joelberkeley commented 1 year ago

Ah that's useful info, thanks, and it's all good

joelberkeley commented 1 year ago

I just ran it with //xla/... and without --config=monolithic in github actions and got

 ERROR: /github/home/.cache/bazel/_bazel_root/b2b44df59c1647c561a99e141342b63f/external/llvm-project/mlir/BUILD.bazel:6691:11: Compiling mlir/lib/Analysis/CFGLoopInfo.cpp failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 119 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
In file included from external/llvm-project/mlir/lib/Analysis/CFGLoopInfo.cpp:9:
external/llvm-project/mlir/include/mlir/Analysis/CFGLoopInfo.h:20:10: fatal error: llvm/Analysis/LoopInfo.h: No such file or directory
   20 | #include "llvm/Analysis/LoopInfo.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~

tpopp commented 1 year ago

Unfortunately, that other failure was a temporary breakage that should be fixed now. The linked PR will hopefully fix your issue. It worked in my local testing at least.

tpopp commented 1 year ago

Can you see if your issues are resolved after this commit? https://github.com/openxla/xla/commit/d3978599502f1e6e0e8d9a3fd89adfb42c8bd2fc

joelberkeley commented 1 year ago

It appears to have got past that bug for

bazel build --test_output=all --spawn_strategy=sandboxed --nocheck_visibility //xla/...

in github actions, though it's still going after 3.5 hours so I don't know if or when it will finish

tpopp commented 1 year ago

Hopefully it will now work with --config=monolithic as well. I confirmed that there is no single top level target containing everything, unfortunately. //xla/xla/service/gpu:gpu_compiler, //xla/xla/service/cpu:cpu_compiler, and //xla/xla/runtime:executable might contain the symbols you want though.

Also, the long build times are expected, so definitely set up bazel caching on your github action if you haven't and want faster results.
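
A minimal sketch of the caching suggestion, assuming Bazel's --disk_cache flag and a cache directory that the CI job saves and restores between runs (for example with a cache step in the workflow). The directory path and the BAZEL_DISK_CACHE variable are hypothetical names chosen for illustration.

```shell
# Hypothetical cache location; the CI job must persist it between runs.
cache_dir="${BAZEL_DISK_CACHE:-$HOME/.cache/bazel-disk}"
mkdir -p "$cache_dir"

# --disk_cache lets Bazel reuse action outputs stored in that directory.
cmd="bazel build --disk_cache=$cache_dir --nocheck_visibility //xla/..."

# Print the command for inspection before running it in the workflow.
echo "$cmd"
```

The first run still pays the full build cost; later runs only benefit if the directory actually survives between jobs.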

joelberkeley commented 1 year ago

Thanks.

The github build timed out after 6 hours, so there is no indication whether there were any errors.

It would be really useful to have some kind of guide to how to navigate the build, specifically what to include in my bazel build if I want a set of symbols, especially since builds take so long and each symbol is built by a lot of different bazel targets. I'd understand if this isn't something you have the time to do, but it would be extremely useful for me.

joelberkeley commented 1 year ago

Actually, I should be able to make a fair bit of headway by scanning the BUILD files for the headers I'm using. Are //xla/xla/service/gpu:gpu_compiler, //xla/xla/service/cpu:cpu_compiler, and //xla/xla/runtime:executable mutually exclusive (I'm guessing the CPU/GPU ones are, what about runtime:executable)?
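
A rough sketch of that scan, assuming headers are listed by file name in hdrs or srcs attributes of the BUILD files. The function name is hypothetical, and headers pulled in through glob() would be missed.

```shell
# List BUILD files under ROOT that mention a header's file name in quotes.
# Usage: build_files_for_header HEADER_PATH [ROOT]
build_files_for_header() {
  header_name=$(basename "$1")
  root="${2:-.}"
  # -r: recurse, -l: print matching file names only, -F: literal match.
  grep -rlF --include=BUILD --include=BUILD.bazel "\"$header_name\"" "$root"
}
```

For example, calling `build_files_for_header path/to/some_header.h xla` from the repository root would print the BUILD files under xla that name some_header.h; the package of each file then narrows down which target to build instead of //xla/....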

tpopp commented 1 year ago

I don't understand "mutually exclusive" in this context. They should expose different functionality but probably depend on some of the same underlying libraries. One might build TensorFlow with support for gpu_compiler, cpu_compiler, and executable all at the same time, so they can be used together, assuming they are used in a way that avoids ODR violations from any underlying functionality (I'm not great with linking/loading knowledge, so maybe that's not even a concern).

tpopp commented 1 year ago

I'm closing this as I don't think there are more action items, but please re-open as needed.