Open dulingkang opened 4 months ago
I added some auto-sharding code in the xla/hlo/experimental/auto_sharding folder, but the build fails with an error. The code can be found in this PR. I built with the following commands:
bazel clean --expunge
bazel build --repo_env=HERMETIC_PYTHON_VERSION=3.9 --test_output=all --spawn_strategy=sandboxed //xla/... --python_path=/usr/bin/python3.9 --sandbox_debug
Env:
Python version: Python 3.9
nvcc version:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
clang version:
clang --version
Ubuntu clang version 17.0.6 (++20231208085846+6009708b4367-1~exp1~20231208085949.74)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
The error msg:
ERROR: /home/dss/work/xla-mesha/xla/xla/service/gpu/runtime/BUILD:112:11: Compiling xla/service/gpu/runtime/command_buffer_cmd_emitter.cc failed: (Exit 1): linux-sandbox failed: error executing command (cd /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla && \ exec env - \ LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/ucx/lib:/usr/local/cuda/lib64: \ PATH=/home/dss/venv/py39/bin:/usr/local/openmpi/bin:/usr/local/ucx/bin:/usr/local/cuda/bin:/home/dss/.vscode/extensions/ms-python.python-2024.8.0/python_files/deactivate/bash:/home/dss/venv/py38/bin:/home/dss/.vscode/extensions/ms-python.python-2024.8.0/python_files/deactivate/bash:/home/dss/venv/py38/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \ PWD=/proc/self/cwd \ TF2_BEHAVIOR=1 \ TMPDIR=/tmp \ /home/dss/.cache/bazel/_bazel_dss/install/20da5ab742b8d3d499c34fdafcd3c8b8/linux-sandbox -t 15 -w /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla -w /tmp -w /dev/shm -S /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/stats.out -D -- /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++14' -MD -MF bazel-out/k8-opt/bin/xla/service/gpu/runtime/_objs/command_buffer_cmd_emitter/command_buffer_cmd_emitter.d '-frandom-seed=bazel-out/k8-opt/bin/xla/service/gpu/runtime/_objs/command_buffer_cmd_emitter/command_buffer_cmd_emitter.o' '-DEIGEN_MAX_ALIGN_BYTES=64' -DEIGEN_ALLOW_UNALIGNED_SCALARS '-DEIGEN_USE_AVX512_GEMM_KERNELS=0' -DHAVE_SYS_UIO_H -DTF_USE_SNAPPY '-DLLVM_ON_UNIX=1' '-DHAVE_BACKTRACE=1' '-DBACKTRACE_HEADER=<execinfo.h>' '-DLTDL_SHLIB_EXT=".so"' '-DLLVM_PLUGIN_EXT=".so"' '-DLLVM_ENABLE_THREADS=1' '-DHAVE_DEREGISTER_FRAME=1' 
'-DHAVE_LIBPTHREAD=1' '-DHAVE_PTHREAD_GETNAME_NP=1' '-DHAVE_PTHREAD_H=1' '-DHAVE_PTHREAD_SETNAME_NP=1' '-DHAVE_REGISTER_FRAME=1' '-DHAVE_SETENV_R=1' '-DHAVE_STRERROR_R=1' '-DHAVE_SYSEXITS_H=1' '-DHAVE_UNISTD_H=1' -D_GNU_SOURCE '-DHAVE_LINK_H=1' '-DHAVE_MALLINFO=1' '-DHAVE_SBRK=1' '-DHAVE_STRUCT_STAT_ST_MTIM_TV_NSEC=1' -DHAVE_BUILTIN_THREAD_POINTER '-DLLVM_NATIVE_ARCH="X86"' '-DLLVM_NATIVE_ASMPARSER=LLVMInitializeX86AsmParser' '-DLLVM_NATIVE_ASMPRINTER=LLVMInitializeX86AsmPrinter' '-DLLVM_NATIVE_DISASSEMBLER=LLVMInitializeX86Disassembler' '-DLLVM_NATIVE_TARGET=LLVMInitializeX86Target' '-DLLVM_NATIVE_TARGETINFO=LLVMInitializeX86TargetInfo' '-DLLVM_NATIVE_TARGETMC=LLVMInitializeX86TargetMC' '-DLLVM_NATIVE_TARGETMCA=LLVMInitializeX86TargetMCA' '-DLLVM_HOST_TRIPLE="x86_64-unknown-linux-gnu"' '-DLLVM_DEFAULT_TARGET_TRIPLE="x86_64-unknown-linux-gnu"' '-DLLVM_VERSION_MAJOR=19' '-DLLVM_VERSION_MINOR=0' '-DLLVM_VERSION_PATCH=0' '-DLLVM_VERSION_STRING="19.0.0git"' -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS '-DLLVM_HAS_AArch64_TARGET=1' '-DLLVM_HAS_AMDGPU_TARGET=1' '-DLLVM_HAS_ARM_TARGET=1' '-DLLVM_HAS_NVPTX_TARGET=1' '-DLLVM_HAS_PowerPC_TARGET=1' '-DLLVM_HAS_RISCV_TARGET=1' '-DLLVM_HAS_SystemZ_TARGET=1' '-DLLVM_HAS_X86_TARGET=1' '-DBLAKE3_USE_NEON=0' -DBLAKE3_NO_AVX2 -DBLAKE3_NO_AVX512 -DBLAKE3_NO_SSE2 -DBLAKE3_NO_SSE41 '-DBAZEL_CURRENT_REPOSITORY=""' -iquote . 
-iquote bazel-out/k8-opt/bin -iquote external/com_google_absl -iquote bazel-out/k8-opt/bin/external/com_google_absl -iquote external/tsl -iquote bazel-out/k8-opt/bin/external/tsl -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/ml_dtypes -iquote bazel-out/k8-opt/bin/external/ml_dtypes -iquote external/nsync -iquote bazel-out/k8-opt/bin/external/nsync -iquote external/double_conversion -iquote bazel-out/k8-opt/bin/external/double_conversion -iquote external/com_google_protobuf -iquote bazel-out/k8-opt/bin/external/com_google_protobuf -iquote external/snappy -iquote bazel-out/k8-opt/bin/external/snappy -iquote external/com_googlesource_code_re2 -iquote bazel-out/k8-opt/bin/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/k8-opt/bin/external/farmhash_archive -iquote external/zlib -iquote bazel-out/k8-opt/bin/external/zlib -iquote external/llvm-project -iquote bazel-out/k8-opt/bin/external/llvm-project -iquote external/stablehlo -iquote bazel-out/k8-opt/bin/external/stablehlo -iquote external/local_config_cuda -iquote bazel-out/k8-opt/bin/external/local_config_cuda -Ibazel-out/k8-opt/bin/external/ml_dtypes/_virtual_includes/float8 -Ibazel-out/k8-opt/bin/external/ml_dtypes/_virtual_includes/intn -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/ArithCanonicalizationIncGen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/AsmParserTokenKinds -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/mlir_hlo -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/canonicalize_inc_gen -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/convert_op_folder -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_attrs_inc_gen -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_common -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_enums_inc_gen -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_inc_gen 
-Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_pattern_inc_gen -Ibazel-out/k8-opt/bin/xla/mlir_hlo/_virtual_includes/hlo_ops_typedefs_inc_gen -Ibazel-out/k8-opt/bin/external/llvm-project/mlir/_virtual_includes/MLIRShapeCanonicalizationIncGen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/base -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/base_attr_interfaces_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/broadcast_utils -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/chlo_ops -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/chlo_attrs_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/chlo_enums_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/chlo_ops_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_type_inference -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_assembly_format -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_ops -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_attrs_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_enums_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_ops_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/stablehlo_types_inc_gen -Ibazel-out/k8-opt/bin/external/stablehlo/_virtual_includes/version -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers_virtual -isystem external/eigen_archive -isystem bazel-out/k8-opt/bin/external/eigen_archive -isystem external/eigen_archive/mkl_include -isystem bazel-out/k8-opt/bin/external/eigen_archive/mkl_include -isystem external/ml_dtypes -isystem bazel-out/k8-opt/bin/external/ml_dtypes -isystem external/ml_dtypes/ml_dtypes -isystem bazel-out/k8-opt/bin/external/ml_dtypes/ml_dtypes -isystem external/nsync/public -isystem bazel-out/k8-opt/bin/external/nsync/public -isystem 
external/com_google_protobuf/src -isystem bazel-out/k8-opt/bin/external/com_google_protobuf/src -isystem external/farmhash_archive/src -isystem bazel-out/k8-opt/bin/external/farmhash_archive/src -isystem external/zlib -isystem bazel-out/k8-opt/bin/external/zlib -isystem external/llvm-project/llvm/include -isystem bazel-out/k8-opt/bin/external/llvm-project/llvm/include -isystem external/llvm-project/mlir/include -isystem bazel-out/k8-opt/bin/external/llvm-project/mlir/include -isystem external/local_config_cuda/cuda -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda -isystem external/local_config_cuda/cuda/cuda/include -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda/cuda/include -Wno-all -Wno-extra -Wno-deprecated -Wno-deprecated-declarations -Wno-ignored-attributes -Wno-array-bounds -Wunused-result '-Werror=unused-result' -Wswitch '-Werror=switch' '-Wno-error=unused-but-set-variable' -DAUTOLOAD_DYNAMIC_KERNELS '-std=c++17' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c xla/service/gpu/runtime/command_buffer_cmd_emitter.cc -o bazel-out/k8-opt/bin/xla/service/gpu/runtime/_objs/command_buffer_cmd_emitter/command_buffer_cmd_emitter.o) 1718167059.348649823: src/main/tools/linux-sandbox.cc:152: calling pipe(2)... 1718167059.348690614: src/main/tools/linux-sandbox.cc:171: calling clone(2)... 
1718167059.348957863: src/main/tools/linux-sandbox.cc:180: linux-sandbox-pid1 has PID 544152 1718167059.349011563: src/main/tools/linux-sandbox-pid1.cc:681: Pid1Main started 1718167059.349068654: src/main/tools/linux-sandbox.cc:197: done manipulating pipes 1718167059.349224017: src/main/tools/linux-sandbox-pid1.cc:285: working dir: /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla 1718167059.349242383: src/main/tools/linux-sandbox-pid1.cc:320: writable: /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla 1718167059.349247351: src/main/tools/linux-sandbox-pid1.cc:320: writable: /tmp 1718167059.349252995: src/main/tools/linux-sandbox-pid1.cc:320: writable: /dev/shm 1718167059.349314279: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: / 1718167059.349321272: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /dev 1718167059.349325280: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /dev/pts 1718167059.349328638: src/main/tools/linux-sandbox-pid1.cc:400: remount rw: /dev/shm 1718167059.349331950: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /dev/mqueue 1718167059.349335225: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /dev/hugepages 1718167059.349338941: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run 1718167059.349342823: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run/lock 1718167059.349346113: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run/user/125 1718167059.349349960: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run/user/1000 1718167059.349354204: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run/user/1000/gvfs 1718167059.349357892: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /run/user/1000/doc 1718167059.349361465: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys 1718167059.349388826: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: 
/sys/kernel/security 1718167059.349394438: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup 1718167059.349398851: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/unified 1718167059.349402873: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/systemd 1718167059.349406058: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/hugetlb 1718167059.349409167: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/perf_event 1718167059.349412619: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/devices 1718167059.349415630: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/cpu,cpuacct 1718167059.349418849: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/cpuset 1718167059.349422056: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/memory 1718167059.349426119: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/net_cls,net_prio 1718167059.349429886: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/pids 1718167059.349457334: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/rdma 1718167059.349461439: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/misc 1718167059.349464889: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/freezer 1718167059.349469010: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/cgroup/blkio 1718167059.349472636: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/pstore 1718167059.349476298: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/firmware/efi/efivars 1718167059.349482547: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/bpf 1718167059.349485768: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/kernel/debug 1718167059.349490061: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/kernel/tracing 1718167059.349499262: 
src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/fs/fuse/connections 1718167059.349504425: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /sys/kernel/config 1718167059.349507777: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /proc 1718167059.349511436: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /proc/sys/fs/binfmt_misc 1718167059.349518871: src/main/tools/linux-sandbox-pid1.cc:422: remount(nullptr, /proc/sys/fs/binfmt_misc, nullptr, 2101281, nullptr) failure (Operation not permitted) ignored 1718167059.349548339: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /proc/sys/fs/binfmt_misc 1718167059.349554331: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/core20/1611 1718167059.349558542: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/gnome-3-38-2004/143 1718167059.349562293: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/core22/1380 1718167059.349565282: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/bare/5 1718167059.349568417: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/gnome-42-2204/176 1718167059.349571436: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/snap-store/1113 1718167059.349574650: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/core20/2318 1718167059.349577818: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/gtk-common-themes/1535 1718167059.349581126: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/gnome-3-38-2004/115 1718167059.349584005: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/snap-store/558 1718167059.349587215: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/snapd/21465 1718167059.349590259: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /snap/snapd/21759 1718167059.349601449: src/main/tools/linux-sandbox-pid1.cc:400: remount ro: /boot/efi 1718167059.349607019: src/main/tools/linux-sandbox-pid1.cc:400: remount rw: 
/home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla 1718167059.349612795: src/main/tools/linux-sandbox-pid1.cc:400: remount rw: /home/dss/.cache/bazel/_bazel_dss/4bc65a6054bf39baecdeb1fdf79bd001/sandbox/linux-sandbox/21666/execroot/xla 1718167059.349616337: src/main/tools/linux-sandbox-pid1.cc:400: remount rw: /tmp 1718167059.349620021: src/main/tools/linux-sandbox-pid1.cc:400: remount rw: /dev/shm 1718167059.349661765: src/main/tools/linux-sandbox-pid1.cc:491: calling fork... 1718167059.349811712: src/main/tools/linux-sandbox-pid1.cc:521: child started with PID 2
xla/service/gpu/runtime/command_buffer_cmd_emitter.cc:32:10: fatal error: xla/service/gpu/runtime/gpublas_lt_matmul_thunk.h: No such file or directory
   32 | #include "xla/service/gpu/runtime/gpublas_lt_matmul_thunk.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
1718167060.577473951: src/main/tools/linux-sandbox-pid1.cc:538: wait returned pid=2, status=0x100
1718167060.577493406: src/main/tools/linux-sandbox-pid1.cc:556: child exited normally with code 1
1718167060.578001071: src/main/tools/linux-sandbox.cc:233: child exited normally with code 1
INFO: Elapsed time: 194.254s, Critical Path: 41.62s
INFO: 345 processes: 53 internal, 292 linux-sandbox.
FAILED: Build did NOT complete successfully
Did you figure out what the issue was? I'm seeing the same error in a vanilla build on the main branch.
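One way to narrow this down, offered as a sketch and not a confirmed diagnosis: with --spawn_strategy=sandboxed, a "No such file or directory" for a header often means either that the file is not in the checkout, or that it exists but is not declared by any dependency of the failing target, so it is never copied into the sandbox. The helper names below (header_status, find_declaring_targets) are hypothetical and only for illustration. The second function relies on the bazel query attr() function and assumes bazel is on PATH and that you run it from the XLA repository root.

```shell
#!/bin/sh
# Hypothetical helper: report whether a header exists in a source tree.
# $1 = repository root, $2 = header path relative to that root
header_status() {
  if [ -f "$1/$2" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# Hypothetical helper: list targets in a package whose hdrs attribute
# mentions the given file name. Assumes bazel is installed; it is only
# defined here, not called.
# $1 = package label (e.g. //xla/service/gpu/runtime), $2 = header file name
find_declaring_targets() {
  bazel query "attr(hdrs, '$2', $1:all)"
}
```

For the error above, that would be header_status . xla/service/gpu/runtime/gpublas_lt_matmul_thunk.h followed by find_declaring_targets //xla/service/gpu/runtime gpublas_lt_matmul_thunk.h. If the first prints "present" and the second finds a target, a likely next step is to check whether command_buffer_cmd_emitter in xla/service/gpu/runtime/BUILD depends on that target.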