jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

Building on linux with ppc64le CPU #4493

Open f0uriest opened 4 years ago

f0uriest commented 4 years ago

I'm trying to build jax on a cluster that uses IBM power9 processors (it's a sister cluster to Summit at ORNL). It seems to be failing when trying to build XLA, which is strange because I've been able to install tensorflow just fine. The full output log is here: https://gist.github.com/f0uriest/5f04e2ed9916bb750a9ea679633ac80c

Any ideas? Is there any plan to offer pre-built wheels for the ppc64le architecture?

hawkinsp commented 4 years ago

We don't support the PPC architecture ourselves and most likely don't have the engineering bandwidth to maintain such a build.

But we wouldn't object if the community wanted to support ppc64le. There are likely two pieces:

Contributions welcome!

hawkinsp commented 4 years ago

As to your specific question, I'd make sure that MKLDNN is disabled in the build (I believe there is an option to build.py for this.) I doubt MKLDNN works on non-Intel architectures.

feifzhou commented 3 years ago

I am working on the same issue. I managed to reach this point

https://gist.github.com/feifzhou/152d5c6e15e3485befa78e69cd340c32#file-gistfile1-txt

But got

  1. Warning about 404 ERROR while downloading a .gz file
  2. "failed: undeclared inclusion(s)" errors.

I tried both gcc 7.3.1 and 8.3.1 and got the same inclusion errors. gcc 4.9.3 gave me lots of syntax errors. MKLDNN was disabled.

hawkinsp commented 3 years ago

@feifzhou I don't know how to solve your issue, but it looks to me like Bazel isn't understanding something about the location of the standard library headers on your system. Do you have the same problem if you try to build TensorFlow? We share a lot of build infrastructure with them, so I'm wondering if this is JAX specific or a more general Bazel/TF problem.

(Ultimately we don't have cycles to work on this, but we welcome contributions!)

mrorro commented 3 years ago

@f0uriest I've built v0.1.55 successfully on an IBM POWER9, but more recent versions fail in the same way.

f0uriest commented 3 years ago

@mrorro Yeah I haven't been able to build any version since 0.1.55 either. It looks like at some point they switched some of the compiler flags to ones that are only defined for x86-64 architectures. Bazel supposedly lets you override these but I haven't gotten it to work yet.

hawkinsp commented 3 years ago

@f0uriest If you can share the output of the build, we might be able to suggest things to change.

I'd speculate there are two or three things you'd need to do:

  a) update build.py to pass the correct flags, if it isn't already doing so.
  b) make sure XLA links in the Power LLVM backend if targeting Power. There are already cases for x86 and ARM; I don't recall if Power is included.
  c) add a Power case to build_wheel.py.

proutrc commented 3 years ago

@f0uriest @mrorro

Did you all happen to make progress with this issue? We are looking to build JAX on Summit and I happened upon this issue/discussion.

feifzhou commented 3 years ago

Nothing so far. Would love to see if you can solve it on Summit.

f0uriest commented 3 years ago

I also haven't made any progress, but I haven't had much time to work on it either. We've been using 0.1.55 for a while, though I'm hoping to upgrade later this summer.

asedova commented 3 years ago

@hawkinsp on Summit, after fixing compiler flags, we are also getting the download error:

WARNING: Download from http://mirror.tensorflow.org/github.com/tensorflow/runtime/archive/d29d1ef0a65a8f9c23e1f88067ce4205d3085e87.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found

hawkinsp commented 3 years ago

@asedova That is a benign warning, you can ignore it.

hawkinsp commented 3 years ago

I was able to cross-compile a ppc64le wheel on an x86-64 machine by following the instructions in #7365. I can't easily test the resulting wheel, though.

I would imagine that building natively on a ppc64le machine requires nothing other than following the standard instructions once the changes in #7365 are merged.

asedova commented 3 years ago

Thanks

asedova commented 3 years ago

Thanks @hawkinsp we are eagerly awaiting this merge

hawkinsp commented 3 years ago

One thing I'd like to double check: what does:

import platform
print(platform.machine())

print on your PPC machine?

And is it little endian?
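
Both questions can be answered in one go with the standard library. The following is a minimal sketch; the helper name `describe_host` is hypothetical and not part of JAX.

```python
import platform
import sys


def describe_host():
    """Return the machine string and byte order of the running interpreter."""
    return {
        "machine": platform.machine(),  # architecture string reported by the OS
        "byteorder": sys.byteorder,     # "little" or "big"
    }


if __name__ == "__main__":
    info = describe_host()
    print(info["machine"], info["byteorder"])
```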

f0uriest commented 3 years ago

On my system I get

>>> import platform
>>> print(platform.machine())
ppc64le

It is little-endian

asedova commented 3 years ago

yes we are LE also

asedova commented 3 years ago

@f0uriest you guys are on Sierra?

feifzhou commented 3 years ago

On Lassen

f0uriest commented 3 years ago

@asedova Traverse at PPPL

hawkinsp commented 3 years ago

#7365 is merged. Please try building jaxlib from git head. If it doesn't work, please post logs so I can try to debug it.

I could also share the cross-compiled wheel I made for Python 3.9; but I have no idea if it actually works. So it's probably best if you make sure it builds for you.

proutrc commented 3 years ago

@hawkinsp Here is what I see on initial attempt: summit_jaxlib.log

Just for the record, I have tried various versions of GCC (6.4.0, 7.4.0, 8.1.1).

Notable errors:

gcc: error: unrecognized command line option '-std=c++14'

ERROR: /tmp/_bazel_rprout/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_google_protobuf/BUILD:155:11: Compiling src/google/protobuf/util/internal/datapiece.cc [for host] failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (cd /tmp/_baz

This second error will point at different source files for different runs it seems.

@asedova do you see anything different?

hawkinsp commented 3 years ago

You need a C++14 compiler to build JAX.

Something seems surprising here, though. gcc 6.1 and newer apparently support C++14: https://gcc.gnu.org/projects/cxx-status.html#cxx14

Note the documentation is quite clear that -std=c++14 is a flag gcc accepts! So this seems like something you need to figure out about your gcc installation.
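
One way to check which compiler is actually on PATH, and whether it accepts the flag, is sketched below. This is only an illustration: the helper name `accepts_cxx14` is hypothetical, and it assumes a GCC-style command line.

```python
import shutil
import subprocess


def accepts_cxx14(compiler="g++"):
    """Return None if the compiler is not on PATH, else whether it accepts -std=c++14."""
    path = shutil.which(compiler)
    if path is None:
        return None  # e.g. the compiler module was never loaded
    # Syntax-check an empty C++ translation unit read from stdin.
    proc = subprocess.run(
        [path, "-std=c++14", "-x", "c++", "-fsyntax-only", "-"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0


if __name__ == "__main__":
    for name in ("gcc", "g++"):
        print(name, shutil.which(name), accepts_cxx14(name))
```

A result of None means no such compiler was found on PATH at all, which is a different problem from a compiler that rejects the flag.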

proutrc commented 3 years ago

@hawkinsp Apologies, I could have goofed that one actually. I thought I had GCC loaded...

Here is a run with GCC 7.4.0: summit_jaxlib-gcc7.4.0.log

hawkinsp commented 3 years ago

@proutrc The issue here is that bazel hermeticity checking is upset that you appear to be reading header files outside what it considers to be the standard system paths.

I think your best fix here might be to write a small custom Bazel toolchain. As it happens, I show an example of how to do that in a comment in #7365. It's not that bad, you should be able to just modify my example. You would need to modify cxx_builtin_include_directories to include that header directory, and you'd need to change the other tool paths to point to the right places on your system.

In your case, you'd want to set host_crosstool_top to the same toolchain as crosstool_top.
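
To collect candidate entries for cxx_builtin_include_directories, one option is to ask the compiler itself which directories it searches. The sketch below assumes gcc prints its usual "#include <...> search starts here:" list on stderr when run with -v; the function names are hypothetical. It emits each directory both as printed and with symlinks resolved. Whether listing both forms is enough to satisfy Bazel's check is not something this thread confirms.

```python
import os
import subprocess


def parse_search_dirs(verbose_output):
    """Extract the '#include <...>' search directories from `gcc -v` output."""
    dirs, in_list = [], False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            break
        elif in_list and line.strip():
            dirs.append(line.strip())
    return dirs


def with_realpaths(dirs):
    """Return each directory plus its symlink-resolved form, de-duplicated."""
    out = []
    for d in dirs:
        for candidate in (d, os.path.realpath(d)):
            if candidate not in out:
                out.append(candidate)
    return out


def gcc_builtin_include_dirs(gcc="gcc"):
    """Ask the given gcc which directories it searches for C++ system headers."""
    proc = subprocess.run(
        [gcc, "-E", "-x", "c++", "-", "-v"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return with_realpaths(parse_search_dirs(proc.stderr))


if __name__ == "__main__":
    # Print lines that can be pasted into a Starlark list.
    for d in gcc_builtin_include_dirs():
        print('"%s",' % d)
```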

proutrc commented 3 years ago

@hawkinsp sorry for my ignorance, but is the mentioned toolchain directory at the top of the jax repo or in the build directory? My familiarity with Bazel and its setup is limited, unfortunately. I am happy to work on this though; I just want to make sure I am set up properly.

hawkinsp commented 3 years ago

@proutrc In the example I gave, it's at the root of the JAX repository. (It doesn't matter a whole lot, so long as all the paths agree, and in my command line, etc. I refer to it as //toolchain, which is at the root of the repository.)

proutrc commented 3 years ago

@hawkinsp

I seem to still run into similar issues. Is there anything else I am missing, besides an update to those paths for the tools and the cxx_builtin_include_directories? I am also putting the realpath in the cxx_builtin_include_directories list, but I see it has the non-realpath in the error output. Sometimes it does have the realpath though, oddly. I appreciate your help.

def _impl(ctx):
   return cc_common.create_cc_toolchain_config_info(
       ctx = ctx,
       features = features, # NEW
       cxx_builtin_include_directories = [
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/include/",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/",
          "/usr/include/",
       ],

Error (it does seem to get further sometimes):

[0 / 31] [Prepa] Creating source manifest for //build:build_wheel ... (5 actions, 0 running)
[68 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 2s local ... (128 actions running)
[76 / 529] Compiling src/google/protobuf/generated_enum_util.cc [for host]; 5s local ... (128 actions running)
[83 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 9s local ... (128 actions running)
[89 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 13s local ... (128 actions running)
[98 / 529] Compiling src/google/protobuf/extension_set.cc [for host]; 18s local ... (128 actions, 127 running)
ERROR: /gpfs/alpine/stf007/scratch/rprout/jax/jaxlib/BUILD:352:17: Compiling jaxlib/cpu_feature_guard.c failed: undeclared inclusion(s) in rule '//jaxlib:cpu_feature_guard.so':
this rule is missing dependency declarations for the following files included by 'jaxlib/cpu_feature_guard.c':
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/limits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/syslimits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stddef.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdarg.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdint.h'
Target //build:build_wheel failed to build
INFO: Elapsed time: 40.420s, Critical Path: 22.55s
INFO: 333 processes: 241 internal, 92 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
b''
Traceback (most recent call last):
  File "build/build.py", line 604, in <module>
    main()
  File "build/build.py", line 599, in main
    shell(command)
  File "build/build.py", line 52, in shell
    output = subprocess.check_output(cmd)
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sw/.testing/belhorn/summit/bin/bazel', 'run', '--verbose_failures=true', '--host_crosstool_top=//toolchain:ppc', '--crosstool_top=//toolchain:ppc', '--config=short_logs', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/gpfs/alpine/stf007/scratch/rprout/jax/dist', '--cpu=ppc64le']' returned non-zero exit status 1.

hawkinsp commented 3 years ago

@proutrc Try with --bazel_options=--cpu=ppc.

proutrc commented 3 years ago

@hawkinsp

INFO: Found 1 target...
[0 / 68] [Prepa] Writing file jaxlib/lapack.so-2.params
ERROR: /tmp/_bazel_rprout/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_google_absl/absl/base/BUILD.bazel:596:11: Compiling absl/base/internal/exponential_biased.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/base:exponential_biased':
this rule is missing dependency declarations for the following files included by 'absl/base/internal/exponential_biased.cc':
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdint.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/limits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/syslimits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstddef'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/c++config.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/os_defines.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/cpu_defines.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stddef.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ciso646'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cassert'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/algorithm'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/utility'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_relops.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_pair.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/move.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/concept_check.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/type_traits'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/initializer_list'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_algobase.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/functexcept.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception_defines.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/cpp_type_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/type_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/numeric_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator_base_types.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator_base_funcs.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/debug/assertions.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/ptr_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/debug/debug.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/predefined_ops.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_algo.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstdlib'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/std_abs.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/algorithmfwd.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_heap.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_tempbuf.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_construct.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/new'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/exception'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception_ptr.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/cxxabi_init_exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/typeinfo'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/hash_bytes.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/nested_exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/alloc_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/alloc_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/memoryfwd.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/uniform_int_dist.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/limits'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/atomic'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/atomic_base.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/atomic_lockfree_defines.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cmath'
Target //build:build_wheel failed to build
INFO: Elapsed time: 19.438s, Critical Path: 1.74s
INFO: 245 processes: 215 internal, 30 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
b''
Traceback (most recent call last):
  File "build/build.py", line 604, in <module>
    main()
  File "build/build.py", line 599, in main
    shell(command)
  File "build/build.py", line 52, in shell
    output = subprocess.check_output(cmd)
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sw/.testing/belhorn/summit/bin/bazel', 'run', '--verbose_failures=true', '--host_crosstool_top=//toolchain:ppc', '--crosstool_top=//toolchain:ppc', '--cpu=ppc', '--config=short_logs', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/gpfs/alpine/stf007/scratch/rprout/jax/dist', '--cpu=ppc64le']' returned non-zero exit status 1.

hawkinsp commented 3 years ago

Try putting the /sw/... paths in the cxx_builtin_include_directories. Or try putting both.

proutrc commented 3 years ago

Unfortunately, I have tried all of those as well:

Error:

[0 / 35] [Prepa] Creating source manifest for //build:build_wheel
[64 / 542] Compiling src/google/protobuf/any_lite.cc [for host]; 3s local ... (127 actions, 126 running)
[80 / 542] Compiling src/google/protobuf/extension_set.cc [for host]; 6s local ... (127 actions running)
[89 / 542] Compiling src/google/protobuf/extension_set.cc [for host]; 9s local ... (128 actions running)
[100 / 544] Compiling src/google/protobuf/extension_set.cc [for host]; 13s local ... (127 actions running)
[157 / 614] Compiling src/google/protobuf/extension_set.cc [for host]; 17s local ... (128 actions running)
[197 / 716] Compiling src/google/protobuf/extension_set.cc [for host]; 22s local ... (128 actions running)
[239 / 716] Compiling src/google/protobuf/compiler/cpp/cpp_helpers.cc [for host]; 27s local ... (128 actions running)
ERROR: /tmp/_bazel_rprout/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_google_absl/absl/time/internal/cctz/BUILD.bazel:53:11: Compiling absl/time/internal/cctz/src/time_zone_posix.cc failed: undeclared inclusion(s) in rule '@com_google_absl//absl/time/internal/cctz:time_zone':
this rule is missing dependency declarations for the following files included by 'absl/time/internal/cctz/src/time_zone_posix.cc':
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstdint'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/c++config.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/os_defines.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/cpu_defines.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdint.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/string'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stringfwd.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/memoryfwd.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/char_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_algobase.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/functexcept.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception_defines.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/cpp_type_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/type_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/numeric_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_pair.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/move.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/concept_check.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/type_traits'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator_base_types.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator_base_funcs.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/debug/assertions.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_iterator.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/ptr_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/debug/debug.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/predefined_ops.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/postypes.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cwchar'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stdarg.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include/stddef.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/allocator.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/c++allocator.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/new_allocator.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/new'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/exception'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/exception_ptr.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/cxxabi_init_exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/typeinfo'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/hash_bytes.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/nested_exception.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/localefwd.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/c++locale.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/clocale'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/iosfwd'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cctype'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/ostream_insert.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/cxxabi_forced.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/stl_function.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/backward/binders.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/range_access.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/initializer_list'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/basic_string.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/atomicity.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/gthr.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/gthr-default.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/powerpc64le-none-linux-gnu/bits/atomic_word.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/alloc_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/alloc_traits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ext/string_conversions.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstdlib'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/std_abs.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstdio'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cerrno'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/functional_hash.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/bits/basic_string.tcc'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/limits.h'
  '/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed/syslimits.h'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstddef'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/ciso646'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/cstring'
  '/sw/summit/gcc/7.4.0/include/c++/7.4.0/limits'
Target //build:build_wheel failed to build
INFO: Elapsed time: 48.498s, Critical Path: 33.11s
INFO: 376 processes: 221 internal, 155 local.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully
b''
Traceback (most recent call last):
  File "build/build.py", line 604, in <module>
    main()
  File "build/build.py", line 599, in main
    shell(command)
  File "build/build.py", line 52, in shell
    output = subprocess.check_output(cmd)
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/sw/summit/python/3.7/anaconda3/5.3.0/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sw/.testing/belhorn/summit/bin/bazel', 'run', '--verbose_failures=true', '--host_crosstool_top=//toolchain:ppc', '--crosstool_top=//toolchain:ppc', '--cpu=ppc', '--config=short_logs', '--config=cuda', '--define=xla_python_enable_gpu=true', ':build_wheel', '--', '--output_path=/gpfs/alpine/stf007/scratch/rprout/jax/dist', '--cpu=ppc64le']' returned non-zero exit status 1.

cxx_builtin_include_directories list:

def _impl(ctx):
   return cc_common.create_cc_toolchain_config_info(
       ctx = ctx,
       features = features, # NEW
       cxx_builtin_include_directories = [
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include",
          "/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/include",
          "/sw/summit/gcc/7.4.0/include",
          "/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed",
          "/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed",
          "/usr/include",
       ],

proutrc commented 3 years ago

@hawkinsp perhaps related to this?

https://github.com/bazelbuild/bazel/issues/9451

asedova commented 3 years ago

@f0uriest you said you got a previous version to build?

feifzhou commented 3 years ago

Not me..

proutrc commented 3 years ago

@hawkinsp it looks like I was able to get around the undeclared inclusion(s) by adding this to a CROSSTOOL file, in the toolchain/ directory. This is in addition to the cc_toolchain_config.bzl and BUILD file we altered from your example.

[rprout@login1.summit jax]$ ls toolchain/
BUILD  CROSSTOOL  cc_toolchain_config.bzl
[rprout@login1.summit jax]$ cat toolchain/CROSSTOOL 
compiler_flag: "-isystem"
compiler_flag: "/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed"
compiler_flag: "-isystem"
compiler_flag: "/sw/summit/gcc/7.4.0/include/c++/7.4.0"
compiler_flag: "-isystem"
compiler_flag: "/sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include"

Here we are now: summit-jaxlib.log

proutrc commented 3 years ago

It actually looks like this CROSSTOOL file addition is not the key. The real key seems to be throttling Bazel with this addition to my ~/.bazelrc file:

build --jobs 2 --local_ram_resources=HOST_RAM*0.04 
test --jobs 2

It looks like you want to throttle Bazel if you use NFS at all. Summit's provided software tree (/sw) is on NFS. I had forgotten I added the above throttling in all this. As soon as I removed it though, from my ~/.bazelrc file, the undeclared inclusion(s) came back.

@feifzhou You should try the throttling method above, by adding that to your ~/.bazelrc. Perhaps you will then also get past the undeclared inclusion(s). Maybe you have something on NFS?
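
To find out whether a compiler or cache directory lives on NFS, one option on Linux is to look the resolved path up in /proc/mounts. This is a minimal sketch under that assumption; the function names are hypothetical, and the sample mount table in the usage note is made up for illustration.

```python
import os


def parse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, entries):
    """Return the filesystem type of the longest mount point containing path."""
    best_point, best_type = "", None
    for point, fs_type in entries:
        prefix = point.rstrip("/") + "/"
        # On ties the later entry wins, since later mounts shadow earlier ones.
        if (path == point or path.startswith(prefix)) and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def is_nfs(path, mounts_file="/proc/mounts"):
    """True if the symlink-resolved path lives on an nfs or nfs4 mount (Linux only)."""
    with open(mounts_file) as f:
        entries = parse_mounts(f.read())
    return fs_type_for(os.path.realpath(path), entries) in ("nfs", "nfs4")
```

For example, with a hypothetical mount table containing `server:/export/sw /autofs/nccs-svm1_sw nfs rw 0 0`, `fs_type_for("/autofs/nccs-svm1_sw/summit/gcc", entries)` returns "nfs". Resolving symlinks first matters here because a path such as /sw may be a link into the NFS mount.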

In the end, we now seem to be in a similar boat as @f0uriest. Our log now similarly points at Eigen. I will try a different GCC next (maybe some additional flags?).

hawkinsp commented 3 years ago

@proutrc I'm wondering if that's related to building with gcc 7. See this similar looking issue for PyTorch: https://github.com/pytorch/pytorch/pull/50640 Try 8.1, since you have it?

You might be able to work around the problem by sending a pull request to Eigen that adds similar fallback definitions of the missing vector intrinsics.

proutrc commented 3 years ago

@hawkinsp I did indeed go that route, using 8.1.1 over the weekend and this morning.

Oddly, the undeclared inclusion(s) comes back! Seemingly further though:

https://gist.github.com/proutrc/d4bc637555d3624d8aa4ccf6a65f348f

hawkinsp commented 3 years ago

@proutrc Can you share the toolchain .bzl and BUILD files you are using in another gist?

proutrc commented 3 years ago

@hawkinsp

BUILD: https://gist.github.com/proutrc/ba20b7f5d2c4b6ae7e26cfe4afdea44e
cc_toolchain: https://gist.github.com/proutrc/4cb2bd0a3a4f804243e8134295079906

hawkinsp commented 3 years ago

I'd include both the /autofs and /sw paths just to make sure it doesn't help. (Including more paths should only help, not hurt.)

Beyond that I might try adding the compiler flags in the Bazel issue you link above.

Another suggestion is you might try clearing any bazel caches (https://stackoverflow.com/questions/43921911/how-to-resolve-bazel-undeclared-inclusions-error/48513577#48513577). Deleting ~/.cache/bazel would probably work.

proutrc commented 3 years ago

Thanks @hawkinsp, I will play with this more. I have added both /sw and /autofs paths before, but am going to try that again and will report back.

I have been setting these for cache, etc.. then clearing them every run (off NFS):

startup --output_user_root=/gpfs/alpine/stf007/scratch/rprout/bazel-build-cache/user-root
build --disk_cache=/gpfs/alpine/stf007/scratch/rprout/bazel-cache/
export TEST_TMPDIR=/gpfs/alpine/stf007/scratch/rprout/bazel-tmp/

In addition to running /gpfs/alpine/stf007/scratch/rprout/bazel-4.1.0/output/bazel clean --expunge

Maybe I haven't found the right combo of things yet, not sure. But, Bazel definitely seems finicky about NFS.

hawkinsp commented 3 years ago

Is it possible to use non-NFS temporary and cache directories? I don't know if it will help, but it might.

proutrc commented 3 years ago

I think that is what I am doing with these settings:

startup --output_user_root=/gpfs/alpine/stf007/scratch/rprout/bazel-build-cache/user-root
build --disk_cache=/gpfs/alpine/stf007/scratch/rprout/bazel-cache/
export TEST_TMPDIR=/gpfs/alpine/stf007/scratch/rprout/bazel-tmp/

proutrc commented 3 years ago

@hawkinsp I can confirm I don't cache anything on NFS.

I also added the /sw and /autofs paths:

def _impl(ctx):
   return cc_common.create_cc_toolchain_config_info(
       ctx = ctx,
       features = features, # NEW
       cxx_builtin_include_directories = [
          #"/autofs/nccs-svm1_sw/summit/gcc/7.4.0/include",
          #"/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include",
          #"/autofs/nccs-svm1_sw/summit/gcc/7.4.0/lib/gcc/powerpc64le-none-linux-gnu/7.4.0/include-fixed",
          "/sw/summit/gcc/8.1.1/include",
          "/autofs/nccs-svm1_sw/summit/gcc/8.1.1/include",
          "/sw/summit/gcc/8.1.1/include/c++/8.1.1",
          "/autofs/nccs-svm1_sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include",
          "/autofs/nccs-svm1_sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include-fixed",
          "/sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include",
          "/sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include-fixed/",
          "/autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1",
          "/usr/include",
       ]

It is strange too, because it is obviously able to compile things leading up to the inclusion failure:

......
INFO: Found 1 target...
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[79 / 535] Compiling src/google/protobuf/struct.pb.cc [for host]; 0s local, remote-cache ... (3 actions, 2 running)
[124 / 535] Compiling src/google/protobuf/compiler/java/java_message_field.cc [for host]; 0s local, remote-cache ... (3 actions, 2 running)
[178 / 535] Compiling src/google/protobuf/compiler/csharp/csharp_field_base.cc [for host]; 0s local, remote-cache ... (3 actions, 2 running)
[1,702 / 2,103] Compiling internal/wait.c [for host]; 0s local, remote-cache ... (3 actions, 2 running)
[1,814 / 2,174] Compiling external/org_tensorflow/tensorflow/core/framework/cost_graph.pb.cc [for host]; 1s local, remote-cache ... (3 actions, 2 running)
ERROR: /gpfs/alpine/stf007/scratch/rprout/bazel-build-cache/user-root/b2ebe10a0ad0f6175e81a930563cb9d3/external/com_github_grpc_grpc/BUILD:1883:16: Compiling src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc failed: undeclared inclusion(s) in rule '@com_github_grpc_grpc//:grpc_transport_chttp2_server_insecure':
this rule is missing dependency declarations for the following files included by 'src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc':
  '/sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include/stdint.h'
  '/sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include/stddef.h'
  '/sw/summit/gcc/8.1.1/lib/gcc/powerpc64le-unknown-linux-gnu/8.1.1/include/stdarg.h'
  '/sw/summit/gcc/8.1.1/include/c++/8.1.1/stdlib.h'
  '/sw/summit/gcc/8.1.1/include/c++/8.1.1/cstdlib'
.....

proutrc commented 3 years ago

@hawkinsp is our toolchain not being used everywhere by chance?

[599 / 2,232] Executing genrule @local_config_cuda//cuda:cuda-include; 3s local, remote-cache ... (4 actions running)
ERROR: /gpfs/alpine/stf007/scratch/rprout/bazel-build-cache/user-root/b2ebe10a0ad0f6175e81a930563cb9d3/external/org_tensorflow/tensorflow/core/platform/BUILD:453:11: Compiling tensorflow/core/platform/path.cc failed: undeclared inclusion(s) in rule '@org_tensorflow//tensorflow/core/platform:path':

Are there different "rules" or something? undeclared inclusion(s) in rule '@org_tensorflow//tensorflow/core/platform:path':

proutrc commented 3 years ago

@hawkinsp I did some digging around in my build-cache, here: /gpfs/alpine/stf007/scratch/rprout/bazel-build-cache/user-root/b2ebe10a0ad0f6175e81a930563cb9d3/execroot/__main__/external/

It looks like these external packages set up their own .bazelrc and possibly don't get our toolchain config. Is there a guarantee that what we set as the toolchain config in JAX makes it to these external packages?

f0uriest commented 3 years ago

I was able to build from main on Traverse without having to do any toolchain modifications, though I did have to manually specify the cuda/cudnn paths.

python build/build.py --enable_cuda --cuda_path /usr/local/cuda-11.3 --cuda_version=11.3 --cudnn_version=8.2.0 --cudnn_path /usr/local/cudnn/cuda-11.3/8.2.0 --noenable_mkl_dnn --cuda_compute_capabilities 7.0 --bazel_path /usr/bin/bazel --target_cpu=ppc

Thanks so much for your help with this! Hope the other ppc users can also get it working