pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

PPC64le: vsx_helpers.h errors #108934

Open ax3l opened 1 year ago

ax3l commented 1 year ago

🐛 Describe the bug

Hi,

I am compiling pytorch 2.1.0-rc3 from source on RHEL8 using the PPC64LE CPU architecture and CUDA support (7.0 for V100).

python3 -m pip install -r requirements.txt
USE_CUDA=1 BLAS=OpenBLAS MAX_JOBS=64 ATEN_AVX512_256=OFF BUILD_TEST=0 python3 setup.py develop
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/native/Col2Im.cpp:6:
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/native/im2col.h:7:
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/native/cpu/utils.h:4:
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec.h:6:
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:19:
In file included from /g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_common_vsx.h:5:
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:57:10: error: excess elements in scalar initializer
  vint16 vint0 = {0, 0, 0, 0 ,0, 0, 0, 0};
         ^         ~~~~~~~~~~~~~~~~~~~~~~
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:58:10: error: no matching function for call to 'vec_vsubuhm'
  return vec_vsubuhm(vint0, vec_in);
         ^~~~~~~~~~~
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11484:45: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector short' (vector of 8 'short' values) for 1st argument
static __inline__ vector short __ATTRS_o_ai vec_vsubuhm(vector short __a,
                                            ^
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11489:45: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector __bool unsigned short' (vector of 8 'unsigned short' values) for 1st argument
static __inline__ vector short __ATTRS_o_ai vec_vsubuhm(vector bool short __a,
                                            ^
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11494:45: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector short' (vector of 8 'short' values) for 1st argument
static __inline__ vector short __ATTRS_o_ai vec_vsubuhm(vector short __a,
                                            ^
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11500:1: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector unsigned short' (vector of 8 'unsigned short' values) for 1st argument
vec_vsubuhm(vector unsigned short __a, vector unsigned short __b) {
^
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11505:1: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector __bool unsigned short' (vector of 8 'unsigned short' values) for 1st argument
vec_vsubuhm(vector bool short __a, vector unsigned short __b) {
^
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:11510:1: note: candidate function not viable: no known conversion from 'vint16' (aka 'short') to '__vector unsigned short' (vector of 8 'unsigned short' values) for 1st argument
vec_vsubuhm(vector unsigned short __a, vector bool short __b) {
^

Using ATEN_AVX512_256=ON leads to the same errors.

Full log: torch.zip

Versions

cc @malfet @seemethere

ax3l commented 1 year ago

Similar to @kvndhrty's report in #97497

Maybe interesting for @cdeepali @jgong5 @quickwritereader as of #98511

quickwritereader commented 1 year ago

I see. It's better to write it as a - b. @ax3l, could you rewrite it that way? I believe I was using the intrinsic because I thought it might map to a direct instruction in the future.

ax3l commented 1 year ago

Hi @quickwritereader, happy to help and test. What do you mean by a - b exactly? I think multiple lines might be affected, and my feeling is that the attributes in defines like https://github.com/pytorch/pytorch/blob/v2.1.0-rc3/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h#L11 might not be working in clang-12, so the types appear as scalars.

quickwritereader commented 1 year ago

So instead of return {vec_neg(_vec0), vec_neg(_vec1)};, rewrite it as

return {-_vec0, -_vec1};

or, for each type,

vint16 vint0 = {};
return {vint0 - _vec0, vint0 - _vec1};

and remove vec_neg from all of them, and also from the headers.

ax3l commented 1 year ago

Got it, thanks! I am off in my timezone now, but can push something in a few days :)

quickwritereader commented 1 year ago

Great. You could also write it this way, for example for int32_t:

  Vectorized<int32_t> C10_ALWAYS_INLINE neg() const {
    return Vectorized<int32_t>(0) - *this;
  }

This is more readable as well, but the one below is shorter:

return {-_vec0, -_vec1};

ax3l commented 1 year ago

I replaced the neg functions locally, but there are more issues in those files showing up with Clang 12.0.1.

I am wondering if this is a Clang/LLVM defect, e.g., a missing compiler flag or intrinsic implementation for altivec.

I tried again with GCC 11.2.1:

python3 -m pip install -r requirements.txt
rm -rf build
CC=gcc CXX=g++ USE_CUDA=1 BLAS=OpenBLAS MAX_JOBS=64 ATEN_AVX512_256=OFF BUILD_TEST=0 python3 setup.py develop

This compiles the altivec intrinsics in aten well, but fails in the link step, which probably warrants its own issue (#108984):

[2310/2316] Linking CXX executable bin/torch_shm_manager
FAILED: bin/torch_shm_manager 
: && /usr/tcetmp/bin/g++ -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -rdynamic -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2023.02.10/lib -pthread @CMakeFiles/torch_shm_manager.rsp -o bin/torch_shm_manager  && :
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_reduction_code(at::cuda::jit::KernelDescriptor const&, int, bool, bool, int, int)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Error::Error(c10::SourceLocation, std::string)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `torch::Library::Library(torch::Library::Kind, std::string, c10::optional<c10::DispatchKey>, char const*, unsigned int)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_code(at::cuda::jit::KernelDescriptor const&, bool, bool, at::cuda::jit::BinaryFuncVariant, bool, int, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_code(int, int, std::string const&, std::string const&, std::string const&, std::string const&, std::string const&, bool, bool, at::cuda::jit::BinaryFuncVariant, c10::SmallVector<std::string, 6u>&, bool, int, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::DeviceTypeName(c10::DeviceType, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::TensorBase::toString() const'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Device::Device(std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::LogAPIUsageFakeReturn(std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::jit_pwise_function(std::string const&, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Warning::Warning(c10::variant<c10::Warning::UserWarning, c10::Warning::DeprecationWarning>, c10::SourceLocation const&, std::string, bool)'
collect2: error: ld returned 1 exit status
[2312/2316] Linking CXX shared library lib/libtorch_python.so
ninja: build stopped: subcommand failed.
quickwritereader commented 1 year ago

What was the problem using clang after the changes?

ax3l commented 1 year ago

I tried again and cannot reproduce a problem with clang 12 after the fix.

Posting a fix in #108985

ax3l commented 1 year ago

This fix helps with the compile error, but testing it I get an issue:

$ python3 -c "import torch"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 234, in <module>
    _load_global_deps()
  File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 193, in _load_global_deps
    raise err
  File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 174, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/collab/usr/gapps/python/build/spack-coralea.4/var/spack/environments/python/.spack-env/view/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /g/g90/huebl1/src/pytorch/torch/lib/libtorch_global_deps.so: cannot open shared object file: No such file or directory
ax3l commented 1 year ago

No, there is more... the return code of the install was zero, but I see in the logs that the const qualifier on a free-standing function needs to be removed... Will update the PR.

ax3l commented 1 year ago

Looking at #108985, I think I do not understand your guidance. Do you want me to implement free-standing functions, or member functions on the at::vec::Vectorized class somewhere else? :)

quickwritereader commented 1 year ago

Yes, I just wanted you to remove the integer ones and add the lines inside.

ax3l commented 1 year ago

Can you please comment inline in #108985? Sorry for not understanding the structure of this file.

ax3l commented 1 year ago

Other issues I see now are:

/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:138:1: error: no matching function for call to 'vec_cmpne'
C10_VSX_VEC_NAN_PROPAG(vec_max_nan2, vfloat32, vbool32, vec_max)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:132:19: note: expanded from macro 'C10_VSX_VEC_NAN_PROPAG'
    btype nan_b = vec_cmpne(b, b);                            \
                  ^~~~~~~~~
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:1911:1: note: candidate function not viable: no known conversion from 'const vfloat32' (aka 'const float') to '__vector __bool unsigned char' (vector of 16 'unsigned char' values) for 1st argument
vec_cmpne(vector bool char __a, vector bool char __b) {
^
...
quickwritereader commented 1 year ago

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h#L43-L70


#if !defined(vec_neg)
C10_ALWAYS_INLINE vfloat32 vec_neg(const vfloat32& vec_in) {
  vfloat32 vec_out;
  __asm__("xvnegsp %x0,%x1" : "=wf"(vec_out) : "wf"(vec_in));
  return vec_out;
}

C10_ALWAYS_INLINE vfloat64 vec_neg(const vfloat64& vec_in) {
  vfloat64 vec_out;
  __asm__("xvnegdp %x0,%x1" : "=wd"(vec_out) : "wd"(vec_in));
  return vec_out;
}

#endif
quickwritereader commented 1 year ago

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vsx/vec256_int16_vsx.h#L309-L311

  Vectorized<int16_t> C10_ALWAYS_INLINE neg() const {
    return {-_vec0, -_vec1};
  }

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vsx/vec256_int32_vsx.h#L240-L242

  Vectorized<int32_t> C10_ALWAYS_INLINE neg() const {
        return {-_vec0, -_vec1};
  }

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vsx/vec256_int64_vsx.h#L192-L194

  Vectorized<int64_t> C10_ALWAYS_INLINE neg() const {
     return {-_vec0, -_vec1};
  }
quickwritereader commented 1 year ago

See if it works; it could also be written like this:

  Vectorized<int64_t> C10_ALWAYS_INLINE neg() const {
    return Vectorized<int64_t>(0) - *this;
  }

ax3l commented 1 year ago

Thank you, got it now. Pushed and testing now :hammer_and_wrench:

quickwritereader commented 1 year ago

Thanks as well. Let's see what pops up next. vec_cmpne?

ax3l commented 1 year ago

Yes, quite a few: pytorch_clang12.zip

/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:57: error: no matching function for call to 'vec_min'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:1: error: no matching function for call to 'vec_cmpne'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:1: error: no matching function for call to 'vec_cmpne'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:132:1: error: no matching function for call to 'vec_sel'
...
fatal error: too many errors emitted, stopping now [-ferror-limit=]

Since the same code compiles on ppc64le with GCC 11.2.1, my impression is that this is clang-specific (a missing flag or a missing implementation in LLVM)...?

quickwritereader commented 1 year ago

Something is wrong with the header. We can't work around all of them. Let me see what might be wrong on godbolt.

quickwritereader commented 1 year ago

I could not find clang 12 for ppc64le there.
It seems it's better to use GCC, especially the builds provided by IBM.

ax3l commented 1 year ago

I found a power64le clang (trunk): https://godbolt.org/z/Eh6ezYe4c

quickwritereader commented 1 year ago

It seems it's better not to use clang on ppc64le. I am afraid it does not work properly at all. Let's close the PR then.

quickwritereader commented 1 year ago

And one needs to check clang's newer versions and how they vectorize code. So it is safer to use GCC, especially the IBM Advance Toolchain: https://www.ibm.com/support/pages/advance-toolchain-linux-power

ax3l commented 1 year ago

I agree, that indeed seems to be the best course of action for now. Thank you for all your help; sorry that we could not find a solution for Clang.