ax3l opened this issue 1 year ago
Similar to @kvndhrty's report in #97497.
Maybe interesting for @cdeepali @jgong5 @quickwritereader as of #98511.
I see. It's better to write it as a - b. @ax3l, could you rewrite it that way? I believe I was using the intrinsic because I thought that in the future it might map to a direct instruction.
Hi @quickwritereader,
happy to help and test. What do you mean by a - b exactly? I think multiple lines might be affected, and my feeling is that the attributes in definitions like
https://github.com/pytorch/pytorch/blob/v2.1.0-rc3/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h#L11
might not be working in clang-12 and thus the types appear as scalars.
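For reference, the definitions in question look roughly like this (a paraphrase of the typedefs in vsx_helpers.h, not a verbatim copy):
// Each alias is meant to be a 128-bit Altivec/VSX vector declared through
// GCC-style attributes; if a compiler dropped the attributes, the types
// would silently degrade to plain scalars, matching the suspicion above.
using vfloat32 = __attribute__((altivec(vector__))) float;
using vfloat64 = __attribute__((altivec(vector__))) double;
using vint16   = __attribute__((altivec(vector__))) signed short;
using vint32   = __attribute__((altivec(vector__))) signed int;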
So instead of
return {vec_neg(_vec0), vec_neg(_vec1)};
rewrite it as
return {-_vec0, -_vec1};
or, for each type,
vint16 vint0 = {};
return {vint0 - _vec0, vint0 - _vec1};
and remove vec_neg from all of them and also from the headers.
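For concreteness, a sketch of that rewrite as a full member function (assuming the vint16 alias and the _vec0/_vec1 members used by the VSX Vectorized specializations):
// Sketch only: negation without the vec_neg intrinsic, via subtraction
// from a zero-initialized vector.
Vectorized<int16_t> C10_ALWAYS_INLINE neg() const {
  const vint16 vzero = {};               // all lanes zero
  return {vzero - _vec0, vzero - _vec1};
}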
Got it, thanks! I am off for the day in my timezone, but I can push something in a few days :)
Great. You could also write it this way, for example for int32_t:
Vectorized<int32_t> C10_ALWAYS_INLINE neg() const {
return Vectorized<int32_t>(0) - *this;
}
This is more readable as well, but the form below is shorter:
return {-_vec0, -_vec1};
I replaced the neg functions locally, but there are more issues in those files showing up with Clang 12.0.1.
I am wondering if this is a Clang/LLVM defect, e.g., a missing compiler flag or intrinsic implementation for altivec.
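One cheap way to check whether the attribute-based aliases survive a given compiler is a size probe (hypothetical snippet, not from the tree):
// If the altivec(vector__) attribute is honored, vfloat32 is a 16-byte
// vector of 4 floats; if it degrades to a scalar float, this fires.
static_assert(sizeof(vfloat32) == 16, "vfloat32 is not a 128-bit vector");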
I tried again with GCC 11.2.1:
python3 -m pip install -r requirements.txt
rm -rf build
CC=gcc CXX=g++ USE_CUDA=1 BLAS=OpenBLAS MAX_JOBS=64 ATEN_AVX512_256=OFF BUILD_TEST=0 python3 setup.py develop
This compiles the altivec intrinsics in aten well, but fails in the link step, which probably needs its own issue #108984:
[2310/2316] Linking CXX executable bin/torch_shm_manager
FAILED: bin/torch_shm_manager
: && /usr/tcetmp/bin/g++ -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -rdynamic -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2023.02.10/lib -pthread @CMakeFiles/torch_shm_manager.rsp -o bin/torch_shm_manager && :
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_reduction_code(at::cuda::jit::KernelDescriptor const&, int, bool, bool, int, int)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Error::Error(c10::SourceLocation, std::string)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `torch::Library::Library(torch::Library::Kind, std::string, c10::optional<c10::DispatchKey>, char const*, unsigned int)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_code(at::cuda::jit::KernelDescriptor const&, bool, bool, at::cuda::jit::BinaryFuncVariant, bool, int, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::generate_code(int, int, std::string const&, std::string const&, std::string const&, std::string const&, std::string const&, bool, bool, at::cuda::jit::BinaryFuncVariant, c10::SmallVector<std::string, 6u>&, bool, int, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::DeviceTypeName(c10::DeviceType, bool)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::TensorBase::toString() const'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Device::Device(std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::LogAPIUsageFakeReturn(std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `at::cuda::jit::jit_pwise_function(std::string const&, std::string const&)'
/g/g90/huebl1/src/pytorch/build/lib/libtorch_cuda.so: undefined reference to `c10::Warning::Warning(c10::variant<c10::Warning::UserWarning, c10::Warning::DeprecationWarning>, c10::SourceLocation const&, std::string, bool)'
collect2: error: ld returned 1 exit status
[2312/2316] Linking CXX shared library lib/libtorch_python.so
ninja: build stopped: subcommand failed.
What was the problem using clang after the changes?
I tried again and cannot reproduce a problem with clang 12 after the fix.
Posting a fix in #108985
This fix helps with the compile error, but when testing it I get an issue:
$ python3 -c "import torch"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 234, in <module>
_load_global_deps()
File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 193, in _load_global_deps
raise err
File "/g/g90/huebl1/src/pytorch/torch/__init__.py", line 174, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/collab/usr/gapps/python/build/spack-coralea.4/var/spack/environments/python/.spack-env/view/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /g/g90/huebl1/src/pytorch/torch/lib/libtorch_global_deps.so: cannot open shared object file: No such file or directory
No, there is more... the return code of the install was zero, but I see in the logs that the const
qualifier on a free-standing function needs to be removed... Will update the PR.
Looking at #108985, I think I do not understand your guidance. Do you want me to implement free-standing functions, or member functions on the at::vec::Vectorized
class somewhere else? :)
Yes, I just wanted you to remove the integer ones and add the lines inside.
Can you please comment inline in #108985? Sorry for not understanding the structure of this file.
Other issues I see now are:
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:138:1: error: no matching function for call to 'vec_cmpne'
C10_VSX_VEC_NAN_PROPAG(vec_max_nan2, vfloat32, vbool32, vec_max)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:132:19: note: expanded from macro 'C10_VSX_VEC_NAN_PROPAG'
btype nan_b = vec_cmpne(b, b); \
^~~~~~~~~
/usr/tce/packages/clang/clang-12.0.1/release/lib/clang/12.0.1/include/altivec.h:1911:1: note: candidate function not viable: no known conversion from 'const vfloat32' (aka 'const float') to '__vector __bool unsigned char' (vector of 16 'unsigned char' values) for 1st argument
vec_cmpne(vector bool char __a, vector bool char __b) {
^
...
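A minimal standalone probe (a hypothetical file, not from the PyTorch tree) should reproduce the overload failure with this clang:
// probe_vec_cmpne.cpp -- assumes the same attribute-based alias style as
// vsx_helpers.h; tried as: clang++ -mcpu=pwr9 -c probe_vec_cmpne.cpp
#include <altivec.h>
using vfloat32 = __attribute__((altivec(vector__))) float;
using vbool32  = __attribute__((altivec(vector__))) __attribute__((altivec(bool__))) unsigned int;
// vec_cmpne(b, b) should yield a mask that is true exactly in the NaN
// lanes; clang 12's altivec.h seems to lack the float overload.
vbool32 nan_lanes(vfloat32 b) {
  return vec_cmpne(b, b);
}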
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h#L43-L70
#if !defined(vec_neg)
// Fallback when altivec.h does not provide vec_neg: negate each lane with
// the VSX single-/double-precision negate instructions via inline asm.
C10_ALWAYS_INLINE vfloat32 vec_neg(const vfloat32& vec_in) {
  vfloat32 vec_out;
  __asm__("xvnegsp %x0,%x1" : "=wf"(vec_out) : "wf"(vec_in));
  return vec_out;
}
C10_ALWAYS_INLINE vfloat64 vec_neg(const vfloat64& vec_in) {
  vfloat64 vec_out;
  __asm__("xvnegdp %x0,%x1" : "=wd"(vec_out) : "wd"(vec_in));
  return vec_out;
}
#endif
Vectorized<int16_t> C10_ALWAYS_INLINE neg() const {
return {-_vec0, -_vec1};
}
Vectorized<int32_t> C10_ALWAYS_INLINE neg() const {
return {-_vec0, -_vec1};
}
Vectorized<int64_t> C10_ALWAYS_INLINE neg() const {
return {-_vec0, -_vec1};
}
See if it works; it could also be written like this:
Vectorized<int64_t> C10_ALWAYS_INLINE neg() const {
return Vectorized<int64_t>(0) - *this;
}
Thank you, got it now. Pushed and testing now :hammer_and_wrench:
Thanks as well. Let's see what pops up next.
vec_cmpne?
Yes, quite a few: pytorch_clang12.zip
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:57: error: no matching function for call to 'vec_min'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:1: error: no matching function for call to 'vec_cmpne'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:129:1: error: no matching function for call to 'vec_cmpne'
...
/g/g90/huebl1/src/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vsx_helpers.h:132:1: error: no matching function for call to 'vec_sel'
...
fatal error: too many errors emitted, stopping now [-ferror-limit=]
Since the same code compiles on ppc64le with GCC 11.2.1, my impression is that this is clang-specific (a missing flag or a missing implementation in LLVM)...?
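For what it's worth, if only vec_cmpne were missing, the NaN mask could probably be rebuilt from older intrinsics. A hedged sketch, assuming float overloads of vec_cmpeq and vec_nor are available:
// vec_cmpeq(b, b) is false exactly in the NaN lanes; vec_nor of a mask
// with itself is a bitwise NOT, so this is true exactly where b is NaN.
C10_ALWAYS_INLINE vbool32 nan_mask(const vfloat32& b) {
  const vbool32 eq = vec_cmpeq(b, b);
  return vec_nor(eq, eq);
}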
Something is wrong with the header; we can't work around all of it. Let me see what might be wrong on godbolt.
I could not find clang 12 for ppc64le there.
It seems it's better to use GCC, especially the versions provided by IBM.
I found a ppc64le clang (trunk): https://godbolt.org/z/Eh6ezYe4c
It seems it's better not to use clang on ppc64le; I am afraid it does not work properly at all. Let's close the PR then.
One would also need to check newer clang versions and how they vectorize code, so it is safer to use GCC, especially https://www.ibm.com/support/pages/advance-toolchain-linux-power
I agree, that seems to be the best course of action for now. Thank you for all your help; sorry that we could not find a solution for Clang.
🐛 Describe the bug
Hi,
I am compiling pytorch 2.1.0-rc3 from source on RHEL8 on the PPC64LE CPU architecture with CUDA support (compute capability 7.0 for V100).
Using
ATEN_AVX512_256=ON
leads to the same errors. Full log: torch.zip
Versions
cc @malfet @seemethere