jatinchowdhury18 / RTNeural

Real-time neural network inferencing
BSD 3-Clause "New" or "Revised" License

RTNEURAL_DEFAULT_ALIGNMENT=8 on armv7, EIGEN backend #139

Open MaxPayne86 opened 1 month ago

MaxPayne86 commented 1 month ago

Hi, using RTNeural 4a540403e115bae18d29142a5f54e7c3598b6e51

docker buildx create --name mybuilder
docker buildx use mybuilder
docker run -it --rm --privileged tonistiigi/binfmt --install all # Install all qemu emulators
docker run --rm -it -u $UID --platform linux/arm -v "$(pwd):/workdir" debian:buster-slim bash
apt-get update && apt-get install -y build-essential cmake
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="" -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DBUILD_BENCH=ON ../
make install DESTDIR="/workdir"

For the given environment, cmake/SIMDExtensions.cmake sets RTNEURAL_DEFAULT_ALIGNMENT=16. During the build, I see a lot of warnings like

warning: requested alignment 16 is larger than 8

These are related to compilation of the Eigen backend.

So far so good; I'm able to execute both the dynamic and templated implementations:

./rtneural_layer_bench lstm 10 1 12
Benchmarking lstm layer, with input size 1 and output size 12, with signal length 10 seconds
Processed 10 seconds of signal in 6.04064 seconds
1.65545x real-time                 
Testing templated implementation...             
Processed 10 seconds of signal in 4.94528 seconds
2.02213x real-time                 
Templated layer is 1.2215x faster!

Now I temporarily edited cmake/SIMDExtensions.cmake to set RTNEURAL_DEFAULT_ALIGNMENT=8 and recompiled. The build completes without errors, but when executing:

./rtneural_layer_bench lstm 10 1 12
Benchmarking lstm layer, with input size 1 and output size 12, with signal length 10 seconds
Processed 10 seconds of signal in 6.02594 seconds
1.65949x real-time                               
Testing templated implementation...
rtneural_layer_bench: /workdir/modules/RTNeural/modules/Eigen/Eigen/src/Core/MapBase.h:201: void Eigen::MapBase<Derived, 0>::checkSanity(typename Eigen::internal::enable_if<(Eigen::internal::traits<OtherDerived>
::Alignment > 0), void*>::type) const [with T = Eigen::Map<Eigen::Matrix<float, 12, 1, 0, 12, 1>, 16, Eigen::Stride<0, 0> >; Derived = Eigen::Map<Eigen::Matrix<float, 12, 1, 0, 12, 1>, 16, Eigen::Stride<0, 0> >;
 typename Eigen::internal::enable_if<(Eigen::internal::traits<OtherDerived>::Alignment > 0), void*>::type = void*]: Assertion `( ((internal::UIntPtr(m_data) % internal::traits<Derived>::Alignment) == 0) || (cols
() * rows() * minInnerStride * sizeof(Scalar)) < internal::traits<Derived>::Alignment ) && "data is not aligned"' failed.
Aborted (core dumped)

I would conclude that RTNEURAL_DEFAULT_ALIGNMENT=8 is not supported by the Eigen backend, even though it should be the correct alignment for a 32-bit processor.

NOTE: I'm not yet able to report on XSIMD in this same environment, since I'm hitting several compilation errors (WIP). For reference, CMake reports:

-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0

jatinchowdhury18 commented 1 month ago

Interesting, thanks for sharing... I guess there's a couple of layers of things going on here.

My first question is whether Eigen and XSIMD work on your target platform in the first place? It's entirely possible that there are some incompatibilities at their level(s). However, if you're able to use Eigen or XSIMD on your platform outside of RTNeural, then obviously there are some things we'll need to change in RTNeural.

For the CMake configuration, we should probably make a stronger effort to check what the "correct" alignment is for the target platform, rather than setting it to 8 by default... is this something that you might know how to do? We might also want to work out a way for the user to "manually" override the default alignment without needing to edit RTNeural's CMake config.
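One possible shape for such an override, sketched as a CMake cache variable (this is an assumption about how it could look, not RTNeural's actual config; the variable name simply mirrors the existing macro):

```cmake
# Hypothetical sketch: let the user pass -DRTNEURAL_DEFAULT_ALIGNMENT=8 on the
# command line, and only fall back to a platform-based guess if they didn't.
set(RTNEURAL_DEFAULT_ALIGNMENT "" CACHE STRING "Override the default byte alignment")
if(RTNEURAL_DEFAULT_ALIGNMENT STREQUAL "")
    if(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7")
        set(RTNEURAL_DEFAULT_ALIGNMENT 8)
    else()
        set(RTNEURAL_DEFAULT_ALIGNMENT 16)
    endif()
endif()
target_compile_definitions(RTNeural PUBLIC RTNEURAL_DEFAULT_ALIGNMENT=${RTNEURAL_DEFAULT_ALIGNMENT})
```

With a cache variable, the user never has to edit RTNeural's CMake files; the command-line value always wins over the platform detection.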

The provided crash log is also a bit curious to me, specifically the line T = Eigen::Map<Eigen::Matrix<float, 12, 1, 0, 12, 1>, 16, Eigen::Stride<0, 0>>. The template definition for Eigen::Map is Eigen::Map<MatrixType, Alignment, Stride>, implying that Eigen still thinks the requested alignment is 16 bytes. Would it be possible to add a check (e.g. static_assert(RTNEURAL_DEFAULT_ALIGNMENT == 8)) somewhere in your code, just to double-check that?

MaxPayne86 commented 3 weeks ago

if Eigen and XSIMD work on your target platform in the first place?

Interesting point. Are they packaged in Debian? If so, I can install the packages and then run some built-in tests.

is this something that you might know how to do?

I would put in cmake/SIMDExtensions.cmake

if(NOT RTNEURAL_USE_AVX)
    if(CMAKE_SYSTEM_PROCESSOR MATCHES "armv7")
        target_compile_definitions(RTNeural PUBLIC RTNEURAL_DEFAULT_ALIGNMENT=8)
    else()
        target_compile_definitions(RTNeural PUBLIC RTNEURAL_DEFAULT_ALIGNMENT=16)
    endif()
else()
    # ... (existing AVX branch unchanged)
endif()

implying that Eigen still thinks that the requested alignment is 16 bytes

Yes, I just checked the RTNeural code; in RTNeural/common.h:

#if RTNEURAL_DEFAULT_ALIGNMENT == 32
constexpr auto RTNeuralEigenAlignment = Eigen::Aligned32;
#else
constexpr auto RTNeuralEigenAlignment = Eigen::Aligned16;
#endif

It seems to me we also need to expand this to allow for Eigen::Aligned8.

MaxPayne86 commented 3 weeks ago

UPDATE: I've moved forward by adding the following in RTNeural/common.h

#if RTNEURAL_DEFAULT_ALIGNMENT == 32
    constexpr auto RTNeuralEigenAlignment = Eigen::Aligned32;
#elif RTNEURAL_DEFAULT_ALIGNMENT == 16
    constexpr auto RTNeuralEigenAlignment = Eigen::Aligned16;
#elif RTNEURAL_DEFAULT_ALIGNMENT == 8
    constexpr auto RTNeuralEigenAlignment = Eigen::Aligned8;
#else
    #error "Unsupported alignment"
#endif

but during compilation I still see warnings such as

RTNeural/modules/Eigen/Eigen/src/Core/arch/NEON/Complex.h:281:37: warning: requested alignment 16 is larger than 8 [-Wattributes]

Opening the above file, I see

template<> EIGEN_STRONG_INLINE std::complex<float> pfirst<Packet1cf>(const Packet1cf& a)
{
  EIGEN_ALIGN16 std::complex<float> x;
  vst1_f32(reinterpret_cast<float*>(&x), a.v);
  return x;
}
template<> EIGEN_STRONG_INLINE std::complex<float> pfirst<Packet2cf>(const Packet2cf& a)
{
  EIGEN_ALIGN16 std::complex<float> x[2];
  vst1q_f32(reinterpret_cast<float*>(x), a.v);
  return x[0];
}

I was also able to spot other warnings in RTNeural/modules/Eigen/Eigen/src/Core/arch/Default/GenericPacketMathFunctions.h; I still need to look into those more deeply.

jatinchowdhury18 commented 3 weeks ago

Interesting point, are they packetized in debian? If so I can install the package and then run some built-in tests?

I was more thinking you could try to compile a program using either Eigen or XSIMD, but without using RTNeural. I'm not sure if the libraries are packaged in any way, but the source code for both libraries is available on GitHub/GitLab, and in both cases I believe the source code includes some example programs that you could try compiling and running.

The proposed changes to SIMDExtensions.cmake and common.h look correct to me! If you'd like to make a pull request with those changes, that would be great!

The remaining warnings coming from the Eigen headers are likely there because Eigen wants some data types to be guaranteed to be aligned to 16 bytes, although I'm not 100% sure what their reasons are for wanting that. For the most part RTNeural doesn't really interact with the type information that is then passed to Eigen. For example, an RTNeural::Dense<float> will likely result in the creation of an Eigen::Matrix<float>. So you probably don't actually need std::complex<float> and may want to modify Eigen to reflect that, which would likely silence those warnings.

In all, I believe the remaining warnings that you're seeing are happening because of the relationship between your compiler/toolchain and Eigen, and aren't directly related to RTNeural.

MaxPayne86 commented 1 week ago

Update on this issue

The proposed changes to SIMDExtensions.cmake and common.h look correct to me! If you'd like to make a pull request with those changes, that would be great!

Completed as per the PR merge.

So you probably don't actually need std::complex

I can confirm that. Tested with the proposed PR in place, I am now able to execute the templated implementation without crashes:

./rtneural_layer_bench lstm 10 1 12
Benchmarking lstm layer, with input size 1 and output size 12, with signal length 10 seconds
Processed 10 seconds of signal in 7.71392 seconds
1.29636x real-time
Testing templated implementation...
Processed 10 seconds of signal in 6.50054 seconds
1.53833x real-time
Templated layer is 1.18666x faster!

However, as you can see, RTNEURAL_DEFAULT_ALIGNMENT=8 appears less performant than RTNEURAL_DEFAULT_ALIGNMENT=16 on a 32-bit processor using the Eigen backend.

XSIMD wip...

jatinchowdhury18 commented 1 week ago

Awesome! It does make sense that using 8-byte alignment will be slower than 16-byte alignment if Eigen is trying to use certain SIMD intrinsics, since they'll probably end up needing to do a lot more "unaligned load" operations. Obviously that will depend on the platform architecture, and the specifics of what Eigen is trying to do under the hood.