jfalcou / eve

Expressive Vector Engine - SIMD in C++ Goes Brrrr
https://jfalcou.github.io/eve/
Boost Software License 1.0
963 stars 58 forks source link

scalar horner algo using simd computations #1128

Closed jtlap closed 2 years ago

jtlap commented 2 years ago

I wrote this:

template<eve::floating_scalar_value T> inline
auto scalar_reverse_horner(std::vector<T> const & a, T x) noexcept
{
  using w_t =   eve::wide<T>;
  auto zz =  T(1);
  w_t xx([&zz, x](auto i,  auto){ auto z = zz; zz *= x; return z;}); //exclusive scan -> [1,x,..,x^(N-1)} 
  int N = eve::cardinal_v<w_t>;
  auto xc = zz;
  int nb = a.size()/N;
  w_t s(0);
  for(int i=0; i < nb*N; i+= N )
  {
    auto v =  eve::load(&a[i]);
    s = eve::fam(s, v, xx);
    xx *= xc;
  }
  auto rs = eve::reduce(s);

  auto xl = xx.get(0);
  for(int i=nb*N; i < a.size(); ++i)
  {
    rs = eve::fam(rs, a[i], xl);
    xl *= x;
  }
  return rs;
}

This is an internally simd version of the reverse horner algorithm which computes the value of a polynomial at a scalar floating point value x using a vector containing the polynomial coefficients in increasing order. In the purely scalar version there is a loop is to compute from s=0; looping from 0 to siz-1 by 1 (siz being the size of a)

 s=s+x*a[i] ;

over the coefficients.

In this version the loop is here replaced by a shorter one from 0 to siz-1 by N (instead of 1)

s = s+ [1,x,..., x^(N-1)]*{a[i],...a[i+N-1]};
xx *= x^N;

where N is the number of lanes of the simd vector ( Of course [1,x,..., x^(N-1)] and x^N are computed once.) This loop is then followed by a reduce to sum the N lanes and perhaps by a scalar ending. if N does not divides siz.

Can this be written using eve algorithms or with the extension of some algorithms?

At least it seems it will need a slight extension of reduce using 2 different op one for the simd accumulation and tone for he final wide reduce (will be eve::add here), but I tried that and was not fully successful. (Of course the algo::reduce first pass (the main loop) must be unaligned as spurious leading zeros in a are not wanted)

DenisYaroshevskiy commented 2 years ago

This

    xx *= xc;

Seems wrong

I suspect, you were meant to scan it.

Can you just post a scalar version of the algorithm and then it'd be easier to figure out.

DenisYaroshevskiy commented 2 years ago

I found this online: https://en.wikipedia.org/wiki/Horner%27s_method

Is this the algorithm you wanted?

Didn't test anything, but I think the code should looks smth like:

double horners_method(std::span<const double> coefficients, double point)
{  
   eve::wide<double> xs (x);

   double x_to_the_nth = x;
   for (int i = eve::wide<double>::size(); i; --i) x_to_the_nth * x;

   eve::wide<double> xs_step (x_to_the_nth);

   xs.set(0, 1.0);
   xs = eve::scan(xs, eve::mul);

   eve::wide<double> sums(0.0);
   // remove aligning to avoid dealing with not matching begin.
   // we can get the aligning and it's a good idea, because we are read only but this is simpler for a sketch
   eve::algo::for_each[eve::algo::no_aligning](coeffients, [](auto it, auto ignore) {
     auto coeffs = eve::load[ignore.else_(0.0)](it);
     sums = eve::fam(coeffs, xs);
     xs *= xs_step;
   });

   return eve::reduce(sums);
}
jtlap commented 2 years ago

Ok it seems to be like that and I will try using scan and for_each. But it seems to me that xs = eve::scan(xs, eve::mul); produces a wide full of 1 ? Am I wrong ? mine get {1,x,xx,xxx} if the cardinal is 4 and at each iteration I multiply by xc that is just the scalar xxxx multiplying by the right coeffs and summing all that gives 4 partial sums of the polynomial that I finally give to reduce. Here is my last version of this reverse_horner algorithm with simd guts : (the else part is evaluating scalar polynomial on simd elements which existed yet)

EVE_FORCEINLINE constexpr auto reversehorner(EVESUPPORTS(cpu) , T x, R const & r) noexcept { constexpr size_t N = eve::cardinal_v; const size_t sz = r.size(); if (sz >= N && scalar_value && scalar_value) { using w_t = eve::wide; size_t N = eve::cardinal_v; size_t const sz = r.size(); { size_t const nb = sz/N; size_t nbN = nbN; auto zz = T(1); w_t xx([&zz, x](auto , auto){ auto z = zz; zz = x; return z;}); size_t N = eve::cardinal_v; auto xc = zz; w_t s(0); for(size_t i=0; i < nbN; i+= N ) { auto v = eve::load(&r[i]); s = eve::fam(s, v, xx); xx*= xc; } if(auto d = sz-nbN) { auto v = eve::loadignore_last(N-d); s = eve::fam(s, v, xx); } auto rs = eve::reduce(s); return rs; } else { return detail::reverse_horner_impl(regular_type(), x, r); } }

jtlap commented 2 years ago

the polynom coeffs are in increasing degree order (this is reverse) it works fine in and rapidly (I bench it) if the the vector containing the coefficients of the polynomial is wider than the current expected wide of the archi. I have no time just now, but I will soon make a branch with the implementation, you will be able to test. As I wanted to write it with the eve:algo I wrote a reduce version with two op parameters. One for the main reduction the other to finish the process. I will put that also and the using this reduce2 version. I noticed that the number of iterations was greater with that version than with the above one

DenisYaroshevskiy commented 2 years ago

eve::algo is for generic stl like algorithms written in a certain style. The code copypaste failed I guess.

I am not sure what's reduce 2. We generally follow the standard algorithms so far.

However, let's see a clean version and then we can figure out an apropriate for algo generalization.

jfalcou commented 2 years ago

Let's move this to discussion