mit-crpg / OpenMOC

A method of characteristics code for nuclear reactor physics calculations.
https://mit-crpg.github.io/OpenMOC/

add continued fraction exponential #458

Closed — gridley closed this 4 years ago

gridley commented 4 years ago

So I stumbled across the interesting fact the other day that you can rearrange any rational polynomial into a continued fraction by recursively doing polynomial long division.
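
For anyone skimming, a minimal sketch of that identity (my own toy example, not the actual Remez coefficients used below): long-divide the denominator by the numerator and recurse on the remainder,

\[
\frac{N(x)}{D(x)} = \frac{1}{q(x) + \dfrac{r(x)}{N(x)}},
\qquad D(x) = q(x)\,N(x) + r(x),
\]

e.g.

\[
\frac{x+3}{x^{2}+4x+5} = \cfrac{1}{(x+1) + \cfrac{2}{x+3}}
\quad\text{since}\quad (x+1)(x+3) + 2 = x^{2}+4x+5 .
\]

Each level of the fraction then costs one multiply-add plus one reciprocal, instead of the two full Horner recurrences plus a final division.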

Got an 8% speedup in the transport sweeps on C5G7 by rearranging the algebra of the rational approximation to the exponential. This is faster than using Horner's method on the numerator and denominator separately (as is currently done).

I tested the accuracy of the new method (it should be identical to Josey's Remez results aside from numerical evaluation error) and it's indistinguishable.

Similar stuff could be done for the other rational approximations!

Also, there's apparently an assembly instruction for an approximate reciprocal, whose accuracy is likely acceptable for our rational approximation, which is approximate as-is. This would be stupid fast if someone wrote the inline assembly to use that instruction.

On Intel CPUs it should be VRCP14PD or VRCP14PS for SIMD reciprocals. I don't think any compiler will ever emit these on its own, since they're mostly meant as a building block for lower-level routines like computing square roots with Newton's method.
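
For reference, here's a rough sketch of what that could look like through compiler intrinsics rather than hand-written assembly (my own illustration, untested in OpenMOC; assumes an AVX-512F capable CPU and compiling with -mavx512f):

#include <cstdio>
#include <immintrin.h>

int main() {
  // Approximate reciprocal of 8 packed doubles via VRCP14PD (about 2^-14
  // relative error), optionally refined with one Newton-Raphson step:
  //   r' = r * (2 - d*r)
  __m512d d = _mm512_set1_pd(3.0);
  __m512d r = _mm512_rcp14_pd(d);
  r = _mm512_mul_pd(r, _mm512_fnmadd_pd(d, r, _mm512_set1_pd(2.0)));

  double out[8];
  _mm512_storeu_pd(out, r);
  std::printf("1/3 ~= %.12f\n", out[0]);
  return 0;
}

Whether the raw ~2^-14 relative error is tolerable without the refinement step would have to be checked against the F1 error budget.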

GiudGiud commented 4 years ago

This is pretty cool. I'll test it on some BEAVRS 70g cases. Could you please upload a plot of the absolute and the relative error compared to either the rational approximation or the expF1 function? Pick a range of about [0, 1000].

We may want to redesign the test suite to allow a little more tolerance.

gridley commented 4 years ago

OK, this explains why the tests failed. I compared the absolute value of the difference between (1-exp(-x))/x for the original and new F1 term calculations. In double precision they are identical, but accuracy suffers in single precision. There's probably a way around this...

Here is a plot of the error for double precision: [plot: float_error_compare]

And for single precision: [plot: float_continued_fraction_error]

Dangit, that is a shame to see. I didn't test the speedup in double precision. When I was testing the function on its own, the double precision version outperformed the original by an even larger margin than the single precision version did. I'll check the C5G7 speedup in double precision at some point, although it's not like anyone is that interested in running double precision right now.

gridley commented 4 years ago

OK, and here is the absolute error in (1-exp(-x)), which is perhaps the more appropriate metric. It seems to max out at 7.9e-6. Up to you whether that's acceptable @GiudGiud.

[plot: continued_absolute_error]

gridley commented 4 years ago

OK, so, exact timing results on the exponential evaluation: this is more than twice as fast for computing exponentials in single precision. Here are my results. One noteworthy thing is that you don't see the big performance gains until you slap on some optimization flags. I think the compiler is rearranging these expressions in a rather interesting way; unfortunately I'm not quite good enough at reading assembly to tell exactly what it's doing.

gavin@gpad:~/scratch$ g++ -O4 -ffast-math optimize_rational.C 
gavin@gpad:~/scratch$ ./a.out 
Baseline version took 1.1e-07s
Revised version took 5.2e-08s

And for double precision, it is simply epic. Three times as fast.

Baseline version took 1.43e-07s
Revised version took 4.8e-08s

Here's the code I used to measure this.

#define FP_PRECISION double
#define INLINE
#include <chrono>
#include <cmath>
#include <cstdlib>   // rand(), RAND_MAX
#include <iostream>
#include <fstream>
#include <vector>

// constexpr FP_PRECISION max_float = (FP_PRECISION)RAND_MAX;

INLINE FP_PRECISION randf() {
  return (FP_PRECISION)rand()/(FP_PRECISION)RAND_MAX * 10.0;
}

INLINE void expF1_fractional(FP_PRECISION x, FP_PRECISION* expF1) {
  const FP_PRECISION p0 = 1.0;
  const FP_PRECISION p1 = 2.4172687328033081 * 1E-1;
  const FP_PRECISION p2 = 6.2804790965268531 * 1E-2;
  const FP_PRECISION p3 = 1.0567595009016521 * 1E-2;
  const FP_PRECISION p4 = 1.0059468082903561 * 1E-3;
  const FP_PRECISION p5 = 1.9309063097411041 * 1E-4;
  const FP_PRECISION d0 = 1.0;
  const FP_PRECISION d1 = 7.4169266112320541 * 1E-1;
  const FP_PRECISION d2 = 2.6722515319494311 * 1E-1;
  const FP_PRECISION d3 = 6.1643725066901411 * 1E-2;
  const FP_PRECISION d4 = 1.0590759992367811 * 1E-2;
  const FP_PRECISION d5 = 1.0057980007137651 * 1E-3;
  const FP_PRECISION d6 = 1.9309063097411041 * 1E-4;

  FP_PRECISION num, den;

  den = d6*x + d5;
  den = den*x + d4;
  den = den*x + d3;
  den = den*x + d2;
  den = den*x + d1;
  den = den*x + d0;
  den = 1.f / den;

  num = p5*x + p4;
  num = num*x + p3;
  num = num*x + p2;
  num = num*x + p1;
  num = num*x + p0;

  *expF1 = num * den;
} 

INLINE void expF1_continued(FP_PRECISION x, FP_PRECISION* expF1) {
  FP_PRECISION t1 = -0.0007706618174095522 + x;
  FP_PRECISION t2 = 430.4417106253813 + 8.065530110666254*x;
  FP_PRECISION t3 = -0.0014907526842255446 + 0.00007959230164620542*x;
  FP_PRECISION t4 = -909.733382360302 + 72.8109059813554*x;
  FP_PRECISION t5 = -0.002868787847637995 + 0.00030789156651464197*x;
  FP_PRECISION t6 = -1527.156986843306 + 200.5065021575783*x;
  *expF1 = 1.f/(t1 + 1.f/(t2 + 1.f/(t3 + 1.f/(t4 + 1.f/(t5 + 1.f/t6)))));
}

// Test the baseline F1 exponential
double benchmark(int ntimes) {
  FP_PRECISION result;
  std::chrono::duration<double> total_time(0);
  std::vector<FP_PRECISION> test_x(ntimes);

  for (int i=0; i<ntimes; ++i) {
    test_x[i] = randf();
  }

  auto start = std::chrono::high_resolution_clock::now();
  for(int i=0; i<ntimes; ++i) {
    expF1_fractional(test_x[i], &result);
  }
  auto end = std::chrono::high_resolution_clock::now();

  total_time = end-start;
  return total_time.count();
}

// Test the modified F1
double revised(int ntimes) {
  FP_PRECISION result;

  std::chrono::duration<double> total_time(0);

  std::vector<FP_PRECISION> test_x(ntimes);
  for (int i=0; i<ntimes; ++i) {
    test_x[i] = randf();
  }

  auto start = std::chrono::high_resolution_clock::now();
  for(int i=0; i<ntimes; ++i) {
    expF1_continued(test_x[i], &result);
  }
  auto end = std::chrono::high_resolution_clock::now();

  total_time = end-start;
  return total_time.count();
}

int main() {
  int ntimes = 200000000;

  double diff1 = benchmark(ntimes);
  std::cout << "Baseline version took " << diff1 << "s" << std::endl;

  double diff2 = revised(ntimes);
  std::cout << "Revised version took " << diff2 << "s" << std::endl;

  // Calculate error of each version:
  std::cout << "Writing out errors of each function:" << std::endl;
  std::ofstream ff;
  ff.open("rational_appr_err.txt");
  FP_PRECISION dx = 1e-8;
  FP_PRECISION x = dx;
  FP_PRECISION result;
  FP_PRECISION truevalue;
  while (x < 1e2) {
    truevalue = (1-exp(-x))/x;
    ff << x << " ";
    expF1_fractional(x, &result);
    ff << std::abs(result-truevalue) << " ";
    expF1_continued(x, &result);
    ff << std::abs(result-truevalue) << std::endl;
    x += dx;
    dx *= 1.01;
  }
}
GiudGiud commented 4 years ago

Double precision matters mostly for the test suite at this point. It's odd that the tests are failing with that close an agreement.

In single precision, errors close to 1 pcm would be fine if they occurred only at very short or very long (and hence rare) optical lengths, but the domain where they occur here is right where most segments are.

To take this further, since a factor of 2 on the exponential is worth it, I would recommend you fit the coefficients of the continued fraction directly to F1. FindFit in Mathematica is pretty fast and easy to use. This should give you a more even fit. It will also let you try one fewer or one more term to adjust the precision if needed.

gridley commented 4 years ago

I wrote this to try to anneal the parameters to reduce the max error. Nothing like some good ol' black box optimization. I was able to drop the error by about 50% with a 2 minute annealing run. I'm going to turn down the cooling rate and let this puppy run for about 2 hours on the cluster. Feel free to play with it. Ahhh... and I can make the random seed something different on the command line and run several of these at once...

#include <array>
#include <chrono>
#include <cmath>
#include <cstdlib>   // rand(), srand(), exit()
#include <iostream>
#include <string>    // std::stof
// Continued-fraction form of expF1 with the coefficients exposed as
// parameters so the annealer can perturb them.
void expF1_continued(float x, float* expF1, std::array<float, 11> const& params) {
  float t1 = params[0]  + x;
  float t2 = params[1] + params[2]*x;
  float t3 = params[3] + params[4]*x;
  float t4 = params[5] + params[6]*x;
  float t5 = params[7] + params[8]*x;
  float t6 = params[9] + params[10]*x;
  *expF1 = 1.f/(t1 + 1.f/(t2 + 1.f/(t3 + 1.f/(t4 + 1.f/(t5 + 1.f/t6)))));
}

// Calculates max error of the continued fraction over [1e-8,100]
float get_max_error(std::array<float,11>& params) {
  float dx = 1e-8;
  float x = dx;
  float result;
  float truevalue;
  float max_err = 0;
  while (x < 1e2) {
    truevalue = (1-exp(-x))/x;
    expF1_continued(x, &result, params);
    float this_err = std::abs(result-truevalue);
    if (this_err > max_err) max_err = this_err;
    x += dx;
    dx *= 1.005;
  }
  return max_err;
}

inline float randf() {
  return (float)rand()/(float)RAND_MAX;
}

int main(int argc, char* argv[]) {
  srand(1);

  // read settings from command line
  if (argc != 4) {
    std::cerr << "Incorrect number of command line arguments" << std::endl;
    exit(1);
  }
  float T = std::stof(argv[1]); // high temperature
  float coolrate = std::stof(argv[2]);
  float fractional_jmp = std::stof(argv[3]);

  std::array<float, 11> params_final;
  std::array<float, 11> new_params;

  params_final[0]  = -0.0007706618174095522;
  params_final[1]  = 430.4417106253813;
  params_final[2]  = 8.065530110666254;
  params_final[3]  = -0.0014907526842255446;
  params_final[4]  = 0.00007959230164620542;
  params_final[5]  = -909.733382360302;
  params_final[6]  = 72.8109059813554;
  params_final[7]  = -0.002868787847637995;
  params_final[8]  = 0.00030789156651464197;
  params_final[9]  = -1527.156986843306;
  params_final[10] = 200.5065021575783;

  std::cout << "Starting max error:" << std::endl;
  float current_error = get_max_error(params_final);
  std::cout << current_error << std::endl;
  float Tfinal = 1e-7;

  // Approximate the number of iterations
  float niter = std::log(Tfinal/T)/std::log(1-coolrate);
  int thisiter = 0;

  auto start = std::chrono::high_resolution_clock::now();

  while (T > Tfinal) {

    // Generate new possible set of parameters
    new_params = params_final;
    #pragma unroll
    for (int i=0; i<new_params.size(); ++i) {

      // Generate quasi-normal distribution sample
      float kinda_normal_sample = 0;
      kinda_normal_sample += randf();
      kinda_normal_sample += randf();
      kinda_normal_sample += randf();
      kinda_normal_sample -= 1.5;

      new_params[i] += kinda_normal_sample * fractional_jmp * new_params[i];
    }

    // get new max error:
    float newerror = get_max_error(new_params);

    if (newerror < current_error) {
      current_error = newerror;
      params_final = new_params;
    }
    else if (std::exp(-(newerror-current_error)/T) > randf()) {
      current_error = newerror;
      params_final = new_params;
    }

    T -= T*coolrate;
    thisiter++;
    std::cout << T << " " << current_error << std::endl;

    if (thisiter == 1000) {
      auto stop = std::chrono::high_resolution_clock::now();
      std::chrono::duration<float> elapsed = stop - start;
      float time_per_iteration = elapsed.count() / 1000;
      float time_remaining = niter * time_per_iteration;
      std::cout << "Approximately " << time_remaining << "s remaining. Continue? (y/n)" << std::endl;
      char yn;
      std::cin >> yn;
      if (yn != 'y') break;
    }
  }

  std::cout << "Annealed parameters:" << std::endl;
  for (int i=0; i<params_final.size(); ++i) {
    std::cout << params_final[i] << std::endl;
  }
}
GiudGiud commented 4 years ago

Sounds like a good idea. Keep in mind that (1 - exp(-x)) / x computed naively with the standard library is not accurate near 0.
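
For the reference value in those error checks, something along the lines of -expm1(-x)/x sidesteps the cancellation; a minimal sketch (just for checking errors, not part of the PR):

#include <cmath>
#include <cstdio>

// Reference F1(x) = (1 - exp(-x)) / x evaluated via expm1, which avoids
// the catastrophic cancellation of 1 - exp(-x) for small x.
double f1_reference(double x) {
  if (x == 0.0)
    return 1.0;                // limiting value of F1 as x -> 0
  return -std::expm1(-x) / x;
}

int main() {
  const double xs[] = {1e-10, 1e-6, 1e-2, 1.0, 10.0};
  for (double x : xs)
    std::printf("%-8g %.17g\n", x, f1_reference(x));
  return 0;
}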

gridley commented 4 years ago

Well, I ended up shifting some things in x to hopefully bring all of the coefficients to roughly the same order of magnitude. Unfortunately that only brings the max error down to about 3.5e-5. I'm going to have to abandon this at this point, even if it is promising... Shucks. Maybe we could leave this PR open in case anybody figures out a more numerically stable way to evaluate this expression.

// Rescaled continued-fraction evaluation of F1, with the nested
// reciprocals partially rearranged to cut the number of divisions.
INLINE void expF1_continued(FP_PRECISION x, FP_PRECISION* expF1) {
  FP_PRECISION t1 = -0.0007706618174095522 + x;
  FP_PRECISION t2 = 0.02486121577471738 + 0.019203643120633936*x;
  FP_PRECISION t3 = -0.6261161273747288 + 0.03342876669140628*x;
  FP_PRECISION t4 = -2.166031862762624 + 0.17335929995560811*x;
  FP_PRECISION t5 = -1.204890896007958 + 0.12931445793614962*x;
  FP_PRECISION t6 = -3.6360880639126334 + 0.47739643370851975*x;
  FP_PRECISION b = t5*t6+1.f;
  FP_PRECISION c = t3 + b/(t4*b+t6);
  FP_PRECISION d = 420.f*(t2+(c+1.f)/c);
  *expF1 = d/(t1*d+1.0f);
}
gridley commented 4 years ago

Closing this because I have played with this quite a bit and do not think it can be made numerically stable.