[RF] `RooStats::HistFactory::FlexibleInterpVar` failing evaluation

Zeff020 commented 1 year ago

Check duplicate issues.

[ ] Checked for duplicates

Description

Since recently we have been seeing that on ROOT master a simple workspace with a single RooStats::HistFactory::FlexibleInterpVar component fails already just on evaluation, whereas it works in ROOT 6.28.04. The workspace is attached, and the code to reproduce the error included in the reproducer.

Reproducer

unzip workspace
root -l failWS_release.root
fail->Print()
See: "message : function value is NAN"

Please find here the workspace: failWS_release.zip

ROOT version

Current ROOT master gives the error. ROOT 6.28.04 does not.

Installation method

build from source

Operating system

Linux centos7

Additional context

The PR that introduced the problem: https://github.com/root-project/root/pull/13067

egpbos commented 1 year ago

I briefly tried looking into this. I put some prints in FlexibleInterpVar::evaluate just to see the values. Without having any idea what FlexibleInterpVar should be doing because documentation is lacking, the first thing that strikes me as odd is that, in the evaluation that gives a NaN, the "low" value is higher than the "high" value. Is this expected?

egpbos commented 1 year ago

I also tried a partial "revert" (manually) of earlier commit https://github.com/root-project/root/commit/466f3f689c578cb53d75ddeeb04472ec4d82e3ed which changed the 6th order interpolation scheme (which is used in this failing example), uncommenting the commented out code here (and commenting the uncommented code):

inline double interpolate6thDegree(double x, double low, double high, double nominal, double boundary)
{
   double t = x / boundary;
   double eps_plus = high - nominal;
   double eps_minus = nominal - low;
   double S = 0.5 * (eps_plus + eps_minus);
   double A = 0.0625 * (eps_plus - eps_minus);

   return x * (S + t * A * (15 + t * t * (-10 + t * t * 3)));

//      double pow_up = std::pow(high / nominal, boundary);
//      double pow_down = std::pow(low / nominal, boundary);
//      double logHi = std::log(high);
//      double logLo = std::log(low);
//      double pow_up_log = high <= 0.0 ? 0.0 : pow_up * logHi;
//      double pow_down_log = low <= 0.0 ? 0.0 : -pow_down * logLo;
//      double pow_up_log2 = high <= 0.0 ? 0.0 : pow_up_log * logHi;
//      double pow_down_log2 = low <= 0.0 ? 0.0 : -pow_down_log * logLo;
//
//      double S0 = (pow_up + pow_down) / 2;
//      double A0 = (pow_up - pow_down) / 2;
//      double S1 = (pow_up_log + pow_down_log) / 2;
//      double A1 = (pow_up_log - pow_down_log) / 2;
//      double S2 = (pow_up_log2 + pow_down_log2) / 2;
//      double A2 = (pow_up_log2 - pow_down_log2) / 2;
//
//      // cache  coefficient of the polynomial
//      double p0 = 1. / (8 * boundary) * (15 * A0 - 7 * boundary * S1 + boundary * boundary * A2);
//      double p1 = 1. / (8 * boundary * boundary) * (-24 + 24 * S0 - 9 * boundary * A1 + boundary * boundary * S2);
//      double p2 = 1. / (4 * std::pow(boundary, 3)) * (-5 * A0 + 5 * boundary * S1 - boundary * boundary * A2);
//      double p3 = 1. / (4 * std::pow(boundary, 4)) * (12 - 12 * S0 + 7 * boundary * A1 - boundary * boundary * S2);
//      double p4 = 1. / (8 * std::pow(boundary, 5)) * (+3 * A0 - 3 * boundary * S1 + boundary * boundary * A2);
//      double p5 = 1. / (8 * std::pow(boundary, 6)) * (-8 + 8 * S0 - 5 * boundary * A1 + boundary * boundary * S2);
//
//   return 1. + x * (p0 + x * (p1 + x * (p2 + x * (p3 + x * (p4 + x * p5)))));
}

This still gives a NaN for this workspace, though, so this doesn't seem to be the issue.

egpbos commented 1 year ago

Oops, the above wasn't enough for the "revert", I also had to replace the calling site (in flexibleInterp) with

//      return total * std::exp(interpolate6thDegree(x, std::log(low), std::log(high), std::log(nominal), boundary));
      return total * interpolate6thDegree(x, low, high, nominal, boundary);

This (together with using the old code as commented out above) indeed gets rid of the nan!

I think the crucial part is that the old code checks for high <= 0 and also low <= 0.

Indeed, I traced the nan there: the sequence of events in this failing example is that high is zero, so that std::log(high) is -inf. Then eps_plus is also -inf and so S and A too, so you -inf times something negative (t happens to be negative in this example) so the term becomes +inf and eventually you get a -inf + inf and that equals nan.

So, there's at least two options to fix this, both will need some conditionals added in again.

Going back to the original algorithm. If I understand the commit message of that change correctly, it was mostly removed because it required cached variables. As I rewrote the algorithm above, no more cached variables are necessary. Of course, it does involve more temporary values than the new algorithm, so perhaps this is another reason to pick the new algorithm.
Adding similar <= 0 checks to the new algorithm. I think this should take place at the call site, because it doesn't make sense for the other interpolate6thDegree usecase where no weird values are expected from logarithms.

By the way, why is the interpolation taking place in logarithmic space instead of regular linear space and then transformed back into linear space after interpolation? Doesn't this also mean that the two interpolation schemes (FlexibleInterpVar and PiecewiseInterpolation) are again inconsistent? If the logarithm can be eliminated, I think we can do without introducing conditionals, no? That is, assuming the high and low vectors are filled sensibly, which is not checked anywhere as far as I can see, so that's still a minor weakness, but I guess that could easily be added in the documentation.

egpbos commented 1 year ago

I tried out option 3 and it gives pretty good results, using your comparison method as documented in your commit message here https://github.com/root-project/root/commit/466f3f689c578cb53d75ddeeb04472ec4d82e3ed:

          value_old     value_new       diff_abs   diff_rel
param
-1.00  0.9000000000  0.9000000000   0.0000000000   0.00E+00
-0.95  0.9047424331  0.9049857073   0.0002432742   2.69E-04
-0.85  0.9140836603  0.9146805378   0.0005968775   6.53E-04
-0.75  0.9230659064  0.9237960815   0.0007301751   7.91E-04
-0.65  0.9317147727  0.9323668386   0.0006520658   7.00E-04
-0.55  0.9402402768  0.9406592464   0.0004189696   4.46E-04
-0.45  0.9489635592  0.9490771800   0.0001136208   1.20E-04
-0.35  0.9582539443  0.9580809519  -0.0001729925  -1.81E-04
-0.25  0.9684763536  0.9681198120  -0.0003565416  -3.68E-04
-0.15  0.9799490728  0.9795779479  -0.0003711248  -3.79E-04
-0.05  0.9929118716  0.9927339847  -0.0001778869  -1.79E-04
 0.05  1.0075044778  1.0077339847   0.0002295068   2.28E-04
 0.15  1.0237554038  1.0245779479   0.0008225442   8.03E-04
 0.25  1.0415811263  1.0431198120   0.0015386857   1.48E-03
 0.35  1.0607956204  1.0630809519   0.0022853314   2.15E-03
 0.45  1.0811302458  1.0840771800   0.0029469342   2.73E-03
 0.55  1.1022639865  1.1056592464   0.0033952598   3.08E-03
 0.65  1.1238640449  1.1273668386   0.0035027937   3.12E-03
 0.75  1.1456367871  1.1487960815   0.0031592944   2.76E-03
 0.85  1.1673890440  1.1696805378   0.0022914938   1.96E-03
 0.95  1.1890997634  1.1899857073   0.0008859439   7.45E-04
 1.00  1.2000000000  1.2000000000   0.0000000000   0.00E+00

The difference is slightly worse, half an order of magnitude I'd guesstimate, but perhaps this is not such a big deal.

The implementation that produced these "new" results:

It uses your interpolate6thDegree.
On the call site in flexibleInterp, I removed the logs, the exp and I added a 1 +, which is also done in the old algorithm before returning. Indeed, I saw that when I did not add it, the absolute difference of the first results was exactly -1. So, in the end the calling line becomes return total * (1 + interpolate6thDegree(x, low, high, nominal, boundary));.

No conditionals needed and the results match the old ones pretty well still.

guitargeek commented 1 year ago

Thanks a lot for the investigation! Option 3 in your recent comment sounds like the ideas solution to me! And it indeed fixes your problem with the ATLAS workspaces? Can you make a PR with that? I think it's the good solution without reverting https://github.com/root-project/root/pull/13067.

egpbos commented 1 year ago

Yes, I'll make a PR 👍

guitargeek commented 1 year ago

In the end, this problem is fixed by the following PR:

https://github.com/root-project/root/pull/13427

The PR restores the old behavior for now. All these striking inconsistencies will be fixed in a follow-up PR.

root-project / root