root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

TH1F bin content saturation #6671

Open stwunsch opened 3 years ago

stwunsch commented 3 years ago

For comparison, see the screenshot below.

The upper plot was done with TTree.Draw:

>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')

The lower plot was done with RDataFrame.Histo1D:

>>> import ROOT
>>> c = ROOT.TCanvas()
>>> h = ROOT.RDataFrame('Events', 'DYJetsToLL.root').Histo1D('GenPart_pdgId')
>>> h.Draw()

[Screenshot: upper plot from TTree.Draw, lower plot from RDataFrame.Histo1D]

I've used ROOT 6.22/02 and you can download the file here:

http://opendata.web.cern.ch/record/12353
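
For reference, the underlying saturation is easy to reproduce without the data file. A minimal standalone ROOT macro sketch (the macro name and the 2e7 fill count are made up; any count above 2^24 shows the effect):

#include <cstdio>
#include "TH1.h"

void fill_demo() {
    TH1F hf("hf", "float bins", 1, 0., 1.);
    TH1D hd("hd", "double bins", 1, 0., 1.);
    for (long i = 0; i < 20000000; ++i) {
        hf.Fill(0.5);  // each call adds 1.0f to the single bin
        hd.Fill(0.5);
    }
    // TH1F stops counting at 16777216 (2^24), since 16777216.f + 1.f == 16777216.f
    std::printf("TH1F: %.0f   TH1D: %.0f\n", hf.GetBinContent(1), hd.GetBinContent(1));
}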

ferdymercury commented 8 months ago

According to https://en.cppreference.com/w/cpp/numeric/math/nextafter, TH1F stops working well at 1e7 (with integer weights). Should we add this as the documented maximum value for TH1F, e.g. as is done with TH1C and 127? With non-integer weights this is harder to check, since the limit depends heavily on the chosen weight. But usually w = 1.

Precision loss demo for float:
nextafter(1e+01, INF) gives 10.000001; Δ = 0.000001
nextafter(1e+02, INF) gives 100.000008; Δ = 0.000008
nextafter(1e+03, INF) gives 1000.000061; Δ = 0.000061
nextafter(1e+04, INF) gives 10000.000977; Δ = 0.000977
nextafter(1e+05, INF) gives 100000.007812; Δ = 0.007812
nextafter(1e+06, INF) gives 1000000.062500; Δ = 0.062500
nextafter(1e+07, INF) gives 10000001.000000; Δ = 1.000000
nextafter(1e+08, INF) gives 100000008.000000; Δ = 8.000000
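
The table can be reproduced with a few lines of standard C++, independent of ROOT (a sketch):

#include <cmath>
#include <cstdio>
#include <initializer_list>
#include <limits>

int main() {
    for (float x : {1e1f, 1e2f, 1e3f, 1e4f, 1e5f, 1e6f, 1e7f, 1e8f}) {
        // distance from x to the next representable float above it
        const float next = std::nextafterf(x, std::numeric_limits<float>::infinity());
        std::printf("nextafter(%.0e, INF) gives %f; delta = %f\n", x, next, next - x);
    }
}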

I proposed a pull request.

ferdymercury commented 8 months ago

Yes, but why use TH1F? Everybody should always use TH1D, unless there are memory issues. I have seen problems like this too many times already.

If we want to encourage that change, I think we should start by removing TH1F from all the doxygen examples in ROOT, which I believe is why many people still use TH1F.

If you run a grep, there are almost 2000 results. Most of them are in the tutorials and test folders; others are in roofit and tmva.

eguiraud commented 8 months ago

why use TH1F? Everybody should always use TH1D,

TTree::Draw (it creates a TH1F by default).
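
For context, TTree::Draw books a TH1F ("htemp") unless told otherwise. The usual way around it is to book the histogram yourself and redirect the draw into it; a hedged sketch using the file and branch from the report (the macro name and binning are made up):

#include "TFile.h"
#include "TTree.h"
#include "TH1.h"

void draw_into_double() {
    TFile f("DYJetsToLL.root");
    TTree *t = f.Get<TTree>("Events");
    TH1D h("h", "GenPart_pdgId", 61, -30.5, 30.5);  // hypothetical binning
    t->Draw("GenPart_pdgId>>h");  // ">>h" fills the pre-booked TH1D instead of a TH1F
}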

vepadulano commented 5 months ago

A summary of the discussion at the linked PR:

fwyzard commented 5 months ago
  • We cannot implement a precision loss check in TH*F classes as they are implemented currently, as it would effectively be a no-op ...

Why not?

You can always check that (value in the bin after fill) - (value in the bin before fill) is reasonably close to the value that was added, and print a warning message otherwise.

... and a waste of CPU cycles

Ah, yes, it would definitely be slower!

ferdymercury commented 5 months ago

You can always check that (value in the bin after fill) - (value in the bin before fill) is reasonably close to the value that was added, and print a warning message otherwise.

Not really. Your suggestion would work well if you only had AddBinContentByOne. But if you have AddBinContentByWeight, then what counts as "close" becomes non-trivial: closeness is a function of the weight, so your limit would depend on the weight. There is no way to ensure that a user always calls AddBinContentByWeight with the same weight, or that the same weight is used for every bin of the histogram.

This would result in different "overflow bin limits" for every bin in the histogram. So it's an ill-posed problem.

I attempted to do this by comparing std::nextafter(current_value) - current_value against the weight, but as said, this is completely problematic if you have changing weights.

To me, the only solution is to use TH1L, where the overflow limit is well defined, and forget about floating-point precision.
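
For illustration, a sketch of that approach, assuming TH1L follows the standard TH1 constructor signature:

#include <cstdio>
#include "TH1.h"

void integer_counts() {
    // bin contents are Long64_t: integer-weight fills are exact, with a
    // well-defined overflow limit instead of silent float rounding
    TH1L h("h", "counts", 1, 0., 1.);
    for (long i = 0; i < 20000000; ++i)
        h.Fill(0.5);
    // prints exactly 20000000; a TH1F would have saturated at 16777216
    std::printf("%lld entries counted\n", (long long)h.GetBinContent(1));
}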

fwyzard commented 5 months ago

Sorry, but I strongly disagree.

TH1F implements Fill(x, w) via AddBinContent(bin, w):

void AddBinContent(Int_t bin, Double_t w) override
{
    // the Double_t weight is rounded to Float_t precision when accumulated
    fArray[bin] += Float_t(w);
}

If one wants to be warned about overflows, it could be changed to

void AddBinContent(Int_t bin, Double_t w) override
{
    float old = fArray[bin];
    fArray[bin] += Float_t (w);
    float inc = fArray[bin] - old;
    if (inc != (float) w) {  // could be done with a non-exact comparison with some tolerance
      std::cerr << "Warning: TH1F::Fill(...) failed to increment the bin due to limited floating point precision\n";
    }
}
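
With a check like this, a fill that rounding absorbs entirely would be flagged: for example, a TH1F bin already holding 1e8f and filled with w = 1.0 computes inc == 0.0f, which differs from (float)w and would print the warning.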
ferdymercury commented 5 months ago

// could be done with a non-exact comparison with some tolerance

Yeah, that's what I meant. Please define a tolerance that scales over orders of magnitude and weights, and that also takes into account clamping and overflows...

fwyzard commented 5 months ago

Sorry, I assumed that would be your job?

ferdymercury commented 5 months ago

Not my job, I am a volunteer.

vepadulano commented 5 months ago

Re-opening the issue following further discussion. The linked PR is still valid as it documents the current state of the implementation, so that doesn't need to be changed. An investigation into finding a tolerance that can account for different (orders of magnitude of) weights is the next step for this issue. Since it was not foreseen in the PoW for 2024, we cannot give an ETA at this moment.

ferdymercury commented 5 months ago

As for alternative ideas: from my point of view, I will just move towards TH1D or TH1L, and away from TTree::Draw.

fwyzard commented 5 months ago

Here is an implementation that may be naive, but I would argue it catches the vast majority of use cases:

#include <cmath>
#include <limits>

bool compare(float expected, float actual) {
  // simplest and most common case: exact match
  if (actual == expected)
    return true;

  // comparison with an arbitrarily small tolerance
  constexpr float epsilon = std::numeric_limits<float>::epsilon();
  const float delta = std::fabs(expected) * epsilon;
  if ((actual > expected - delta) and (actual < expected + delta))
    return true;

  return false;
}

If any of the arguments (the weight or the actual increment) is NaN or infinite, the function should return false, which kind of makes sense in the above context.

ferdymercury commented 5 months ago

With @lmoneta we were discussing in the PR this kind of case:

a histogram with an initial SetBinContent of 1e8, to which you add an event with weight 8.01. This leads to an error

(1e8f + 8.01f) - 1e8f - 8.01f = -0.01f

which, compared to the bin content of 1e8, is a negligible difference.

But compare(8.01f, 8.00f) would report that the increment is not the same.

So we were thinking of defining a relative tolerance somehow. We used std::nextafterf, compared the relative distance with respect to the original value, and divided by w. But weird things may happen here: you might call Fill with a negative weight, and the result might come close to zero for some bins, so a relative normalization is also ugly. We would need some compromise between an absolute and a relative normalization for the tolerance, or a lot of CPU-wasting checks. Or we could just focus on the main cases with positive weights.
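
A standalone rendering of that case, with the numbers from the example above:

#include <cstdio>

int main() {
    const float bin = 1e8f;           // initial bin content
    const float after = bin + 8.01f;  // nearest representable float is 100000008
    const float inc = after - bin;    // 8.0f, not 8.01f
    // relative to the bin content the loss is negligible (about 1e-10)...
    std::printf("inc = %f, error vs weight = %f\n", inc, inc - 8.01f);
    // ...but compare(8.01f, inc) from above fails, because its tolerance
    // scales with the weight, not with the bin content
}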

fwyzard commented 5 months ago

Relative with respect to the bin value (before the increment), with respect to the increment, or with respect to the "correct" bin value after the increment?