root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern
Other
2.61k stars 1.25k forks source link

TH1F bin content saturation #6671

Open stwunsch opened 3 years ago

stwunsch commented 3 years ago

See for comparison the screenshot below.

The upper plot was done with TTree.Draw:

>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')

The lower plot was done with RDataFrame.Histo1D:

>>> import ROOT
>>> c = ROOT.TCanvas()
>>> h = ROOT.RDataFrame('Events', 'DYJetsToLL.root').Histo1D('GenPart_pdgId')
>>> h.Draw()

Screenshot from 2020-10-20 10-38-16

I've used ROOT 6.22/02 and you can download the file here:

http://opendata.web.cern.ch/record/12353

bellenot commented 3 years ago

@stwunsch which one is correct?

stwunsch commented 3 years ago

Good question! I guess the lower one, also checked by counting only the respective entries in the GenPart_pdgId vector.

stwunsch commented 3 years ago

Well, it's a DY sample, so physically spoken we would expect the upper one (?) Let's say, it's not so clear what's going on :sweat_smile:

eguiraud commented 3 years ago

Can you enforce the same exact binning in both and also try with raw TTree and raw TTreeReader? If RDF or TTree is producing wrong histograms it's a critical issue if not a release blocker.

stwunsch commented 3 years ago

@eguiraud Do we agree that the code above should produce (in principle) the same output?

stwunsch commented 3 years ago

@eguiraud Here you can find the plots produced with the same binning.

>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> h = ROOT.TH1F('h', 'h', 100, -20, 20)
>>> c = ROOT.TCanvas()
>>> t.Draw('GenPart_pdgId>>h')
>>> import ROOT
>>> c = ROOT.TCanvas()
>>> h = ROOT.RDataFrame('Events', 'DYJetsToLL.root').Histo1D(('h', 'h', 100, -20, 20), 'GenPart_pdgId')
>>> h.Draw()

Screenshot from 2020-10-20 11-33-43

stwunsch commented 3 years ago

As a cross check: The manual count in RDF of the entries we see in the histogram.

>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 11)').Sum('n').GetValue()
36148151.0
>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 13)').Sum('n').GetValue()
25642431.0
>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 15)').Sum('n').GetValue()
18289536.0
eguiraud commented 3 years ago

Do we agree that the code above should produce (in principle) the same output?

I don't see why not. We really need an authoritative answer as to which of the plots is correct though. Maybe raw TTree and TTreeReader can act as tie-breaker. There are no missing entries (the number of entries in the two histograms is the same) so it seems that one of the two interfaces is reading wrong values -- but how can it be that histogram mean and variance are exactly the same in the two cases? The mean, I can understand, but the variance should be larger for the upper histo than for the lower histo.

stwunsch commented 3 years ago

Here a reproducer with my best raw TTree skills:

void test() {
    auto f = TFile::Open("DYJetsToLL.root");
    auto t = f->Get<TTree>("Events");
    UInt_t n;
    Int_t x[100];
    t->SetBranchAddress("nGenPart", &n);
    t->SetBranchAddress("GenPart_pdgId", x);
    auto n11 = 0u;
    auto n13 = 0u;
    auto n15 = 0u;
    for(auto i = 0u; i < t->GetEntries(); i++) {
        t->GetEntry(i);
        for (auto j = 0u; j < n; j++) {
            if (x[j] == 11) n11++;
            if (x[j] == 13) n13++;
            if (x[j] == 15) n15++;
        }
    }
    cout << "n11: " << n11 << endl;
    cout << "n13: " << n13 << endl;
    cout << "n15: " << n15 << endl;
}
n11: 36148151
n13: 25642431
n15: 18289536

So we can conclude that's a plotting issue in TTree::Draw? And I think RDataFrame does the right thing.

stwunsch commented 3 years ago

Another ingredient, tree.Draw('GenPart_pdgId[0]') gives a correct plot:

Screenshot from 2020-10-20 12-25-09

eguiraud commented 3 years ago

Ok, thanks for investigating. I think it's safe to treat it as a TTree::Draw bug unless @pcanal figures out otherwise.

eguiraud commented 3 years ago

Ah, one last thing: could it be there are that many points that are right at the bin edges? it would explain why mean and variance are the same in the two cases. Can you try with a less regular or much more fine-grained binning?

stwunsch commented 3 years ago

If the edge would be the issue, we would see more bins filled than the six we have in the plot, right? The second round of plots uses a equidistant binning with 100 bins between -20 and 20.

stwunsch commented 3 years ago

This should be the used bins:

>>> import numpy as np
>>> np.linspace(-20, 20, 101)
array([-20. , -19.6, -19.2, -18.8, -18.4, -18. , -17.6, -17.2, -16.8,
       -16.4, -16. , -15.6, -15.2, -14.8, -14.4, -14. , -13.6, -13.2,
       -12.8, -12.4, -12. , -11.6, -11.2, -10.8, -10.4, -10. ,  -9.6,
        -9.2,  -8.8,  -8.4,  -8. ,  -7.6,  -7.2,  -6.8,  -6.4,  -6. ,
        -5.6,  -5.2,  -4.8,  -4.4,  -4. ,  -3.6,  -3.2,  -2.8,  -2.4,
        -2. ,  -1.6,  -1.2,  -0.8,  -0.4,   0. ,   0.4,   0.8,   1.2,
         1.6,   2. ,   2.4,   2.8,   3.2,   3.6,   4. ,   4.4,   4.8,
         5.2,   5.6,   6. ,   6.4,   6.8,   7.2,   7.6,   8. ,   8.4,
         8.8,   9.2,   9.6,  10. ,  10.4,  10.8,  11.2,  11.6,  12. ,
        12.4,  12.8,  13.2,  13.6,  14. ,  14.4,  14.8,  15.2,  15.6,
        16. ,  16.4,  16.8,  17.2,  17.6,  18. ,  18.4,  18.8,  19.2,
        19.6,  20. ])

The values -15, -13, -11, 11, 13 and 15 are not on the edges.

eguiraud commented 3 years ago

Fair enough :smile:

eguiraud commented 3 years ago

An interesting observation: if you look at the bars in the two plots, for the RDF plot you can estimate the total number of entries as 37e6*2 + 26e6*2 + 18e6*2 = 1.62e8 which is consistent with the number of entries reported by the histogram. For the TTree::Draw plot you get, more or less, 16.75e6 * 6 = 1.01e8: why are the other entries not displayed? Did they end up in the overflow/underflow bins?

stwunsch commented 3 years ago

No underflows or overflows:

>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> c = ROOT.TCanvas()
>>> h = ROOT.TH1F('h', 'h', 100, -20, 20)
>>> t.Draw('GenPart_pdgId>>h')
>>> h.GetBinContent(0)
0.0
>>> h.GetBinContent(21)
0.0
>>> h.Integral()
100663296.0

@eguiraud But have a look at the integral! Indeed, your estimation is fully correct.

stwunsch commented 3 years ago

Last round of investigations, here the Events->Print() output:

*............................................................................*
*Br   62 :nGenPart  : nGenPart/i                                             *
*Entries : 30458871 : Total  Size=  121869654 bytes  File Size  =   27764921 *
*Baskets :      349 : Basket Size=     574464 bytes  Compression=   4.39     *
*............................................................................*
*Br   67 :GenPart_pdgId : GenPart_pdgId[nGenPart]/I                          *
*Entries : 30458871 : Total  Size=  767472277 bytes  File Size  =  175369010 *
*Baskets :      698 : Basket Size=    1738752 bytes  Compression=   4.38     *
*............................................................................*

And the issue is also visible in TBrowser by clicking on the branch:

Screenshot from 2020-10-20 13-26-28

eguiraud commented 3 years ago

image

pcanal commented 3 years ago

Running TTree::Scan should give an idea of the type of error made by TTree::Draw.

stwunsch commented 3 years ago

TTree::Scan looks fine, I guess.

$ root -l DYJetsToLL.root 
root [0] 
Attaching file DYJetsToLL.root as _file0...
(TFile *) 0x55a4bb156060
root [1] Events->Scan("GenPart_pdgId")
***********************************
*    Row   * Instance * GenPart_p *
***********************************
*        0 *        0 *        15 *
*        0 *        1 *        11 *
*        0 *        2 *       -15 *
*        0 *        3 *        11 *
*        0 *        4 *       -11 *
*        0 *        5 *        11 *
*        0 *        6 *        15 *
*        0 *        7 *       -11 *
*        0 *        8 *        15 *
*        0 *        9 *        11 *
*        0 *       10 *        15 *
*        0 *       11 *       -15 *
*        0 *       12 *        15 *
*        0 *       13 *        15 *
*        0 *       14 *        11 *
*        0 *       15 *        11 *
*        0 *       16 *        15 *
*        0 *       17 *        15 *
*        0 *       18 *        11 *
*        1 *        0 *       -13 *
*        1 *        1 *       -13 *
*        1 *        2 *       -11 *
*        1 *        3 *        13 *
*        1 *        4 *        11 *
*        1 *        5 *        13 *
eguiraud commented 3 years ago

@pcanal ping (this is marked as critical)

eguiraud commented 3 years ago

Found a smoking gun with @stwunsch : given a file with more than 16777216 entries, let's say, 50e6:

>>> ROOT.RDataFrame(50000000).Define("x", "1").Snapshot("t", "f.root")

TTree::Draw caps the contents of each bin at 16777216 (note in particular that h->GetEntries() returns the correct number, 50e6, but the bin contents are wrong:

root [1] t->Draw("x>>h")
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
root [2] h->GetEntries()
(double) 50000000.
root [3] h->GetNbinsX()
(int) 3
root [4] h->GetBinContent(2)
(double) 16777216.
root [5] h->GetBinContent(0)
(double) 0.0000000
root [6] h->GetBinContent(1)
(double) 0.0000000
root [7] h->GetBinContent(3)
(double) 0.0000000

@pcanal help

pcanal commented 3 years ago

So, if you run on a few entries, the result are consistent between RDataFrame and TTree::Draw?

eguiraud commented 3 years ago

Yes

eguiraud commented 3 years ago

Note that the TTree in the last repro is really nothing special, it just has one integer (scalar) branch with all values equal to 1

pcanal commented 3 years ago

Is the histogram of the same type? (TH1I vs TH1F)

eguiraud commented 3 years ago

in the snippet above, h is a TH1F

pcanal commented 3 years ago

Then it definitively weird :( ... Would have to see whether the cutoff is down in TSelectorDraw itself or somehow in the histogram ...

Axel-Naumann commented 3 years ago

Simple repo:

void repo() {
  int i = 10000;
  TTree t("t", "title");
  t.Branch("b", &i, "i/I");
  for (int e = 0; e < 50000000; ++e) {
    t.Fill();
  }
  t.Draw("b.i");
  std::cout << ((TH1*)gDirectory->FindObject("htemp"))->GetMaximum() << '\n';
}

Prints 1.67772e+07 and really shouldn't.

stwunsch commented 3 years ago

Here is a plain C++ reproducer, without ROOT ;)

#include <iostream>

int main() {
    float x = 0.f;
    for (int i = 0u; i < 50000000; i++) {
        x += double(1.0);
    }
    std::cout << x << std::endl;
    // returns the magic number 1.67772e+07
}

The culprit in ROOT is precisely here: https://github.com/root-project/root/blob/master/tree/treeplayer/src/TSelectorDraw.cxx#L1361

Edit: More precisely, it's filling > 2e7 double(1.0) into a TH1F bin, same than the C++ only reproducer above. Edit: We don't see this in RDataFrame because it's using TH1Ds.

eguiraud commented 3 years ago

Turns out this is a duplicate of https://sft.its.cern.ch/jira/browse/ROOT-2956

Axel-Naumann commented 3 years ago

@lmoneta given that we didn't understand this, and that this is likely not obvious to our users, could we add a test to AddBinContent() of the form:

if (w != 0 && bincontent + w == bincontent)
   Warning("AddBinContent", "Bin too full, adding %d is below its precision and does not change the bin content!", w);
lmoneta commented 3 years ago

Yes, but why using TH1F ? Everybody should always use TH1D, unless there are some memory issues. I have seen problem like this already too many times

stwunsch commented 3 years ago

From my experience, everyone is using TH1Fs because the datatype in most ntuple is float and not double (there because of the size). Most users may think that if the branch type is float, you should also use a TH1F. Not necessarily correct, but I can guarantee you that this happens a lot.

And more important, especially for NanoAOD, the following very typical user code will show a wrong result, on rather small files (see the initial reproducer):

>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')

The bare minimum would fixing the type of histogram, that TTree::Draw returns (TH1D, not TH1F), I guess.

lmoneta commented 3 years ago

This is completely wrong, an histogram type has nothing to do with the data type. The axis binning is always used in double precision. We should probably then specify this better in the TH1 documentation!

lmoneta commented 3 years ago

And yes TTree::Draw should return a TH1D. @pcanal, can this be changed ? RDataFrame.Histo1D returns TH1D already

stwunsch commented 3 years ago

This is completely wrong, an histogram type has nothing to do with the data type. The axis binning is always used in double precision. We should probably then specify this better in the TH1 documentation!

And that's not how ROOT histograms are used ;)

stwunsch commented 3 years ago

image

https://root.cern/doc/master/classTH1.html#associated-errors

I also would have understood the docs as the precision on the value that you fill in rather than the precision on the bin content. And it's also not a precision in that sense, it's a maximum bin content you can achieve by doing small increments. Actually, in principle you could have larger bin contents with float, e.g., with large weights.

pcanal commented 3 years ago

And yes TTree::Draw should return a TH1D. @pcanal, can this be changed ?

This is straightforward to change (in TSelectorDraw), it is the obvious solution to the problem (albeit it actually just delay the problem per se .. but by a lot I guess :) ) but is not without visible consequences:

Axel-Naumann commented 3 years ago

Forget it: 🦔 [Edit: ouch, not hedgehog but groundhog! You know, groundhog day. Whatever. 😅 ]

Hey, what about https://github.com/root-project/root/issues/6671#issuecomment-791443564 ?

stwunsch commented 3 years ago

What's the performance impact of https://github.com/root-project/root/issues/6671#issuecomment-791443564? I guess that's the question. Otherwise, I'd be all in!

ferdymercury commented 3 years ago

What's the performance impact of #6671 (comment)? I guess that's the question. Otherwise, I'd be all in!

A warning would be helpful indeed, but maybe only for the first time this happens? Just to avoid getting 1000 lines printed out in the terminal.

Just as a side note, there is another TTree.Draw bug out there concerning double (or ULong) precision / failing silently as in this issue: https://sft.its.cern.ch/jira/browse/ROOT-8009

Axel-Naumann commented 3 years ago

A warning would be helpful indeed, but maybe only for the first time this happens?

Certainly - but the perf impact we're concerned about is the if in the case where no diagnostic is emitted, not the subsequent warning.

Axel-Naumann commented 3 years ago

Removing v6.24 as milestone: while I'd be happy to have this addressed in v6.24, this has a long, long history; we cannot claim it's suddenly a blocker after 25 years ;-)

ferdymercury commented 3 years ago

Certainly - but the perf impact we're concerned about is the if in the case where no diagnostic is emitted, not the subsequent warning.

Ok, I see. Probably it doesn't make a difference performance-wise, but there are functions to calculate what the next available float value is: https://en.cppreference.com/w/cpp/numeric/math/nextafter

Side note: we are already losing some performance in TH1I by checking for overflow, but not for TH1F.

See:

 void TH1I::AddBinContent(Int_t bin, Double_t w)
 {
    Long64_t newval = fArray[bin] + Long64_t(w);
    if (newval > -INT_MAX && newval < INT_MAX) {fArray[bin] = Int_t(newval); return;}
    if (newval < -INT_MAX) fArray[bin] = -INT_MAX;
    if (newval >  INT_MAX) fArray[bin] =  INT_MAX;
 }

vs

void TH1F::AddBinContent(Int_t bin, Double_t w)
                                 {fArray[bin] += Float_t (w);}
dpiparo commented 8 months ago

I'd like to wrap this up. Maybe I am naive but, as pointed out by Rene on JIRA, isn't this understood? We are trying to increment by one a single precision floating point number and at some point, the gap between 2 subsequent 32bit fp numbers is larger than 1...

eguiraud commented 8 months ago

My two cents.

The root cause of the problem is understood and it is what you mentioned.

However, the way this presents to users is that if they run the following code on a large-enough inputs:

>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')

users silently get a completely wrong histogram:

image

instead of the histogram they should get:

image

Even if you are lucky enough that the histogram is so wrong that you spot the problem immediately, it is then very difficult to go from "my histogram is wrong" to "low floating point precision makes Fill effectively a no-op inside TTree::Draw". So difficult that I do not think one can reasonably expect users to diagnose the issue themselves.

With that said, new interfaces already moved to TH1D by default instead of TH1F, and TTree::Draw cannot be changed to produce a TH1D for the backward-compat issues Axel listed above, so the only remaining option, AFAICT, is to add a check after every Fill or deprecate TTree::Draw.

guitargeek commented 8 months ago

Thanks Enrico, I was just about to write the same :laughing: I think once we implement such a precision loss check in TH*F::Fill() we should be fine. I would be hesitant to implement this for general TH types because of performance overhead. But given that we encourage people to use THD because of the reasons that became apparent in this discussion, I think it's fine to do this for THF. What is you opinion on this? Deprecating TTree::Draw() is a whole different beast that is maybe outside the scope of this discussion

lmoneta commented 8 months ago

Hi, I agree to add the precision check in THF, and we already have this check-in THI, THS and THC!