stwunsch opened 3 years ago
@stwunsch which one is correct?
Good question! I guess the lower one, also checked by counting only the respective entries in the GenPart_pdgId vector.
Well, it's a DY sample, so physically speaking we would expect the upper one (?) Let's say, it's not so clear what's going on :sweat_smile:
Can you enforce the same exact binning in both and also try with raw TTree and raw TTreeReader? If RDF or TTree is producing wrong histograms it's a critical issue if not a release blocker.
@eguiraud Do we agree that the code above should produce (in principle) the same output?
@eguiraud Here you can find the plots produced with the same binning.
>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> h = ROOT.TH1F('h', 'h', 100, -20, 20)
>>> c = ROOT.TCanvas()
>>> t.Draw('GenPart_pdgId>>h')
>>> import ROOT
>>> c = ROOT.TCanvas()
>>> h = ROOT.RDataFrame('Events', 'DYJetsToLL.root').Histo1D(('h', 'h', 100, -20, 20), 'GenPart_pdgId')
>>> h.Draw()
As a cross-check, here is the manual count in RDF of the entries we see in the histogram:
>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 11)').Sum('n').GetValue()
36148151.0
>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 13)').Sum('n').GetValue()
25642431.0
>>> ROOT.RDataFrame('Events', 'DYJetsToLL.root').Define('n', 'Sum(GenPart_pdgId == 15)').Sum('n').GetValue()
18289536.0
Do we agree that the code above should produce (in principle) the same output?
I don't see why not. We really need an authoritative answer as to which of the plots is correct though. Maybe raw TTree and TTreeReader can act as tie-breaker. There are no missing entries (the number of entries in the two histograms is the same) so it seems that one of the two interfaces is reading wrong values -- but how can it be that histogram mean and variance are exactly the same in the two cases? The mean, I can understand, but the variance should be larger for the upper histo than for the lower histo.
Here is a reproducer with my best raw TTree skills:
void test() {
   auto f = TFile::Open("DYJetsToLL.root");
   auto t = f->Get<TTree>("Events");
   UInt_t n;
   Int_t x[100]; // assumes nGenPart never exceeds 100 in this file
   t->SetBranchAddress("nGenPart", &n);
   t->SetBranchAddress("GenPart_pdgId", x);
   auto n11 = 0u;
   auto n13 = 0u;
   auto n15 = 0u;
   for (auto i = 0u; i < t->GetEntries(); i++) {
      t->GetEntry(i);
      for (auto j = 0u; j < n; j++) {
         if (x[j] == 11) n11++;
         if (x[j] == 13) n13++;
         if (x[j] == 15) n15++;
      }
   }
   cout << "n11: " << n11 << endl;
   cout << "n13: " << n13 << endl;
   cout << "n15: " << n15 << endl;
}
n11: 36148151
n13: 25642431
n15: 18289536
So we can conclude that it's a plotting issue in TTree::Draw? And I think RDataFrame does the right thing.
Another ingredient: tree.Draw('GenPart_pdgId[0]') gives a correct plot:
Ok, thanks for investigating. I think it's safe to treat it as a TTree::Draw bug unless @pcanal figures out otherwise.
Ah, one last thing: could it be that there are that many points right at the bin edges? It would explain why mean and variance are the same in the two cases. Can you try with a less regular or much more fine-grained binning?
If the edges were the issue, we would see more bins filled than the six we have in the plot, right? The second round of plots uses an equidistant binning with 100 bins between -20 and 20.
These should be the bin edges used:
>>> import numpy as np
>>> np.linspace(-20, 20, 101)
array([-20. , -19.6, -19.2, -18.8, -18.4, -18. , -17.6, -17.2, -16.8,
-16.4, -16. , -15.6, -15.2, -14.8, -14.4, -14. , -13.6, -13.2,
-12.8, -12.4, -12. , -11.6, -11.2, -10.8, -10.4, -10. , -9.6,
-9.2, -8.8, -8.4, -8. , -7.6, -7.2, -6.8, -6.4, -6. ,
-5.6, -5.2, -4.8, -4.4, -4. , -3.6, -3.2, -2.8, -2.4,
-2. , -1.6, -1.2, -0.8, -0.4, 0. , 0.4, 0.8, 1.2,
1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. , 4.4, 4.8,
5.2, 5.6, 6. , 6.4, 6.8, 7.2, 7.6, 8. , 8.4,
8.8, 9.2, 9.6, 10. , 10.4, 10.8, 11.2, 11.6, 12. ,
12.4, 12.8, 13.2, 13.6, 14. , 14.4, 14.8, 15.2, 15.6,
16. , 16.4, 16.8, 17.2, 17.6, 18. , 18.4, 18.8, 19.2,
19.6, 20. ])
The values -15, -13, -11, 11, 13 and 15 are not on the edges.
Fair enough :smile:
An interesting observation: if you look at the bars in the two plots, for the RDF plot you can estimate the total number of entries as 37e6*2 + 26e6*2 + 18e6*2 = 1.62e8, which is consistent with the number of entries reported by the histogram. For the TTree::Draw plot you get, more or less, 16.75e6 * 6 = 1.01e8: why are the other entries not displayed? Did they end up in the overflow/underflow bins?
No underflows or overflows:
>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> c = ROOT.TCanvas()
>>> h = ROOT.TH1F('h', 'h', 100, -20, 20)
>>> t.Draw('GenPart_pdgId>>h')
>>> h.GetBinContent(0)
0.0
>>> h.GetBinContent(21)
0.0
>>> h.Integral()
100663296.0
@eguiraud But have a look at the integral! Indeed, your estimate is fully correct.
Last round of investigations; here is the Events->Print() output:
*............................................................................*
*Br 62 :nGenPart : nGenPart/i *
*Entries : 30458871 : Total Size= 121869654 bytes File Size = 27764921 *
*Baskets : 349 : Basket Size= 574464 bytes Compression= 4.39 *
*............................................................................*
*Br 67 :GenPart_pdgId : GenPart_pdgId[nGenPart]/I *
*Entries : 30458871 : Total Size= 767472277 bytes File Size = 175369010 *
*Baskets : 698 : Basket Size= 1738752 bytes Compression= 4.38 *
*............................................................................*
And the issue is also visible in TBrowser by clicking on the branch:
Running TTree::Scan should give an idea of the type of error made by TTree::Draw.
TTree::Scan looks fine, I guess.
$ root -l DYJetsToLL.root
root [0]
Attaching file DYJetsToLL.root as _file0...
(TFile *) 0x55a4bb156060
root [1] Events->Scan("GenPart_pdgId")
***********************************
* Row * Instance * GenPart_p *
***********************************
* 0 * 0 * 15 *
* 0 * 1 * 11 *
* 0 * 2 * -15 *
* 0 * 3 * 11 *
* 0 * 4 * -11 *
* 0 * 5 * 11 *
* 0 * 6 * 15 *
* 0 * 7 * -11 *
* 0 * 8 * 15 *
* 0 * 9 * 11 *
* 0 * 10 * 15 *
* 0 * 11 * -15 *
* 0 * 12 * 15 *
* 0 * 13 * 15 *
* 0 * 14 * 11 *
* 0 * 15 * 11 *
* 0 * 16 * 15 *
* 0 * 17 * 15 *
* 0 * 18 * 11 *
* 1 * 0 * -13 *
* 1 * 1 * -13 *
* 1 * 2 * -11 *
* 1 * 3 * 13 *
* 1 * 4 * 11 *
* 1 * 5 * 13 *
@pcanal ping (this is marked as critical)
Found a smoking gun with @stwunsch: given a file with more than 16777216 entries, say 50e6:
>>> ROOT.RDataFrame(50000000).Define("x", "1").Snapshot("t", "f.root")
TTree::Draw caps the contents of each bin at 16777216 (note in particular that h->GetEntries() returns the correct number, 50e6, but the bin contents are wrong):
root [1] t->Draw("x>>h")
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
root [2] h->GetEntries()
(double) 50000000.
root [3] h->GetNbinsX()
(int) 3
root [4] h->GetBinContent(2)
(double) 16777216.
root [5] h->GetBinContent(0)
(double) 0.0000000
root [6] h->GetBinContent(1)
(double) 0.0000000
root [7] h->GetBinContent(3)
(double) 0.0000000
@pcanal help
So, if you run on a few entries, the results are consistent between RDataFrame and TTree::Draw?
Yes
Note that the TTree in the last repro is really nothing special, it just has one integer (scalar) branch with all values equal to 1
Is the histogram of the same type? (TH1I vs TH1F)
in the snippet above, h is a TH1F
Then it's definitely weird :( ... We would have to see whether the cutoff happens down in TSelectorDraw itself or somehow in the histogram ...
Simple repro:
void repro() {
   int i = 10000;
   TTree t("t", "title");
   t.Branch("b", &i, "i/I");
   for (int e = 0; e < 50000000; ++e) {
      t.Fill();
   }
   t.Draw("b.i");
   std::cout << ((TH1*)gDirectory->FindObject("htemp"))->GetMaximum() << '\n';
}
Prints 1.67772e+07 and really shouldn't.
Here is a plain C++ reproducer, without ROOT ;)
#include <iostream>
int main() {
   float x = 0.f;
   for (int i = 0; i < 50000000; i++) {
      x += double(1.0);
   }
   std::cout << x << std::endl;
   // prints the magic number 1.67772e+07
}
The culprit in ROOT is precisely here: https://github.com/root-project/root/blob/master/tree/treeplayer/src/TSelectorDraw.cxx#L1361
Edit: More precisely, it's filling more than 2e7 values of double(1.0) into a TH1F bin, same as in the C++-only reproducer above.
Edit: We don't see this in RDataFrame because it uses TH1Ds.
Turns out this is a duplicate of https://sft.its.cern.ch/jira/browse/ROOT-2956
@lmoneta given that we didn't understand this, and that this is likely not obvious to our users, could we add a test to AddBinContent() of the form:
if (w != 0 && bincontent + w == bincontent)
   Warning("AddBinContent", "Bin too full: adding %g is below its precision and does not change the bin content!", w);
Yes, but why use TH1F? Everybody should always use TH1D, unless there are memory issues. I have seen problems like this too many times already.
From my experience, everyone is using TH1Fs because the datatype in most ntuples is float and not double (chosen for size). Most users may think that if the branch type is float, you should also use a TH1F. Not necessarily correct, but I can guarantee you that this happens a lot.
And more importantly, especially for NanoAOD, the following very typical user code will show a wrong result on rather small files (see the initial reproducer):
>>> import ROOT
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')
The bare minimum would be fixing the type of the histogram that TTree::Draw returns (TH1D, not TH1F), I guess.
This is completely wrong: a histogram's type has nothing to do with the data type. The axis binning is always handled in double precision. We should probably specify this better in the TH1 documentation!
And yes, TTree::Draw should return a TH1D. @pcanal, can this be changed?
RDataFrame.Histo1D returns TH1D already
This is completely wrong, an histogram type has nothing to do with the data type. The axis binning is always used in double precision. We should probably then specify this better in the TH1 documentation!
And that's not how ROOT histograms are used ;)
https://root.cern/doc/master/classTH1.html#associated-errors
I also would have understood the docs as describing the precision of the value that you fill in, rather than the precision of the bin content. And it's not really a precision in that sense; it's a maximum bin content you can reach by small increments. Actually, in principle you could have larger bin contents with float, e.g., with large weights.
And yes TTree::Draw should return a TH1D. @pcanal, can this be changed ?
This is straightforward to change (in TSelectorDraw); it is the obvious solution to the problem (albeit it actually just delays the problem, but by a lot I guess :) ) but it is not without visible consequences:
Forget it: 🦔 [Edit: ouch, not hedgehog but groundhog! You know, groundhog day. Whatever. 😅 ]
- switching htemp from TH1F to TH1D: 0ad637e2105e0ef248029f6377e5e73323333e00
- user code that casts the returned histogram to TH1F: if it then accesses the TArrayF (that's actually a TArrayD), that's very bad. We even had such user code in ROOT: a9570d37277a13f7871ee64e0f522725fc250e1e
Hey, what about https://github.com/root-project/root/issues/6671#issuecomment-791443564 ?
What's the performance impact of https://github.com/root-project/root/issues/6671#issuecomment-791443564? I guess that's the question. Otherwise, I'd be all in!
A warning would be helpful indeed, but maybe only for the first time this happens? Just to avoid getting 1000 lines printed out in the terminal.
Just as a side note, there is another TTree.Draw bug out there concerning double (or ULong) precision / failing silently as in this issue: https://sft.its.cern.ch/jira/browse/ROOT-8009
A warning would be helpful indeed, but maybe only for the first time this happens?
Certainly - but the perf impact we're concerned about is the if in the case where no diagnostic is emitted, not the subsequent warning.
Removing v6.24 as milestone: while I'd be happy to have this addressed in v6.24, this has a long, long history; we cannot claim it's suddenly a blocker after 25 years ;-)
Certainly - but the perf impact we're concerned about is the if in the case where no diagnostic is emitted, not the subsequent warning.
Ok, I see. It probably doesn't make a difference performance-wise, but there are functions to calculate the next representable float value:
https://en.cppreference.com/w/cpp/numeric/math/nextafter
Side note: we are already losing some performance in TH1I by checking for overflow, but not for TH1F.
See:
void TH1I::AddBinContent(Int_t bin, Double_t w)
{
   Long64_t newval = fArray[bin] + Long64_t(w);
   if (newval > -INT_MAX && newval < INT_MAX) {fArray[bin] = Int_t(newval); return;}
   if (newval < -INT_MAX) fArray[bin] = -INT_MAX;
   if (newval > INT_MAX) fArray[bin] = INT_MAX;
}
vs
void TH1F::AddBinContent(Int_t bin, Double_t w)
{
   fArray[bin] += Float_t(w);
}
I'd like to wrap this up. Maybe I am naive but, as pointed out by Rene on JIRA, isn't this understood? We are trying to increment by one a single-precision floating point number, and at some point the gap between two consecutive 32-bit fp numbers is larger than 1...
My two cents.
The root cause of the problem is understood and it is what you mentioned.
However, the way this presents to users is that if they run the following code on large enough inputs:
>>> f = ROOT.TFile('DYJetsToLL.root')
>>> t = f.Get('Events')
>>> t.Draw('GenPart_pdgId')
users silently get a completely wrong histogram instead of the one they should get.
Even if you are lucky enough that the histogram is so wrong that you spot the problem immediately, it is then very difficult to go from "my histogram is wrong" to "low floating point precision makes Fill effectively a no-op inside TTree::Draw". So difficult that I do not think one can reasonably expect users to diagnose the issue themselves.
With that said, new interfaces already moved to TH1D by default instead of TH1F, and TTree::Draw cannot be changed to produce a TH1D for the backward-compat issues Axel listed above, so the only remaining option, AFAICT, is to add a check after every Fill, or deprecate TTree::Draw.
Thanks Enrico, I was just about to write the same :laughing: I think once we implement such a precision loss check in TH*F::Fill() we should be fine. I would be hesitant to implement this for general TH* types because of the performance overhead. But given that we encourage people to use TH*D because of the reasons that became apparent in this discussion, I think it's fine to do this for TH*F. What is your opinion on this? Deprecating TTree::Draw() is a whole different beast that is maybe outside the scope of this discussion.
Hi, I agree to add the precision check in TH*F, and we already have this check in TH*I, TH*S and TH*C!
See for comparison the screenshot below.
The upper plot was made with TTree.Draw; the lower plot was made with RDataFrame.Histo1D.
I've used ROOT 6.22/02 and you can download the file here:
http://opendata.web.cern.ch/record/12353