Open jpivarski opened 4 years ago
Add to that one more: I think scikit-hep/uproot#510 is another example of that. Look at the second file, uproot-issue510b.root:
>>> import uproot4, skhep_testdata
>>> t = uproot4.open(skhep_testdata.data_path("uproot-issue510b.root"))["EDepSimEvents"]
>>> b = t["Event"]["Trajectories.Points"]
>>> b.debug(0)
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 0 101 102 64 9 0 1 0 0 0 2 0 1 0 0 0 0 2 0
@ --- e f @ --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 1 0 0 0 0 2 0 0 0 64 0 0 60 0 4 0 1
--- --- --- --- --- --- --- --- --- --- --- --- @ --- --- < --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 2 0 0 0 64 0 0 36 0 3 0 1 0 0 0 0
--- --- --- --- --- --- --- --- @ --- --- $ --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
2 0 0 0 64 103 237 14 20 44 204 192 64 99 170 169 116 55 10 48
--- --- --- --- @ g --- --- --- , --- --- @ c --- --- t 7 --- 0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 192 63 130 249 6 230 103 63 240 0 0 0 0 0 0 64 0 0 60
@ --- ? --- --- --- --- g ? --- --- --- --- --- --- --- @ --- --- <
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 4 0 1 0 0 0 0 2 0 0 0 64 0 0 36 0 3 0 1
--- --- --- --- --- --- --- --- --- --- --- --- @ --- --- $ --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 2 0 0 0 64 104 149 31 100 192 97 100 64 98 140 241
--- --- --- --- --- --- --- --- @ h --- --- d --- a d @ b --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
140 93 110 7 64 192 67 18 202 151 200 123 63 240 171 196 70 133 147 27
--- ] n --- @ --- C --- --- --- --- { ? --- --- --- F --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 0 0 36 0 3 0 1 0 0 0 0 2 0 0 0 64 4 167 135
@ --- --- $ --- --- --- --- --- --- --- --- --- --- --- --- @ --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
146 180 211 210 192 17 142 116 33 185 246 118 64 12 3 158 90 174 184 82
--- --- --- --- --- --- --- t ! --- --- v @ --- --- --- Z --- --- R
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 0 0 36 0 3 0 1 0 0 0 0 2 0 0 0 0 0 0 0
@ --- --- $ --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 0 2
--- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+
This is a collection of std::vector<TG4TrajectoryPoint>, where TG4TrajectoryPoint is
>>> t.file.streamer_named("TG4TrajectoryPoint").show()
TG4TrajectoryPoint (v1): TObject (v1)
Position: TLorentzVector (TStreamerObject)
Momentum: TVector3 (TStreamerObject)
Process: int (TStreamerBasicType)
Subprocess: int (TStreamerBasicType)
The first 6 bytes are the header, as usual: 64 0 101 102 64 9. (That's the right num_bytes for the entry.)
Next, we're looking at a split std::vector header:
| 0 1 | 0 0 0 2 | 0 1 | 0 0 0 0 | 2 0 0 0 | 0 1 | 0 0 0 0 | 2 0 0 0 |
| | two objects | bits for #1 | bits for #2 |
Then follow two TLorentzVectors:
[191.4079686045643, 157.33318529844246, 8319.023224699498, 1.0]
[196.6600822217398, 148.4044858765412, 8326.14680764474, 1.0419352297551032]
and two TVector3:
[2.5818015538629675, -4.389114882445588, 3.5017668804712594]
[0.0, -0.0, 0.0]
and two integers, 0 and 2.
Following that is a header for 0 objects and then a header for 32 objects:
--+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 0 14 0 0 0 32
--- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+-
and, indeed, there are 32 ten-byte std::vector headers:
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 2 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Right after that, the TLorentzVectors start up again:
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 0 0 60 0 4 0 1 0 0 0 0 2 0 0 0 64 0 0 36
@ --- --- < --- --- --- --- --- --- --- --- --- --- --- --- @ --- --- $
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 3 0 1 0 0 0 0 2 0 0 0 64 103 237 14 20 44 204 192
--- --- --- --- --- --- --- --- --- --- --- --- @ g --- --- --- , --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 99 170 169 116 55 10 48 64 192 63 130 249 6 230 103 63 240 0 0
@ c --- --- t 7 --- 0 @ --- ? --- --- --- --- g ? --- --- ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0
--- --- --- ---
--+---+---+---+
This one is
[191.4079686045643, 157.33318529844246, 8319.023224699498, 1.0]
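As a rough cross-check of the dump above, the 64 bytes can be decoded by hand with Python's struct module. This is only a sketch: the function names are made up for illustration, and treating each 10-byte run as a TObject header that can be skipped is an assumption based on the byte pattern, not something read from the streamer.

```python
import struct

K_BYTE_COUNT_MASK = 0x40000000  # bit 30, named after ROOT's kByteCountMask

# The 64 bytes shown in the dump above, copied by hand.
raw = bytes([
    64, 0, 0, 60, 0, 4,                  # TLorentzVector: byte count 60, version 4
    0, 1, 0, 0, 0, 0, 2, 0, 0, 0,        # 10 bytes, assumed to be a TObject header
    64, 0, 0, 36, 0, 3,                  # TVector3: byte count 36, version 3
    0, 1, 0, 0, 0, 0, 2, 0, 0, 0,        # 10 bytes, assumed to be a TObject header
    64, 103, 237, 14, 20, 44, 204, 192,  # first double
    64, 99, 170, 169, 116, 55, 10, 48,   # second double
    64, 192, 63, 130, 249, 6, 230, 103,  # third double
    63, 240, 0, 0, 0, 0, 0, 0,           # fourth double
])


def read_header(buf, pos):
    """Read a 4-byte count and a 2-byte version; return them with the new position."""
    count, version = struct.unpack_from(">IH", buf, pos)
    return count & ~K_BYTE_COUNT_MASK, version, pos + 6


def read_lorentz_vector(buf, pos=0):
    outer_bytes, outer_version, pos = read_header(buf, pos)
    pos += 10  # skip the assumed TObject header
    inner_bytes, inner_version, pos = read_header(buf, pos)
    pos += 10  # skip the assumed TObject header
    values = list(struct.unpack_from(">4d", buf, pos))
    return (outer_bytes, outer_version), (inner_bytes, inner_version), values


outer, inner, values = read_lorentz_vector(raw)
print(outer, inner, values)
```

The byte counts are self-consistent under this reading: 36 = 2 + 10 + 24 for the inner object, and 60 = 2 + 10 + 40 + 8 for the outer one.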
Similarly, there's also a "name" field that claims to have type std::string:
>>> t["Event"]["Trajectories.Name"].streamer
<TStreamerSTLstring at 0x7f33475eaf10>
but it's clearly a collection of strings (53 of them):
>>> t["Event"]["Trajectories.Name"].debug(0)
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 0 0 214 0 9 5 103 97 109 109 97 3 109 117 45 6 112 114 111
@ --- --- --- --- --- --- g a m m a --- m u - --- p r o
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
116 111 110 6 112 114 111 116 111 110 6 112 114 111 116 111 110 6 112 114
t o n --- p r o t o n --- p r o t o n --- p r
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
111 116 111 110 6 112 114 111 116 111 110 6 112 114 111 116 111 110 7 110
o t o n --- p r o t o n --- p r o t o n --- n
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
101 117 116 114 111 110 7 110 101 117 116 114 111 110 7 110 101 117 116 114
e u t r o n --- n e u t r o n --- n e u t r
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
111 110 7 110 101 117 116 114 111 110 2 101 45 2 101 45 2 101 45 2
o n --- n e u t r o n --- e - --- e - --- e - ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45
e - --- e - --- e - --- e - --- e - --- e - --- e -
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101
--- e - --- e - --- e - --- e - --- e - --- e - --- e
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2
- --- e - --- e - --- e - --- e - --- e - --- e - ---
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45
e - --- e - --- e - --- e - --- e - --- e - --- e -
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
2 101 45 2 101 45 2 101 45 2 101 45 2 101 43 7 110 101 117 116
--- e - --- e - --- e - --- e - --- e + --- n e u t
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
114 111 110 2 101 45 2 101 45 2 101 45 2 101 45 2 101 45
r o n --- e - --- e - --- e - --- e - --- e -
The key: I think these are both TClonesArrays! They have a non-empty fTClonesName member.
@tamasgal, I know that you're busy with UnROOT.jl, but if you're ever interested in solving a mystery, we now have 6 issues that are due to cases in which ROOT writes the subentries 1, 2, 3 of structs with fields a, b as
a1 a2 a3 b1 b2 b3
instead of
a1 b1 a2 b2 a3 b3
It's a lot like branch-splitting, but this happens inside of each entry. I've been able to reverse engineer the fact that this is happening, but not why it's happening: what information in the TBranch(Element), its parent branches, maybe the TTree itself, and associated streamers might tell us that we should deserialize it this way instead of the normal way. If you find anything that might shed some light on it (or even what this mode is named!), I'd be grateful.
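To make the two orderings concrete, here is a toy sketch in Python. It is not ROOT's actual on-disk format (there are no byte counts, versions, or headers), and the struct with a double field a and an int field b, as well as all function names, are invented for illustration.

```python
import struct

# Hypothetical struct with two fields: a (float64) and b (int32), big-endian.
objects = [(1.5, 10), (2.5, 20), (3.5, 30)]


def write_objectwise(objs):
    """a1 b1 a2 b2 a3 b3: each object is written whole, one after another."""
    return b"".join(struct.pack(">di", a, b) for a, b in objs)


def write_memberwise(objs):
    """a1 a2 a3 b1 b2 b3: all values of one member, then all of the next."""
    n = len(objs)
    a_part = struct.pack(f">{n}d", *(a for a, _ in objs))
    b_part = struct.pack(f">{n}i", *(b for _, b in objs))
    return a_part + b_part


def read_objectwise(buf, n):
    # Each object occupies 8 + 4 = 12 bytes.
    return [struct.unpack_from(">di", buf, 12 * i) for i in range(n)]


def read_memberwise(buf, n):
    a_values = struct.unpack_from(f">{n}d", buf, 0)
    b_values = struct.unpack_from(f">{n}i", buf, 8 * n)
    return list(zip(a_values, b_values))


objectwise = write_objectwise(objects)
memberwise = write_memberwise(objects)
```

Both buffers have the same length and hold the same values. Only the order differs, which is why a reader has to know in advance which of the two modes was used.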
The cases in this function:
are guesses, based on a few examples, so don't copy them without care!
Also, if you know of anyone else who's inclined to dig into these details, let me know. I'm looking for help!
Oh, that seems to be a tough one. I bookmarked it and will try to see if I can find something new! Currently I am busier with my PhD than anything else, but I am certainly interested :sweat_smile:
I think I might have seen something similar in KM3NeT data, I have to dig in my notes, I hope I find it since for that I could also provide the source code...
I just learned from Philippe that it's called "memberwise streaming" (as opposed to "objectwise streaming") and the ROOT code for deserializing these objects is here:
The way to identify that a particular object is serialized this way is by checking for bit 14 (2**14 == 16384) in the instance version, TBufferFile::kStreamedMemberWise. Also, that bit has to be removed from the instance version before comparing it with the class version (second line of the quoted code above).
Indeed, in the uproot-issue510b.root file I investigated above, the version number does have bit 14 set:
>>> import uproot4, skhep_testdata
>>> t = uproot4.open(skhep_testdata.data_path("uproot-issue510b.root"))["EDepSimEvents"]
>>> b = t["Event"]["Trajectories.Points"]
>>> b.debug(0, limit_bytes=6)
--+---+---+---+---+---+
64 0 101 102 64 9
@ --- e f @ ---
--+---+---+---+---+---+
The first four bytes are the size of this entry (with the kByteCountMask bit removed),
>>> numpy.array([64, 0, 101, 102], "u1").view(">u4") & ~(2**30)
array([25958])
and the next two bytes are the version number with a kStreamedMemberWise bit set,
>>> numpy.array([64, 9], "u1").view(">u2") & ~(2**14)
array([9], dtype=int32)
So these things are identified one object at a time (even though a branch is likely to consist entirely of one type of serialization or the other).
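The two masking steps above can be combined into one small helper. This is a sketch using only the standard library. The constant names mirror ROOT's kByteCountMask and kStreamedMemberWise, but decode_header itself is a made-up name, not part of Uproot.

```python
import struct

K_BYTE_COUNT_MASK = 1 << 30      # named after ROOT's kByteCountMask
K_STREAMED_MEMBERWISE = 1 << 14  # named after TBufferFile::kStreamedMemberWise


def decode_header(six_bytes):
    """Split a 6-byte header into (num_bytes, version, is_memberwise)."""
    raw_count, raw_version = struct.unpack(">IH", six_bytes)
    num_bytes = raw_count & ~K_BYTE_COUNT_MASK
    is_memberwise = bool(raw_version & K_STREAMED_MEMBERWISE)
    # The flag must be removed before comparing with the class version.
    version = raw_version & ~K_STREAMED_MEMBERWISE
    return num_bytes, version, is_memberwise


print(decode_header(bytes([64, 0, 101, 102, 64, 9])))  # (25958, 9, True)
print(decode_header(bytes([64, 0, 0, 60, 0, 4])))      # (60, 4, False)
```

The second call uses the TLorentzVector header from the earlier dump, which does not have the memberwise bit set.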
For making tests, I think the way a class can be put into this mode is by calling TClass::SetCanSplit(true) on its TClass object (TClass::GetClass("class name")). I'm not 100% certain whether this controls memberwise/objectwise splitting, ordinary branch splitting, or both. But it would be nice to see the same class written as memberwise and as objectwise, for confidence that we're doing it right.
root/test/bench.cxx might make examples with and without memberwise splitting, but this is part of ROOT's benchmark tests and relies on other code that I haven't followed to its definitions. It might be possible to simply run this benchmark to generate files with memberwise and objectwise serialization.
TVirtualStreamerInfo has a SetStreamMemberWise(bool) method, but I don't know if that means we can directly use it to make tests.
I'm just writing these things here as notes, so that this information does not get lost.
This should be considered a bug at least until we have a "not implemented" error message for this case, but fully implementing it is a feature. I think I'll put in one PR to add the "not implemented" message and then remove the "bug" label from this issue.
Oh wow, I didn't even have a chance 😅
Nice to hear that the mystery is mostly solved.
You most certainly also found this thread, also from Philippe: https://root-forum.cern.ch/t/splitability-of-classes-with-custom-streamer/32974
For me, this statement from Philippe in particular was quite new:
If a class has a custom Streamer we have to assume that it is for a good reason :). When splitting is used, the custom Streamer is not used at all and thus we are (silently) not doing what the user (likely) intended often leading to corrupted results.
Btw. I searched our codebase for SetCanSplit but have not found any use of that. I also have not found my notes which were about a strange split structure just like you described, only some sketches of the split-branch strategy which is well-known.
I just asked him about it at the ROOT I/O meeting, which is every Friday:
Oh nice, it seems to be open for externals, at least I was able to join the video room ;)
So nicely, #209 has a TEfficiency which looks like a good vector of attack (pun intended) for memberwise serialization. This can be easily made like so
import ROOT
fp = ROOT.TFile.Open("test-efficiency.root", "RECREATE")
nbins = 11
h_den = ROOT.TH1F('h_den', 'h_den', nbins, 0, 100)
h_num = ROOT.TH1F('h_num', 'h_num', nbins, 0, 100)
for i in range(1, nbins):
    h_num.SetBinContent(i, 2**i)
    h_den.SetBinContent(i, 2**(i+1))
eff = ROOT.TEfficiency(h_num, h_den)
eff.SetName('TEfficiencyName')
eff.SetTitle('TEfficiencyTitle')
h_den.Write()
h_num.Write()
eff.Write()
fp.Close()
to get a small ROOT file to play around with. Reading this TEfficiency raises an error when doing
with uproot.open('test-efficiency.root') as fp:
    eff = fp['TEfficiencyName']
like so
Traceback (most recent call last):
File "run.py", line 12, in <module>
tree = fp['TEfficiencyName']
File "/Users/kratsg/uproot4/uproot/reading.py", line 1979, in __getitem__
return self.key(where).get()
File "/Users/kratsg/uproot4/uproot/reading.py", line 2364, in get
out = cls.read(chunk, cursor, context, self._file, selffile, parent)
File "/Users/kratsg/uproot4/uproot/model.py", line 1181, in read
versioned_cls.read(
File "/Users/kratsg/uproot4/uproot/model.py", line 800, in read
self.read_members(chunk, cursor, context, file)
File "<dynamic>", line 12, in read_members
File "/Users/kratsg/uproot4/uproot/containers.py", line 798, in read
raise NotImplementedError(
NotImplementedError: memberwise serialization of AsVector
in file test-efficiency.root
so we can start here. This is also quick to iterate and make new ROOT files with different values to determine that we have the right offsets.
The reason for the SetName and SetTitle is to match the file in #209 that has the issue. So at least it's just trying to match the structure there as much as possible.
Here's the full sequence of that TEfficiency with the following histograms stored:
$ cat tefficiency.py
import ROOT
#fp = ROOT.TFile.Open('uproot4-issue209.root')
fp = ROOT.TFile.Open('test-efficiency.root')
eff = fp.TEfficiencyName
num = eff.GetPassedHistogram()
den = eff.GetTotalHistogram()
print([num.GetBinContent(i) for i in range(len(num)+1)])
print([den.GetBinContent(i) for i in range(len(den)+1)])
$ roopython3 tefficiency.py
[0.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0, 0.0, 0.0, 0.0]
[0.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0, 2048.0, 0.0, 0.0, 0.0]
as below.
the numerator is at 603 and then +52 bytes for the full array contents
# offset=623, dtype=">f4"
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 13 0 0 0 0 64 0 0 0 64 128 0 0 65 0 0 0 65
--- --- --- --- --- --- --- @ --- --- --- @ --- --- --- A --- --- --- A
0.0 2.0 4.0 8.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
128 0 0 66 0 0 0 66 128 0 0 67 0 0 0 67 128 0 0 68
--- --- --- B --- --- --- B --- --- --- C --- --- --- C --- --- --- D
16.0 32.0 64.0 128.0 256.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 68 128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64
--- --- --- D --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- @
512.0 1024.0 0.0 0.0 0.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
and the denominator is at 1256 and then +52 bytes for the full array contents
# offset=1256, dtype=">f4"
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 13 0 0 0 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
0.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
64 128 0 0 65 0 0 0 65 128 0 0 66 0 0 0 66 128 0 0
@ --- --- --- A --- --- --- A --- --- --- B --- --- --- B --- --- ---
4.0 8.0 16.0 32.0 64.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
67 0 0 0 67 128 0 0 68 0 0 0 68 128 0 0 69 0 0 0
C --- --- --- C --- --- --- D --- --- --- D --- --- --- E --- --- ---
128.0 256.0 512.0 1024.0 2048.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 0 0 63 240 0 0 0 0 0 0
--- --- --- --- --- --- --- --- ? --- --- --- --- --- --- ---
0.0 0.0 1.875 0.0
Continuing further with this Python script, it seems that the memberwise error comes from the AsVector::read call. So if we look at the streamer for TEfficiency here and the corresponding class code
>>> fp.file.streamer_named('TEfficiency').class_code()
the read_members has this line:
self._members['fBeta_bin_params'] = self._stl_container0.read(chunk, cursor, context, file, self._file, self._concrete)
which indicates that the memberwise reading is failing for fBeta_bin_params. So, going back to the Python code for making the TEfficiency, we do the following
eff.SetBetaBinParameters(0, -1.0, -2.0)
for i in range(1, nbins):
    eff.SetBetaBinParameters(i, 2**i, 2**(i+1))
eff.SetBetaBinParameters(nbins, -1.0, -2.0)
which bookends the alpha parameters by -1.0 and the beta parameters by -2.0 to make them easier to identify. Dumping out the chunk again and playing with the offset a bit (finding double-precision values for these parameters, using >f8), we find them:
(Pdb) cursor.debug(chunk, dtype=">f8", offset=2, limit_bytes=240)
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 215 190 210 0 0 0 13 191 240 0 0 0 0 0 0 64 0
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- @ ---
1.3525824167906353e-304 -1.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 64 16 0 0 0 0 0 0 64 32 0 0 0 0
--- --- --- --- --- --- @ --- --- --- --- --- --- --- @ --- --- --- ---
2.0 4.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 64 48 0 0 0 0 0 0 64 64 0 0 0 0 0 0 64 80
--- --- @ 0 --- --- --- --- --- --- @ @ --- --- --- --- --- --- @ P
8.0 16.0 32.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 64 96 0 0 0 0 0 0 64 112 0 0 0 0
--- --- --- --- --- --- @ ` --- --- --- --- --- --- @ p --- --- --- ---
64.0 128.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 64 128 0 0 0 0 0 0 64 144 0 0 0 0 0 0 191 240
--- --- @ --- --- --- --- --- --- --- @ --- --- --- --- --- --- --- --- ---
256.0 512.0 1024.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 63 240 0 0 0 0 0 0 192 0 0 0 0 0
--- --- --- --- --- --- ? --- --- --- --- --- --- --- --- --- --- --- --- ---
-1.0 1.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 64 16 0 0 0 0 0 0 64 32 0 0 0 0 0 0 64 48
--- --- @ --- --- --- --- --- --- --- @ --- --- --- --- --- --- @ 0
-2.0 4.0 8.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 64 64 0 0 0 0 0 0 64 80 0 0 0 0
--- --- --- --- --- --- @ @ --- --- --- --- --- --- @ P --- --- --- ---
16.0 32.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 64 96 0 0 0 0 0 0 64 112 0 0 0 0 0 0 64 128
--- --- @ ` --- --- --- --- --- --- @ p --- --- --- --- --- --- @ ---
64.0 128.0 256.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 0 0 0 0 64 144 0 0 0 0 0 0 64 160 0 0 0 0
--- --- --- --- --- --- @ --- --- --- --- --- --- --- @ --- --- --- --- ---
512.0 1024.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
0 0 192 0 0 0 0 0 0 0 63 240 0 0 0 0 0 0 63 229
--- --- --- --- --- --- --- --- --- --- ? --- --- --- --- --- --- --- ? ---
2048.0 -2.0 1.0
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
216 151 162 65 163 245 64 0 0 17 0 5 0 1 0 0 0 0 3 0
--- --- --- A --- --- @ --- --- --- --- --- --- --- --- --- --- --- --- ---
0.682689492137 2.0000324250722774
So to summarize so far, we have (for the chunk + cursor):
0 0 0 215 190 210 0 0 0 13
the alpha parameters (bookended by -1.0)
63 240 0 0 0 0 0 0
the beta parameters (bookended by -2.0)
63 240 0 0 0 0 0 0 (unclear if this is part of the fBeta_bin_params or not)
Ok, getting further, we have a slight issue: I think uproot might need to be refactored, since I think memberwise streaming breaks the current Model.read assumptions. Let me step through what I see happen.
With this initial "preprocessing" sequence, we have:
_num_memberwise_bytes=215 (maybe? there's actually 214 bytes left to read the rest of the structure, not 215.. coincidence?)
_something_else=48850 (maybe a version of some sort? unsure. still seems very odd)
length=13 (this at least makes sense)
So far, so good. length here refers to the length of the std::vector (which we always want as we allocate that many values when reading). The _num_memberwise_bytes is interesting, as it seems to be off-by-one perhaps (will remake this file but with more entries in the std::vector to see...)
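For reference, here is a standard-library sketch of how the ten bytes 0 0 0 215 190 210 0 0 0 13 and the doubles after them could be read in a memberwise fashion. It rests on several assumptions: the ten bytes are split as 4 + 2 + 4 (matching the tentative field names above), the 13 first members are followed by the 13 second members, and the trailing 1.0 in each run is taken to be the 13th element. The payload is rebuilt from the values visible in the dump rather than read from the file, and read_memberwise_pairs is a made-up name.

```python
import struct

# Values as they appear in the >f8 dump above (13 per member).
alphas = [-1.0] + [2.0**i for i in range(1, 11)] + [-1.0, 1.0]
betas = [-2.0] + [2.0 ** (i + 1) for i in range(1, 11)] + [-2.0, 1.0]

header = bytes([0, 0, 0, 215, 190, 210, 0, 0, 0, 13])
payload = header + struct.pack(">13d", *alphas) + struct.pack(">13d", *betas)


def read_memberwise_pairs(buf):
    """Read a vector of (double, double) pairs stored member by member."""
    # Assumed split of the ten bytes: uint32, uint16, uint32 (big-endian).
    first_field, second_field, length = struct.unpack_from(">IHI", buf, 0)
    pos = 10
    firsts = struct.unpack_from(f">{length}d", buf, pos)
    pos += 8 * length
    seconds = struct.unpack_from(f">{length}d", buf, pos)
    return (first_field, second_field, length), list(zip(firsts, seconds))


fields, pairs = read_memberwise_pairs(payload)
print(fields)     # (215, 48850, 13)
print(pairs[:3])  # [(-1.0, -2.0), (2.0, 4.0), (4.0, 8.0)]
```

Under this reading, 2 + 4 + 26 * 8 = 214 bytes follow the first four-byte field, which agrees with the 214 bytes counted above.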
Now, the problem is the following. At this point, we make a call to _read_nested:
values = _read_nested(
    self._values, length, chunk, cursor, context, file, selffile, parent
)
which goes into this loop
for i in uproot._util.range(length):
    values[i] = model.read(chunk, cursor, context, file, selffile, parent)
    print(cursor, values[i])
where values is a 13-element array allocated correctly. However, the values coming out of this are nonsensical... But it was clearly fine! In fact, the model.read is shifting the cursor another 2 bytes to read a version. Here's a portion of model.read:
self.hook_before_read(chunk=chunk, cursor=cursor, context=context, file=file)
self.read_numbytes_version(chunk, cursor, context)
where self.read_numbytes_version will read in 2 bytes to grab the version. Problematically, as you can see, we have what might be a version number before the length instead of after. This makes things difficult. So my idea is one of two ways:
model.read call I think]
I found the hook to be too hard to figure out (seems that it's something that ROOT might have to hook into before or after creating) so I ended up implementing the second option using rollback_nbytes like so:
def _read_nested(
    model, length, chunk, cursor, context, file, selffile, parent, header=True, rollback_nbytes=0
):
    if isinstance(model, numpy.dtype):
        return cursor.array(chunk, length, model, context)
    else:
        values = numpy.empty(length, dtype=_stl_object_type)
        if isinstance(model, AsContainer):
            for i in uproot._util.range(length):
                cursor._index = cursor._index - rollback_nbytes
                values[i] = model.read(
                    chunk, cursor, context, file, selffile, parent, header=header
                )
        else:
            for i in uproot._util.range(length):
                cursor._index = cursor._index - rollback_nbytes
                values[i] = model.read(chunk, cursor, context, file, selffile, parent)
                print(cursor, values[i])
        return values
which is CLEARLY hacky but I'm ok with this for right now. I'm able to read out the memberwise object without an error, but I need to now teach the Cursor to read in a memberwise fashion (jumping around the cursor for me instead).
Adding this in here to make sure we don't lose useful information: https://root-forum.cern.ch/t/how-to-enable-tbufferfile-kstreamedmemberwise-for-specific-branches-in-a-ttree/43788/6 .
I stumbled across this issue today and I am more confused than ever.
TL;DR version: I can read in TEfficiency if I open skhep_testdata.data_path("uproot-issue38c.root") before my file.
import uproot
import skhep_testdata
with uproot.open(skhep_testdata.data_path("uproot-issue209.root")) as fp:
    eff = fp["TEfficiencyName"]
    print(eff)

import uproot
import skhep_testdata
with uproot.open(skhep_testdata.data_path("uproot-issue38c.root")) as fp:
    hist = fp["TEfficiencyName"]  # need to load the TEfficiency
with uproot.open(skhep_testdata.data_path("uproot-issue209.root")) as fp:
    eff = fp["TEfficiencyName"]
    print(eff)
What kind of black magic is included in "uproot-issue38c.root" and how can I add it to my files?
There is a global state change when you open a file (i.e. the black magic). There's a global uproot.classes dict with Python Models for C++ class name-version pairs, such as TEfficiency version XYZ. When trying to read an instance of the class from a file, it first uses the Model in the global uproot.classes, which defines some deserialization procedure. If that deserialization procedure fails, it then tries reading the specific file's TStreamerInfo, which encodes deserialization procedures for each class name-version in the file (maybe; some TStreamerInfos are missing some classes). If the second try fails, you get an error message. New class name-version combinations are added to the uproot.classes dict when they're learned.
Although you'd think that a particular class name-version pair would always have the same deserialization procedure, maybe the file was made with a custom-compiled version of ROOT, which has new C++ members added to a class without a new version number, or maybe the file was hadd'ed with another that does, etc. That's why we have a try, try-again procedure, and even that might fail if it's weird enough and doesn't declare its weirdness.
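The lookup order described above can be modelled with a few lines of plain Python. This is a toy sketch, not Uproot's implementation: the classes dict here is a local stand-in, and read_with_fallback, read_two_doubles, and read_three_doubles are all invented names.

```python
import struct

# Toy stand-in for the global registry described above; NOT Uproot's real API.
classes = {}


def read_two_doubles(raw):
    """Pretend 'built-in' layout: exactly two big-endian doubles."""
    if len(raw) != 16:
        raise ValueError("layout mismatch")
    return struct.unpack(">2d", raw)


def read_three_doubles(raw):
    """Pretend layout learned from the file's own streamer info."""
    if len(raw) != 24:
        raise ValueError("layout mismatch")
    return struct.unpack(">3d", raw)


def read_with_fallback(key, raw, file_streamers):
    """Try the globally known deserializer first, then the file's own."""
    known = classes.get(key)
    if known is not None:
        try:
            return known(raw)
        except ValueError:
            pass  # presumed layout was wrong; fall through to the file's streamer
    from_file = file_streamers.get(key)
    if from_file is None:
        raise KeyError(f"no deserializer known for {key}")
    result = from_file(raw)   # a failure here propagates to the caller
    classes[key] = from_file  # global state change: remember what was learned
    return result


key = ("Toy", 1)
classes[key] = read_two_doubles
raw = struct.pack(">3d", 1.0, 2.0, 3.0)
result = read_with_fallback(key, raw, {key: read_three_doubles})
```

After this call the registry holds the layout learned from the "file", so a later read of the same class succeeds even when the second file offers no streamer for it. That is the kind of order dependence seen with the two test files.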
To get more insight into what's going on in this case, you can look at
fp.file.show_streamers("TEfficiency")
(uproot.ReadOnlyFile.show_streamers) to see if there are different versions of some class (maybe one of the classes TEfficiency inherits from or contains) or if they have the same version but nevertheless different deserialization procedures (described as a sequence of member data and their types).
Hi Jim,
Thank you for the explanation. I've been trying to figure out the differences between the two files since I wanted to rewrite my old files with newer ROOT to add the streamers (if there is a way).
fp.file.show_streamers("TEfficiency")
Unfortunately, the output of that line is identical for both skhep_testdata.data_path("uproot-issue209.root") and skhep_testdata.data_path("uproot-issue38c.root"). There must be another difference between these two files.
According to the Uproot test test_0038-memberwise-serialization.py, uproot-issue209.root should not contain any streamers (at least it also fails without reset_classes), yet fp.file.show_streamers("TEfficiency") reports them.
THashList (v0): TList (v5)

TAttAxis (v4)
fNdivisions: int (TStreamerBasicType)
fAxisColor: short (TStreamerBasicType)
fLabelColor: short (TStreamerBasicType)
fLabelFont: short (TStreamerBasicType)
fLabelOffset: float (TStreamerBasicType)
fLabelSize: float (TStreamerBasicType)
fTickLength: float (TStreamerBasicType)
fTitleOffset: float (TStreamerBasicType)
fTitleSize: float (TStreamerBasicType)
fTitleColor: short (TStreamerBasicType)
fTitleFont: short (TStreamerBasicType)

TAxis (v10): TNamed (v1), TAttAxis (v4)
fNbins: int (TStreamerBasicType)
fXmin: double (TStreamerBasicType)
fXmax: double (TStreamerBasicType)
fXbins: TArrayD (TStreamerObjectAny)
fFirst: int (TStreamerBasicType)
fLast: int (TStreamerBasicType)
fBits2: unsigned short (TStreamerBasicType)
fTimeDisplay: bool (TStreamerBasicType)
fTimeFormat: TString (TStreamerString)
fLabels: THashList* (TStreamerObjectPointer)
fModLabs: TList* (TStreamerObjectPointer)

TH1 (v8): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)
fNcells: int (TStreamerBasicType)
fXaxis: TAxis (TStreamerObject)
fYaxis: TAxis (TStreamerObject)
fZaxis: TAxis (TStreamerObject)
fBarOffset: short (TStreamerBasicType)
fBarWidth: short (TStreamerBasicType)
fEntries: double (TStreamerBasicType)
fTsumw: double (TStreamerBasicType)
fTsumw2: double (TStreamerBasicType)
fTsumwx: double (TStreamerBasicType)
fTsumwx2: double (TStreamerBasicType)
fMaximum: double (TStreamerBasicType)
fMinimum: double (TStreamerBasicType)
fNormFactor: double (TStreamerBasicType)
fContour: TArrayD (TStreamerObjectAny)
fSumw2: TArrayD (TStreamerObjectAny)
fOption: TString (TStreamerString)
fFunctions: TList* (TStreamerObjectPointer)
fBufferSize: int (TStreamerBasicType)
fBuffer: double* (TStreamerBasicPointer)
fBinStatErrOpt: TH1::EBinErrorOpt (TStreamerBasicType)
fStatOverflows: TH1::EStatOverflows (TStreamerBasicType)

TCollection (v3): TObject (v1)
fName: TString (TStreamerString)
fSize: int (TStreamerBasicType)

TSeqCollection (v0): TCollection (v3)

TList (v5): TSeqCollection (v0)

TAttMarker (v2)
fMarkerColor: short (TStreamerBasicType)
fMarkerStyle: short (TStreamerBasicType)
fMarkerSize: float (TStreamerBasicType)

TAttFill (v2)
fFillColor: short (TStreamerBasicType)
fFillStyle: short (TStreamerBasicType)

TAttLine (v2)
fLineColor: short (TStreamerBasicType)
fLineStyle: short (TStreamerBasicType)
fLineWidth: short (TStreamerBasicType)

TString (v2)

TObject (v1)
fUniqueID: unsigned int (TStreamerBasicType)
fBits: unsigned int (TStreamerBasicType)

TNamed (v1): TObject (v1)
fName: TString (TStreamerString)
fTitle: TString (TStreamerString)

TEfficiency (v2): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)
fBeta_alpha: double (TStreamerBasicType)
fBeta_beta: double (TStreamerBasicType)
fBeta_bin_params: vector<pair<double,double> > (TStreamerSTL)
fConfLevel: double (TStreamerBasicType)
fFunctions: TList* (TStreamerObjectPointer)
fPassedHistogram: TH1* (TStreamerObjectPointer)
fStatisticOption: TEfficiency::EStatOption (TStreamerBasicType)
fTotalHistogram: TH1* (TStreamerObjectPointer)
fWeight: double (TStreamerBasicType)
different deserialization procedures (described as a sequence of member data and their types).
But then I would expect the deserialization to return garbage of sorts (e.g. interpreting data for the wrong slots).
However, reading my old hists with ROOT and comparing them to uproot (with the workaround of loading uproot-issue38c.root first), the only difference I see is in the under- and overflow bins, which is just the difference between np.array(root_hist) and uproot_hist.to_numpy().
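To make the under-/overflow point concrete, here is a minimal NumPy-only sketch. The cell values are made up, and `root_cells` is a hypothetical stand-in for the array that `np.array(root_hist)` yields. It assumes the usual ROOT convention of `nbins + 2` cells for a 1D histogram, with underflow first and overflow last.

```python
import numpy as np

# Toy stand-in for a 1D ROOT histogram's cells (values are invented):
# index 0 is underflow, the last index is overflow, the rest are real bins.
nbins = 4
root_cells = np.array([7.0, 1.0, 2.0, 3.0, 4.0, 9.0])
assert len(root_cells) == nbins + 2

# Roughly what np.array(root_hist) gives: every cell, flow bins included.
with_flow = root_cells

# Roughly what uproot_hist.to_numpy() gives by default: in-range bins only.
without_flow = root_cells[1:-1]

# Once the two flow cells are dropped, the contents agree.
same_contents = np.array_equal(with_flow[1:-1], without_flow)
print(with_flow, without_flow, same_contents)
```

If the flow bins are wanted on the Uproot side as well, `to_numpy(flow=True)` should include them, which would make the two arrays directly comparable.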
And my own files show a slight difference (older version of TH1):
29c29
< TH1 (v8): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)
---
> TH1 (v7): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)
51d50
< fStatOverflows: TH1::EStatOverflows (TStreamerBasicType)
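A comparison like the one above can be reproduced with the standard library once the two streamer listings are available as text. This is only a sketch: the two lists are abbreviated, hand-copied excerpts from this thread, and how the listings get captured from `show_streamers()` is left open.

```python
import difflib

# Abbreviated excerpts of the two listings (not the full dumps).
newer = [
    "TH1 (v8): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)",
    "fBinStatErrOpt: TH1::EBinErrorOpt (TStreamerBasicType)",
    "fStatOverflows: TH1::EStatOverflows (TStreamerBasicType)",
]
older = [
    "TH1 (v7): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)",
    "fBinStatErrOpt: TH1::EBinErrorOpt (TStreamerBasicType)",
]

# n=0 drops context lines, so only the differing lines remain in each hunk.
diff = list(difflib.unified_diff(newer, older, fromfile="newer", tofile="older",
                                 lineterm="", n=0))

# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]

for line in changed:
    print(line)
```

The output uses unified-diff markers (`-`/`+`) rather than the `<`/`>` style shown above, but it carries the same information: the class version differs and the older layout lacks `fStatOverflows`.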
Uproot has built-in Models for TH1 (v8), but not for TH1 (v7).
The purpose of this is to avoid having to read TStreamerInfo for the most common/most up-to-date files, but fall back on reading TStreamerInfo if necessary. Reading the file with the TH1 (v7) in it will change the global state of the uproot.classes dict, but reading the file with the TH1 (v8) in it will only change it if it finds that the presumed data layout (the built-in Model) is wrong.
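The fallback described here can be pictured with a toy registry. This is a simplified sketch, not Uproot's implementation: `BUILTIN_LAYOUTS`, `classes`, and `layout_for` are hypothetical names, the member lists are abbreviated, and the real library detects a wrong built-in Model while reading rather than by comparing layouts up front.

```python
# Hypothetical built-in layouts, keyed by (class name, version).
BUILTIN_LAYOUTS = {
    ("TH1", 8): ("fNcells", "fBinStatErrOpt", "fStatOverflows"),
}

# Stand-in for the global uproot.classes dict.
classes = dict(BUILTIN_LAYOUTS)


def layout_for(name, version, file_streamers):
    """Return the layout to use; touch the global registry only if needed."""
    key = (name, version)
    in_file = file_streamers[key]
    if classes.get(key) == in_file:
        return classes[key]      # built-in model agrees: no global change
    classes[key] = in_file       # unknown or mismatched: use the file's streamer
    return in_file


# A file written with TH1 (v8) that matches the built-in layout.
v8_file = {("TH1", 8): ("fNcells", "fBinStatErrOpt", "fStatOverflows")}
# An older file with TH1 (v7), which has no fStatOverflows member.
v7_file = {("TH1", 7): ("fNcells", "fBinStatErrOpt")}

before = dict(classes)
layout_for("TH1", 8, v8_file)
unchanged_after_v8 = classes == before

layout_for("TH1", 7, v7_file)
changed_after_v7 = ("TH1", 7) in classes

print(unchanged_after_v8, changed_after_v7)
```

In this toy, reading the v8 file leaves the registry alone, while reading the v7 file adds an entry. That is why the order in which files are opened can matter.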
I wanted to rewrite my old files with newer ROOT to add the streamers (if there is a way).
Since both files produce output with fp.file.show_streamers(), they both have streamers.
I just checked Uproot's built-in streamer for TH1 (v8), and it's the same as the TH1 (v8) in your file:
TH1 (v8): TNamed (v1), TAttLine (v2), TAttFill (v2), TAttMarker (v2)
fNcells: int (TStreamerBasicType)
fXaxis: TAxis (TStreamerObject)
fYaxis: TAxis (TStreamerObject)
fZaxis: TAxis (TStreamerObject)
fBarOffset: short (TStreamerBasicType)
fBarWidth: short (TStreamerBasicType)
fEntries: double (TStreamerBasicType)
fTsumw: double (TStreamerBasicType)
fTsumw2: double (TStreamerBasicType)
fTsumwx: double (TStreamerBasicType)
fTsumwx2: double (TStreamerBasicType)
fMaximum: double (TStreamerBasicType)
fMinimum: double (TStreamerBasicType)
fNormFactor: double (TStreamerBasicType)
fContour: TArrayD (TStreamerObjectAny)
fSumw2: TArrayD (TStreamerObjectAny)
fOption: TString (TStreamerString)
fFunctions: TList* (TStreamerObjectPointer)
fBufferSize: int (TStreamerBasicType)
fBuffer: double* (TStreamerBasicPointer)
fBinStatErrOpt: TH1::EBinErrorOpt (TStreamerBasicType)
fStatOverflows: TH1::EStatOverflows (TStreamerBasicType)
So whether it tries to read your file with the TH1 (v8) in it first or last, it does not change global state because the Model in uproot.classes before reading the file agrees with the TStreamerInfo in the file.
But reading the file with the TH1 (v7) in it does change uproot.classes, since it doesn't know about v7, so it has to look in the file's TStreamerInfo and update uproot.classes.
Finding this issue from here. Is it still the case that uproot is unable to read a branch of TH1Ds? Are there any workarounds that do not involve reading the files with ROOT to extract the TH1D information (which defeats the point of using uproot in the first place)?
I don't think it's TH1D specifically; one of our unit tests reads a TTree TBranch of TH1F.
I don't know what determines whether ROOT writes the objects with memberwise splitting or not, but we can support one and not the other. (Memberwise splitting is very different and will require another deep dive into reverse-engineering the binary format.)
Okay, so I guess for some unknown reason the branch of TH1D is being written with memberwise splitting in the files I'm using. For now will have to work with ROOT directly then, at least as far as converting / rewriting into a format uproot can read.
Issue #1190 has an example of TH1Fs whose TAxis members are memberwise split and TH2Fs whose TAxis members are not memberwise split.
That will be useful to anyone who is willing to try to implement memberwise splitting.
Consider the weird serialization in scikit-hep/uproot#373, scikit-hep/uproot#374, scikit-hep/uproot#403, scikit-hep/uproot#475, and scikit-hep/uproot#495. It's field-at-a-time inside of each entry. I had thought it was Boost-inside-ROOT, but not for most of the above. It may be some ROOT serialization mode that I'm unaware of.
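As a rough picture of what "field-at-a-time" means, the sketch below packs the same two toy objects in both orders. It is not ROOT's actual memberwise format, which includes headers and version information not modeled here. The field names and values are invented, and big-endian packing is used only because ROOT files are big-endian.

```python
import struct

# Two toy objects, each with an int id and a double x (hypothetical fields).
points = [(1, 0.5), (2, 1.5)]

# Object-at-a-time: all members of object 0, then all members of object 1.
objectwise = b"".join(struct.pack(">id", i, x) for i, x in points)

# Field-at-a-time: every id first, then every x.
memberwise = (struct.pack(">2i", *(i for i, _ in points))
              + struct.pack(">2d", *(x for _, x in points)))

# Reading the field-at-a-time bytes with the matching layout recovers the data.
ids = struct.unpack_from(">2i", memberwise, 0)
xs = struct.unpack_from(">2d", memberwise, 8)

# Reading them as if they were object-at-a-time misinterprets the second field.
wrong = struct.unpack_from(">id", memberwise, 0)

print(len(objectwise), len(memberwise), ids, xs, wrong)
```

Both buffers hold the same bytes in a different order, so a reader that assumes the wrong layout does not necessarily fail; it can silently put data in the wrong slots.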