scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

Unimplemented TStreamerSTL map<long,int>, set<long>? #283

Closed rcross2 closed 5 years ago

rcross2 commented 5 years ago

I'm trying to read a ROOT file and I can't seem to access a custom class. I get an Unimplemented streamer type: TStreamerSTL error.

I think it's having trouble with the map<long,int> element, and if it gets past that I'm sure it will have trouble with the set<long>.

I've tried everything I can really think of to access these elements, really digging into the source code, but I don't understand the ROOT I/O format well enough to get anywhere.

Is there any way I can access this? I'm trying to make a script that transcribes this entire file, and this is the last element I need to port to uproot so that reading our data no longer depends on installing ROOT and our custom class libraries.

Thank you very much for this package -- it has helped cut down ROOT file read times in python significantly!

tsi = c._context.streamerinfosmap[b'I3Eval_t']
tsi.show()
c._context.streamerinfosmap[b'TStreamerSTL'].show()
c['detector']

StreamerInfo for class: I3Eval_t, version=6, checksum=0x83729bdb
  TObject         BASE            offset=  0 type=66 Basic ROOT object
  NumberOfChannels int             offset=  0 type= 3 number of booked channels
  mGPSCardId      int             offset=  0 type= 3 GPS board ID (must be '0')
  mGPSPrescale    int             offset=  0 type= 3 the slices time width in 1/10 ns !!!
  mScalerCardId   int             offset=  0 type= 3 ID of the scaler board the all channel belongs to (must be '0')
  mScalerStartChannel int             offset=  0 type= 3 start position in array for later reading of data file
  MaxChannels     int             offset=  0 type= 6 maximal number of channels
  mMaxJitterLogs  int             offset=  0 type= 3 maximal number of logMessages written to logfile in case of clock jitter
  ChannelIDMap    map<long,int>   offset=  0 type=500 
  BadChannelIDSet set<long>       offset=  0 type=500 
  ChannelID       long*           offset=  0 type=44 [MaxChannels]
  Deadtime        double*         offset=  0 type=48 [MaxChannels]
  Efficiency      double*         offset=  0 type=48 [MaxChannels]

StreamerInfo for class: TStreamerSTL, version=3, checksum=0x8178ac3d
  TStreamerElement BASE            offset=  0 type= 0 Base class for one element (data member) to be Streamed
  fSTLtype        int             offset=  0 type= 3 type of STL vector
  fCtype          int             offset=  0 type= 3 STL contained type

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-104-d5db0b55468e> in <module>()
      3 tsi.show()
      4 c._context.streamerinfosmap[b'TStreamerSTL'].show()
----> 5 c['detector']
      6 
      7 for cls in c.iterclasses():

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in __getitem__(self, name)
    210 
    211     def __getitem__(self, name):
--> 212         return self.get(name)
    213 
    214     def __len__(self):

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in get(self, name, cycle)
    328 
    329             if last is not None:
--> 330                 return last.get()
    331             elif cycle is None:
    332                 raise _KeyError("not found: {0}\n in file: {1}".format(repr(name), self._context.sourcepath))

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in get(self, dismiss)
    888 
    889         try:
--> 890             return _classof(self._context, self._fClassName).read(self._source, self._cursor.copied(), self._context, self)
    891         finally:
    892             if dismiss:

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in read(cls, source, cursor, context, parent)
    830             context = context.copy()
    831         out = cls.__new__(cls)
--> 832         out = cls._readinto(out, source, cursor, context, parent)
    833         out._postprocess(source, cursor, context, parent)
    834         return out

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in _readinto(cls, self, source, cursor, context, parent, asclass)

~/.venv3/lib/python3.5/site-packages/uproot/rootio.py in _raise_notimplemented(streamertype, streamerdict, source, cursor)
    626 
    627 def _raise_notimplemented(streamertype, streamerdict, source, cursor):
--> 628     raise NotImplementedError("\n\nUnimplemented streamer type: {0}\n\nmembers: {1}\n\nfile contents:\n\n{2}".format(streamertype, streamerdict, cursor.hexdump(source)))
    629 
    630 def _resolveversion(cls, self, classversion):

NotImplementedError: 

Unimplemented streamer type: TStreamerSTL

members: {'_fMaxIndex': array([0, 0, 0, 0, 0], dtype=int32), '_fXmin': 0.0, '_fName': b'ChannelIDMap', '_fArrayLength': 0, '_fXmax': 0.0, '_fCtype': 4, '_classversion': 4, '_fSTLtype': 4, '_fTypeName': b'map<long,int>', '_fTitle': b'', '_fOffset': 0, '_fSize': 48, '_fFactor': 0.0, '_fArrayDim': 0, '_fType': 500}

file contents:

00000054  40 00 f1 ec 40 09 00 00  cb da aa 44 00 00 14 28  |@...@......D...(|
00000074  00 00 00 0a da 54 40 68  00 00 00 0b 92 9e 85 40  |.....T@h.......@|
00000114  00 00 00 16 e4 7e d4 7d  00 00 00 1a 68 30 55 54  |.....~.}....h0UT|
00000134  00 00 00 24 30 b7 b5 56  00 00 00 38 88 25 b7 a6  |...$0..V...8.%..|
00000154  00 00 00 3f 05 16 a9 7d  00 00 00 45 be d2 e4 d1  |...?...}...E....|
00000174  00 00 00 67 14 f8 51 d4  00 00 00 79 dc 9a f5 e0  |...g..Q....y....|
00000214  00 00 00 88 40 a4 47 15  00 00 00 8b 8c d2 4d 58  |....@.G.......MX|
00000234  00 00 00 8f 8f 9c fb 53  00 00 00 91 b9 99 b3 a1  |.......S........|
00000254  00 00 00 b9 db da 4c ff  00 00 00 bf 43 2c 9b f4  |......L.....C,..|
00000274  00 00 00 cc 71 21 31 7b  00 00 00 e9 32 c6 7c 2d  |....q!1{....2.|-|

jpivarski commented 5 years ago

Short answer: STL maps and sets have indeed not been implemented. However, they may be doable, particularly since the content types are so simple (ints and longs). Could you post the file?

rcross2 commented 5 years ago

I will see if I can generate a minimal file using our classes that recreates the error. Thanks for the fast response!

rcross2 commented 5 years ago

@jpivarski https://send.firefox.com/download/4a96c974b069e1df/#-m-Z7nP80Q3Ca4su4vbK3A

This link is good for 1 download, let me know if you have trouble grabbing it.

jpivarski commented 5 years ago

It took me a moment to realize that you're trying to find this data in a non-TTree object. If PyROOT will work for you, you'll probably want to use that.

Nevertheless, I looked into it. The thing that is giving me the most trouble isn't in your printed error output; it's the Channel (I3Eval_t::ChannelContainer_t*) that appears before the map<long,int> in the new version of your software, which is in the file you sent me but not in the one you've been working with. It's a case where a new class type is introduced but no data.

After that, I believe that the serialization of the map<long,int> is 4 bytes ???, 8 bytes key, 4 bytes value. The keys and values are sorted, and there are a lot of them, like 3870 or so? It's the whole geometry of your detector. (IceCube? Are you looking for supernovae?) To figure out this new type, I'd need some guidance about what to expect—asking questions, like what numbers seem reasonable, etc. That process is doable.
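As a rough illustration of the layout guessed above (4 unknown bytes, an 8-byte key, a 4-byte value), fixed-width records like that could be decoded with a NumPy structured dtype. This is only a sketch: the field names, the `decode_records` helper, and the sample bytes are made up for illustration, and the real byte layout of this file has not been confirmed. The one thing it relies on is that ROOT serializes numbers big-endian.

```python
import struct

import numpy as np

# Hypothetical record layout following the guess above:
# 4 bytes of unknown meaning, an 8-byte key, a 4-byte value, all big-endian.
RECORD = np.dtype([
    ("unknown", ">u4"),
    ("key", ">i8"),
    ("value", ">i4"),
])


def decode_records(raw, count):
    """Interpret `count` consecutive 16-byte records in `raw` (hypothetical helper)."""
    return np.frombuffer(raw, dtype=RECORD, count=count)


# Synthetic bytes for demonstration only, NOT taken from the real file.
pairs = [(10, 1), (11, 2), (22, 3)]
sample = b"".join(struct.pack(">Iqi", 0, key, value) for key, value in pairs)

records = decode_records(sample, len(pairs))
print(records["key"].tolist(), records["value"].tolist())
```

If the guess is wrong (for example, if the unknown 4 bytes are not repeated per entry), the decoded keys would not come out sorted, which is a quick sanity check on any candidate layout.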

But before getting into that, and seeing that the new Channel is itself an issue, I'd like to ask again, do you really need uproot to read this? Isn't this readable with PyROOT? And even if there are thousands of geometry elements, that's not a very large number—the performance of PyROOT probably isn't an issue.

If you have control over how this file gets written (to avoid the Channel, or at least to put it later in the class, for instance), then you might as well put the data in a TTree, where it would be easy to read. So I'm stopping now to be sure that you really need it.

Thanks!

rcross2 commented 5 years ago

Yeah this is a touch more confusing than I thought. This is some really legacy data and through the years there have been some mistakes when editing the streamer classes.

So I can access the data with PyROOT, but I have everything else written in uproot; it's just this one little structure I can't get. A real barrier for people to analyze our data is the installation of ROOT and the custom class libraries. We are trying to move the data to hdf5 at some point, but we're not there yet. I discovered your library and everyone that I've talked to about it is ecstatic that someone has re-implemented ROOT I/O in Python. I just discovered it yesterday and we are all getting a lot of use out of it.

As for the issues with this file (I've re-written this a couple times as I am trying to work out what on earth they were doing when they designed this structure):

So in reality I think I can just grab this data if only I could load the config/detector class and ignore the pieces I don't need... unless the pieces I don't need affect where and how the other data are packed.

The keys and values are sorted, and there are a lot of them, like 3870 [5160] or so? It's the whole geometry of your detector. (IceCube? [yep] Are you looking for supernovae? [yep])

jpivarski commented 5 years ago

Thanks for the information. The I3Eval_t::ChannelContainer_t* is particularly confusing, from the raw bytes end of things. I saw the type name in the data, but it was a zero-terminated string, rather than a size-followed-by-data string. Zero-terminated strings happen in only one place in ROOT I/O—a new class tag—but the Channels had at that point already been read in as nullptr (legally, too). So that was the first time I've encountered a byte pattern like that. The fact that it happened by mistake, some unintended confusion of streamers, explains a lot.
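To make the two string encodings concrete, here is a minimal sketch of reading each from a byte buffer. The helper names are hypothetical and this is not uproot's actual code; it assumes the usual ROOT convention that a size-prefixed string starts with a 1-byte length, with 255 signalling that a 4-byte big-endian length follows.

```python
import struct


def read_sized_string(buf, pos):
    """Size-followed-by-data string: returns (bytes, position after the string)."""
    length = buf[pos]
    pos += 1
    if length == 255:
        # Long strings: the real length is in the next 4 bytes, big-endian.
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
    return buf[pos:pos + length], pos + length


def read_cstring(buf, pos):
    """Zero-terminated string: returns (bytes, position after the terminator)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end], end + 1


# Synthetic buffers for illustration only.
sized = b"\x0cChannelIDMap"
zero_terminated = b"I3Eval_t::ChannelContainer_t*\x00"

print(read_sized_string(sized, 0))
print(read_cstring(zero_terminated, 0))
```

A reader that expects one encoding and meets the other will misjudge where the string ends, so every field after it is read from the wrong offset.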

These complications wouldn't be a show-stopper, but the fact that different batches of your data are serialized differently has the potential of becoming a deep rabbit hole. We might get it working on one case, then keep encountering others until we give up later. (Better to give up early! :)

Since you only want some numbers and a mapping from numbers to numbers in config/detector, maybe you could introduce HDF5 files for just the geometry and bring over the measurement data later. Since this object is describing geometry, it's more metadata than data. (Does it even change? If there's only a few thousand values for your entire dataset—heck, it could be JSON. CMS "good run/lumi section" lists are in JSON.)
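For the JSON option, a minimal sketch could look like the following. The function names are hypothetical; `ChannelIDMap` and `BadChannelIDSet` are just the member names from the streamer info above, and the values are made up. The one wrinkle is that JSON object keys must be strings, so the integer channel IDs are converted on the way out and back.

```python
import json


def geometry_to_json(channel_id_map, bad_channel_ids):
    """Serialize a {long: int} map and a set of longs to a JSON string."""
    payload = {
        # JSON object keys must be strings, so convert the integer IDs.
        "ChannelIDMap": {str(k): int(v) for k, v in channel_id_map.items()},
        # JSON has no set type; store a sorted list instead.
        "BadChannelIDSet": sorted(int(k) for k in bad_channel_ids),
    }
    return json.dumps(payload, sort_keys=True)


def geometry_from_json(text):
    """Inverse of geometry_to_json: restore integer keys and the set."""
    payload = json.loads(text)
    channel_id_map = {int(k): v for k, v in payload["ChannelIDMap"].items()}
    return channel_id_map, set(payload["BadChannelIDSet"])


# Made-up values for illustration.
text = geometry_to_json({1001: 0, 1002: 1}, {1002})
restored_map, restored_bad = geometry_from_json(text)
print(restored_map, restored_bad)
```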

If you do use HDF5 to describe the map<long, int>, consider using a sparse array. The COO representation in one dimension (vector, not matrix) is just a sorted (or otherwise indexed) array of channel IDs (long) and a corresponding array of the values they map to (int). An attempt to access sparsearray[channel_id] should do a log-N bisection search (or other smart index lookup) for the channel_id, find it, and return the corresponding value. If the lookup value isn't really a channel, it would return 0 because it's sparse.
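The lookup described above can be sketched with two parallel NumPy arrays and `np.searchsorted`, which does the bisection. The `SparseLookup` class is a hypothetical name for illustration, not something HDF5 or uproot provides, and the channel IDs are made up.

```python
import numpy as np


class SparseLookup:
    """One-dimensional COO-style map: sorted keys plus a parallel array of values."""

    def __init__(self, keys, values):
        keys = np.asarray(keys, dtype=np.int64)
        values = np.asarray(values, dtype=np.int32)
        order = np.argsort(keys)  # sort once so lookups can bisect
        self.keys = keys[order]
        self.values = values[order]

    def __getitem__(self, channel_id):
        # Bisection search: O(log N) in the number of stored channels.
        i = np.searchsorted(self.keys, channel_id)
        if i < len(self.keys) and self.keys[i] == channel_id:
            return int(self.values[i])
        return 0  # not a stored channel: sparse default


# Made-up channel IDs for illustration.
lookup = SparseLookup([30, 10, 20], [3, 1, 2])
print(lookup[20], lookup[25])
```

The two arrays are also exactly what would be written to disk, for example as two HDF5 datasets of equal length.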

I don't know if this is what HDF5 does, exactly, but HDF5 is big on sparse arrays and it's one of my favorite hacks to reinterpret sparseness as integer lookup.

Good luck finding the next 1987A!

rcross2 commented 5 years ago

Ah well, you're right, it's not a show-stopper, the structures are still readable with PyROOT. We should be able to deal with it! Thanks for the help :) We will surely spread the word about uproot :+1:

sbinet commented 5 years ago

I apologize for the intrusion but I'd be willing to give this a try for Go-HEP and Groot (the other library that reads ROOT files w/o ROOT.)

Would you mind sending me a link to the file?

jpivarski commented 5 years ago

Good luck, @sbinet! An uproot+Groot (or just Groot) based workflow also solves the problem of installation for new users. I don't know how well Python and Go mix (if the final workflow doesn't use pure Go), but there's probably a good bridge out there somewhere.

As a suggestion, if possible: try to get both versions of the file, with and without the I3Eval_t::ChannelContainer_t* field. I hope it works out for you!

And now that I'm thinking about installation difficulty (@rcross2's cited reason against ROOT), note that ROOT can be installed through Conda now, too (in the conda-forge channel).

sbinet commented 5 years ago

@jpivarski creating a CPython{2,3} (or PyPy) module from a Go package is relatively easy thanks to go-python/gopy (a SWIG-like code generator command for Go). The generated Python extension module will only need libc and ctypes. see:

but, to reiterate: I'd like to see whether groot performs. could you (@rcross2 ) send me a link to that file? (or a file that exhibits the same issue.) thanks.