scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Fields of a struct interpreted with the type of their siblings, and more cases of memberwise splitting #268

Closed shahidzk1 closed 9 months ago

shahidzk1 commented 3 years ago

I have a complicated tree with branches and sub-branches. Uproot recognizes and reads some of the branches but not others, for example the momentum branch fPx. Screenshot from 2021-02-15 18-09-58

PyROOT shows that they are Double32_t type branches. Screenshot from 2021-02-15 18-13-08

Could you help me convert these sub-branches to arrays?

jpivarski commented 3 years ago

These are variable-length arrays of Double32, but that ought to be supported already. Please include an example ROOT file and I'll look into what's preventing it from being identified.

You can attach a small ROOT file to a GitHub issue by renaming the extension to .txt first. (GitHub doesn't like .root as a filename extension, but doesn't complain about text files containing binary data.)

shahidzk1 commented 3 years ago

Okay, I have attached a similar file, in which an analysis tree has channels_ under the VtxTracks branch. Screenshot from 2021-02-16 10-10-15

But uproot doesn't recognize it at all. Screenshot from 2021-02-16 10-10-54

The file is larger than 50 MB, so I have stored it on the following Google Drive; please download it from there: https://drive.google.com/drive/folders/1dXSFcRlWXvzeIUbNMSZQEUaeWSBy75_J?usp=sharing

jpivarski commented 3 years ago

Okay, first off, you're using Uproot 3 and Uproot 4 is more complete in terms of type coverage. (I should have recognized Uproot 3's show format from your screenshots.) pip install -U uproot should do it, or maybe you need to require "uproot>=4.0.0" to convince pip to give you the latest version. There are significant differences between 3.x and 4.x.
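Since an upgrade only helps if the interpreter you run actually picks up the new package, a quick check of the installed major version can save some confusion. The following is a minimal sketch using only the standard library; the helper name `installed_major` is hypothetical and is not part of Uproot.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major(package):
    """Return the major version of an installed package, or None if absent.

    Hypothetical helper for illustration; it only inspects the environment
    of the interpreter that runs it.
    """
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


major = installed_major("uproot")
if major is None:
    print("uproot is not installed for this interpreter")
elif major < 4:
    print(f"uproot {major}.x found; upgrade with: pip install -U 'uproot>=4.0.0'")
else:
    print(f"uproot {major}.x found")
```

Run it with the same interpreter (or notebook kernel) that runs the analysis, because a different environment may hold a different version.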

Most of the examples with None in your screenshots are actually groups of branches that do not have any data in themselves—ROOT only puts them there for structure. Nevertheless, Uproot 4 interprets these non-data branches by reading all of their subbranches and presenting them as some kind of group. (I.e. if library="ak" (default), then the group is an Awkward record array, if library="np", then the group is a dict of NumPy arrays, if library="pd", then the group is a Pandas DataFrame instead of a Series.)

I did manage to find a bug: some of your vector<float> and vector<bool> were being interpreted as vector<int> because they all have the same name, field_, in their C++ structure and the interpretation took the first name that matched. I've tightened that rule by requiring the field name and the parent name to both match, and that tighter rule is satisfied by all other ROOT files in my tests. (Fixed in #272.)
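To illustrate why matching on the field name alone is ambiguous, here is a toy sketch. It is not Uproot's actual code (see #272 for that): the table `MEMBERS`, the parent names, and both lookup functions are hypothetical and exist only to show the difference between the two rules.

```python
# Hypothetical table of (parent_name, field_name, type_name) entries.
# Three different structs each have a member called "field_".
MEMBERS = [
    ("IntContainer", "field_", "vector<int>"),
    ("FloatContainer", "field_", "vector<float>"),
    ("BoolContainer", "field_", "vector<bool>"),
]


def lookup_by_name(field_name):
    """Loose rule: return the type of the first entry whose name matches."""
    for _parent, name, type_name in MEMBERS:
        if name == field_name:
            return type_name
    return None


def lookup_by_parent_and_name(parent_name, field_name):
    """Tighter rule: both the parent name and the field name must match."""
    for parent, name, type_name in MEMBERS:
        if parent == parent_name and name == field_name:
            return type_name
    return None
```

With the loose rule, every `field_` resolves to the type of whichever sibling comes first; with the tighter rule, `lookup_by_parent_and_name("FloatContainer", "field_")` resolves to that struct's own type.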

Finally, scanning all the branches with

>>> for x in tree.keys():
...   try:
...     a = tree[x].array()
...   except Exception as err:
...     print(x, type(err), str(err))
...   else:
...     print(x, a)

the only failures left are branches with ROOT's "memberwise splitting." This is a feature I only found out about after the transition to Uproot 4, and it's a big to-do item: #38. For years, I had occasionally run into ROOT files with this strange serialization and it took a long time to even realize that it's a ROOT feature—I thought people were using custom streamers. One was (scikit-hep/uproot3#373); they were using Boost.Serialization, which biased me to think that they all were.

Now, though, I've found the bit that indicates that an object is serialized in a memberwise format so that I can figure out how to deserialize it, and in the meantime raise NotImplementedError. Your file has a lot of memberwise data, mostly std::vectors, which would be a good start because they're fairly simple. However (see #38), it's only the 7th file I've ever encountered with this feature. It's rare!
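The scan above can be wrapped in a small function that separates the not-yet-implemented cases from other failures. This is a sketch, assuming only that `tree` has `keys()` and that `tree[name].array()` reads a branch, as in the loop above; the function name `scan_branches` is hypothetical.

```python
def scan_branches(tree):
    """Try to read every branch and sort the names by outcome.

    Returns (readable, unsupported, failed):
      readable    - names whose array() call succeeded
      unsupported - names that raised NotImplementedError
      failed      - dict of name -> "ErrorType: message" for anything else
    """
    readable, unsupported, failed = [], [], {}
    for name in tree.keys():
        try:
            tree[name].array()  # result is discarded to keep memory use low
        except NotImplementedError:
            unsupported.append(name)
        except Exception as err:
            failed[name] = f"{type(err).__name__}: {err}"
        else:
            readable.append(name)
    return readable, unsupported, failed
```

Anything that ends up in `failed` rather than `unsupported` is more likely to be a bug worth reporting.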

shahidzk1 commented 3 years ago

@jpivarski thank you very much for your in-depth overview of my file.

  1. I upgraded to Uproot 4 using pip install -U uproot. It was already installed but somehow wasn't being used.

  2. These sub-branches have leaves inside them, and I want to load them using uproot. Could you tell me how I can do that? For example, the leaf px_ Screenshot from 2021-02-17 10-28-17

  3. I tried with this code Screenshot from 2021-02-17 10-40-55

which gives the following error:

Screenshot from 2021-02-17 10-41-10

  4. This also doesn't help Screenshot from 2021-02-17 10-43-10

shahidzk1 commented 3 years ago

@viktorklochkov What do you think?

viktorklochkov commented 3 years ago

Maybe you could also try to update CbmRoot and test with some new files. I've simplified the format in the last version, maybe that will help.

jpivarski commented 3 years ago

"awkward has no attribute 'layout'" suggests that you have Awkward version 0.x, when you want Awkward 1.x. Try pip install -U awkward to update Awkward, as you have already updated Uproot.

shahidzk1 commented 3 years ago

@jpivarski also what happened to parallel processing in uproot 4 arrays? Screenshot from 2021-02-17 14-04-30

shahidzk1 commented 3 years ago

@jpivarski installing awkward only changed the error message Screenshot from 2021-02-17 14-14-29 Screenshot from 2021-02-17 14-14-42

jpivarski commented 3 years ago

For parallel processing, the arguments are named decompression_executor and interpretation_executor:

https://uproot.readthedocs.io/en/latest/uproot.behaviors.TTree.TTree.html#arrays

You can also pass in decompression_executor and interpretation_executor when you open a file:

https://uproot.readthedocs.io/en/latest/uproot.reading.open.html

so that it applies to all arrays from that file.

That website has all of the argument lists of all functions. The decompression_executor is used to parallelize decompression, which is often useful because the zlib, lz4, lzma, and zstd libraries release the Python GIL when they run. The interpretation_executor is used to parallelize interpretation, which is converting an uncompressed buffer of bytes into arrays. In most cases, this uses Python code and doesn't parallelize well.
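The reason threads help with decompression can be seen without Uproot at all. The sketch below uses only the standard library: it decompresses several zlib blobs through a `ThreadPoolExecutor`, which is the same kind of object you would pass as decompression_executor. The helper `decompress_all` and the synthetic payloads are assumptions made for illustration, not Uproot internals.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_all(blobs, executor=None):
    """Decompress each blob, in worker threads if an executor is given."""
    if executor is None:
        return [zlib.decompress(blob) for blob in blobs]
    # executor.map preserves input order, so results line up with blobs.
    return list(executor.map(zlib.decompress, blobs))


# Synthetic stand-ins for compressed chunks of a file.
payloads = [bytes([i]) * 1_000_000 for i in range(8)]
blobs = [zlib.compress(payload) for payload in payloads]

sequential = decompress_all(blobs)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = decompress_all(blobs, executor=pool)
```

Both calls return the same data; only the scheduling differs. Note that all decompressed chunks are held in memory at once either way, and more threads means more chunks in flight at the same time.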

As for the "NotImplementedError: memberwise serialization" error message, that's what I was talking about above: issue #38 is for me to solve and implement memberwise deserialization. Uproot 3 couldn't read that data layout, either—in fact, it was a mystery to me, as it appeared in only a small number of files over the years. It wasn't until July 2020 (the date of that issue) that I found out it's a ROOT feature, an alternate way of writing files that only a few people have ever switched on.

shahidzk1 commented 3 years ago

@jpivarski then I have been playing with the right parameters but the problem is that before I was able to do parallel processing and it was also memory efficient.

`from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(8)

branches = pd.DataFrame.from_dict(uproot.open(''+file_with_path+'')[''+tree_name+''].arrays(namedecode='utf-8', executor=executor))`

But now it consumes all my memory; maybe I am not doing it properly. Could you please have a look at it?

`from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(8)

input_tree = uproot.open('/home/shahid/cbmsoft/Data/10k_events_PFSimplePlainTree.root:PlainTree', decompression_executor=executor)

branches = input_tree.arrays(library='pd', decompression_executor=executor)`

jpivarski commented 3 years ago

In this thread, we've been talking about several different files. The one you provided couldn't have been read by Uproot 3, so you're not talking about that one. (The one you posted above has trees named "Configuration," "aTree," and "DataHeader," not "PlainTree.")

But anyway, I assume that what you mean by "memory efficient" is that it worked in Uproot 3 but is running out of memory in Uproot 4. From all that I know right now, the difference might be some 10% but it's 10% more than you have available. The whole idea of parallel processing is to trade memory for speed, since you're asking for 8 times the working space to be used at once. Maybe Uproot 3 wasn't as well parallelized—the biggest 3 → 4 difference is that the low-level physical layer (getting bytes from files) was streamlined to use knowledge of which TBaskets you will be reading to request all the bytes while others are in flight—Uproot 3 could have been prevented from using 8 times the working memory, as requested, because it was waiting for data from the file. (Without replicating the process in some performance-tuning diagnostic, we can only speculate about what's actually happening, but using more memory while parallel processing can be good news, rather than bad.)

The uproot.TTree.arrays method that you're calling isn't specifying filter_name or expressions, so it's reading everything into memory—do you need all branches in the DataFrame? That kind of question is often more fruitful than carefully tuning performance with a memory profiler. If you only use half the variables, that's an easy factor of two.
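As a minimal sketch of reading less: the wrapper below forwards a list of name patterns to filter_name and optionally limits the number of entries with entry_stop, both of which are documented arguments of uproot.TTree.arrays. The wrapper name `read_selected` and the example patterns are hypothetical; substitute the branch names you actually need.

```python
def read_selected(tree, patterns, entry_stop=None, library="np"):
    """Read only the branches whose names match one of the patterns.

    tree       - an object with an arrays() method, such as an uproot TTree
    patterns   - list of branch names or glob patterns for filter_name
    entry_stop - optional upper bound on the number of entries to read
    library    - "np", "ak", or "pd", as accepted by arrays()
    """
    return tree.arrays(
        filter_name=patterns,
        entry_stop=entry_stop,
        library=library,
    )
```

For example, `read_selected(input_tree, ["*px*", "*py*"], library="pd")` would build a DataFrame from only the matching branches (the patterns here are placeholders), and adding `entry_stop=1000` is a cheap way to test on a small slice first.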

This is becoming a discussion that is unrelated to your original problem of not being able to read certain branches. Maybe ask on a GitHub Discussion or StackOverflow with the [uproot] tag? (I'm hoping other users will be able to help with these usage questions.)

jpivarski commented 9 months ago

It looks like this issue was taken up in #277 and solved. If I'm mistaken, please open a new issue with whatever is still remaining to be solved. Thanks!