Closed shahidzk1 closed 9 months ago
These are variable-length arrays of Double32, but that ought to be supported already. Please include an example ROOT file and I'll look into what's preventing it from being identified.
You can attach a small ROOT file to a GitHub issue by renaming the extension to `.txt` first. (GitHub doesn't like `.root` as a filename extension, but doesn't complain about text files containing binary data.)
Okay, I have attached a similar file, in which an analysis tree has `channels_` under the VtxTracks branch, but uproot doesn't recognize it at all.
The file is larger than 50 MB, so I have stored it on Google Drive; kindly download it from https://drive.google.com/drive/folders/1dXSFcRlWXvzeIUbNMSZQEUaeWSBy75_J?usp=sharing
Okay, first off, you're using Uproot 3, and Uproot 4 is more complete in terms of type coverage. (I should have recognized Uproot 3's `show` format from your screenshots.) `pip install -U uproot` should do it, or maybe you need to require `"uproot>=4.0.0"` to convince pip to give you the latest version. There are significant differences between 3.x and 4.x.
Most of the examples with `None` in your screenshots are actually groups of branches that do not have any data in themselves; ROOT only puts them there for structure. Nevertheless, Uproot 4 interprets these non-data branches by reading all of their subbranches and presenting them as some kind of group. (I.e. if `library="ak"` (default), then the group is an Awkward record array; if `library="np"`, then the group is a dict of NumPy arrays; if `library="pd"`, then the group is a Pandas DataFrame instead of a Series.)
I did manage to find a bug: some of your `vector<float>` and `vector<bool>` were being interpreted as `vector<int>` because they all have the same name, `field_`, in their C++ structure and the interpretation took the first name that matched. I've tightened that rule by requiring the field name and the parent name to both match, and that tighter rule is satisfied by all other ROOT files in my tests. (Fixed in #272.)
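As a toy illustration of why matching on the field name alone can pick the wrong type, here is a minimal sketch. It is not Uproot's actual code: the table `RULES`, the parent names, and the functions `first_name_match` and `strict_match` are all hypothetical and exist only to show the difference between the two rules.

```python
# Hypothetical lookup table: (parent name, field name, interpreted type).
# None of these names come from Uproot; they only illustrate the idea.
RULES = [
    ("IntContainer", "field_", "vector<int>"),
    ("FloatContainer", "field_", "vector<float>"),
    ("BoolContainer", "field_", "vector<bool>"),
]


def first_name_match(field):
    """Loose rule: return the type of the first entry whose field name matches."""
    for _parent, name, typename in RULES:
        if name == field:
            return typename
    return None


def strict_match(parent, field):
    """Tighter rule: the parent name and the field name must both match."""
    for rule_parent, name, typename in RULES:
        if rule_parent == parent and name == field:
            return typename
    return None


# The loose rule always lands on the first "field_" entry,
# whichever container the field actually belongs to.
print(first_name_match("field_"))
# The tighter rule distinguishes the containers.
print(strict_match("FloatContainer", "field_"))
```

With the loose rule every `field_` resolves to the first entry in the table, which is the kind of misidentification described above; adding the parent name to the key removes the ambiguity.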
Finally, scanning all the branches with
>>> for x in tree.keys():
...     try:
...         a = tree[x].array()
...     except Exception as err:
...         print(x, type(err), str(err))
...     else:
...         print(x, a)
the only failures left are branches with ROOT's "memberwise splitting." This is a feature I only found out about after the transition to Uproot 4, and it's a big to-do item: #38. For years, I had occasionally run into ROOT files with this strange serialization and it took a long time to even realize that it's a ROOT feature—I thought people were using custom streamers. One was (scikit-hep/uproot3#373); they were using Boost.Serialization, which biased me to think that they all were.
Now, though, I've found the bit that indicates that an object is serialized in a memberwise format so that I can figure out how to deserialize it, and in the meantime raise NotImplementedError. Your file has a lot of memberwise data, mostly `std::vector`s, which would be a good start because they're fairly simple. However (see #38), it's only the 7th file I've ever encountered with this feature. It's rare!
@jpivarski thank you very much for your in-depth overview of my file.
I upgraded to Uproot 4 using `pip install -U uproot`; it was already installed but somehow wasn't being used.
These sub-branches have leaves inside them, and I want to load them using uproot. Could you tell me how I can do that? For example, the leaf `px_`.
tried with this code
gives the following error
@viktorklochkov What do you think?
Maybe you could also try to update CbmRoot and test with some new files. I've simplified the format in the last version, maybe that will help.
"awkward has no attribute 'layout'" suggests that you have Awkward version 0.x when you want Awkward 1.x. Try `pip install -U awkward` to update Awkward, as you have already updated Uproot.
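Since a package can be installed in one environment while the script runs in another, it can help to check the versions from inside the same interpreter that runs the analysis. The following is a minimal sketch using only the standard library; the helper name `installed_major` is made up for this example, and the expected major versions (4 for Uproot, 1 for Awkward) are the ones discussed in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major(package):
    """Return the major version of an installed package, or None if it is absent."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


# Major versions recommended in this thread: Uproot 4.x and Awkward 1.x.
for name, wanted in [("uproot", 4), ("awkward", 1)]:
    major = installed_major(name)
    if major is None:
        print(f"{name}: not installed in this environment")
    elif major < wanted:
        print(f"{name}: major version {major} found, expected at least {wanted}")
    else:
        print(f"{name}: major version {major} is fine")
```

If this reports an older major version than `pip` claims to have installed, the script is probably running under a different Python environment than the one `pip` updated.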
@jpivarski also what happened to parallel processing in uproot 4 arrays?
@jpivarski installing awkward only changed the error message
For parallel processing, the arguments are named `decompression_executor` and `interpretation_executor`:
https://uproot.readthedocs.io/en/latest/uproot.behaviors.TTree.TTree.html#arrays
You can also pass in `decompression_executor` and `interpretation_executor` when you open a file:
https://uproot.readthedocs.io/en/latest/uproot.reading.open.html
so that it applies to all arrays from that file.
That website has all of the argument lists of all functions. The `decompression_executor` is used to parallelize decompression, which is often useful because the `zlib`, `lz4`, `lzma`, and `zstd` libraries release the Python GIL when they run. The `interpretation_executor` is used to parallelize interpretation, which is converting an uncompressed buffer of bytes into arrays. In most cases, this uses Python code and doesn't parallelize well.
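To make the point about decompression concrete, here is a small standalone sketch that decompresses several independently compressed chunks in a thread pool, using only the standard library. It does not use Uproot at all: the `chunks` list is a made-up stand-in for compressed baskets, and no speedup is claimed for data this small.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Stand-in for compressed baskets: eight independently compressed byte chunks.
chunks = [bytes([i]) * 1_000_000 for i in range(8)]
compressed = [zlib.compress(c) for c in chunks]

# zlib releases the GIL while it decompresses, so worker threads can overlap.
# executor.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as executor:
    restored = list(executor.map(zlib.decompress, compressed))

print(restored == chunks)
```

An executor like this one is the kind of object that would be passed as `decompression_executor`, either to `arrays` or when opening the file.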
As for the "NotImplementedError: memberwise serialization" error message, that's what I was talking about above: issue #38 is for me to solve and implement memberwise deserialization. Uproot 3 couldn't read that data layout, either—in fact, it was a mystery to me, as it appeared in only a small number of files over the years. It wasn't until July 2020 (the date of that issue) that I found out it's a ROOT feature, an alternate way of writing files that only a few people have ever switched on.
@jpivarski then I have been playing with the right parameters but the problem is that before I was able to do parallel processing and it was also memory efficient.
`from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(8)
branches = pd.DataFrame.from_dict(uproot.open(file_with_path)[tree_name].arrays(namedecode='utf-8', executor=executor))`
But now it consumes all my memory; maybe I am not doing it properly. Could you please have a look at it?
`from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(8)
input_tree = uproot.open('/home/shahid/cbmsoft/Data/10k_events_PFSimplePlainTree.root:PlainTree', decompression_executor=executor)
branches = input_tree.arrays(library='pd', decompression_executor=executor)`
In this thread, we've been talking about several different files. The one you provided couldn't have been read by Uproot 3, so you're not talking about that one. (The one you posted above has trees named "Configuration," "aTree," and "DataHeader," not "PlainTree.")
But anyway, I assume that what you mean by "memory efficient" is that it worked in Uproot 3 but is running out of memory in Uproot 4. From all that I know right now, the difference might be some 10% but it's 10% more than you have available. The whole idea of parallel processing is to trade memory for speed, since you're asking for 8 times the working space to be used at once. Maybe Uproot 3 wasn't as well parallelized—the biggest 3 → 4 difference is that the low-level physical layer (getting bytes from files) was streamlined to use knowledge of which TBaskets you will be reading to request all the bytes while others are in flight—Uproot 3 could have been prevented from using 8 times the working memory, as requested, because it was waiting for data from the file. (Without replicating the process in some performance-tuning diagnostic, we can only speculate about what's actually happening, but using more memory while parallel processing can be good news, rather than bad.)
The `uproot.TTree.arrays` method that you're calling isn't specifying `filter_name` or `expressions`, so it's reading everything into memory. Do you need all branches in the DataFrame? That kind of question is often more fruitful than carefully tuning performance with a memory profiler. If you only use half the variables, that's an easy factor of two.
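A minimal sketch of restricting the read to a subset of branches might look like the following. The wrapper `read_selected` is a made-up helper, and the file path, tree name, and branch patterns in the comment are placeholders rather than names taken from the actual file; it only assumes that `tree` behaves like an Uproot 4 `TTree`, whose `arrays` method accepts `filter_name`, `library`, and `decompression_executor`.

```python
def read_selected(tree, patterns, library="pd", executor=None):
    """Read only the branches whose names match the given glob patterns.

    `tree` is expected to behave like an uproot.TTree (Uproot 4).
    """
    return tree.arrays(
        filter_name=patterns,
        library=library,
        decompression_executor=executor,
    )


# Intended use (file path, tree name, and branch patterns are placeholders):
#
#   import uproot
#   tree = uproot.open("some_file.root:PlainTree")
#   df = read_selected(tree, ["*_px", "*_py", "*_pz"])
```

Only the matching branches are read and decompressed, so memory use scales with the selection rather than with the whole tree.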
This is becoming a discussion that is unrelated to your original problem of not being able to read certain branches. Maybe ask on a GitHub Discussion or StackOverflow with the `[uproot]` tag? (I'm hoping other users will be able to help with these usage questions.)
It looks like this issue was taken up in #277 and solved. If I'm mistaken, please open a new issue with whatever is still remaining to be solved. Thanks!
I have a complicated tree with branches and sub-branches. Some of the branches are recognized and picked up by uproot while others aren't, for example the momentum distribution branch fPx.
PyROOT shows that they are Double32_t type branches.
Could you help me somehow convert these sub-branches to arrays?