scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Trying to parse the root file using rust and asking some questions. #1261

Open Esword618 opened 1 month ago

Esword618 commented 1 month ago

I found that parsing ROOT files in Python is a bit slow, so I would like to try parsing them in Rust and then expose that through Python bindings. However, I don't know how to start parsing a ROOT file; what material would you recommend? (I'm not a deep ROOT scholar.)

jpivarski commented 1 month ago

This is an interesting, though large, undertaking. We talked about "compiled Uproot" options at last year's PyHEP.dev: https://github.com/HSF/PyHEP.dev-workshops/issues/15. In 2018, I knew that @cbourjau was working on a ROOT file reader in Rust, under alice-rs/root-io, and it looks like that work was still active last year.

If you want to accelerate hotspots in Uproot with Rust, it will take some thinking to figure out where we can insert calls into compiled code, get meaningful results back to Python in a format Python can use, and have the whole process actually be faster. I'm not against this, but we'd have to be judicious about it or we might not have any bottom-line speed-up to show for it.

One place where a similar speed-up was made was in complex TTree interpretation. TTrees with simple data types are just cast as NumPy or Awkward arrays (which doesn't even need a traversal over the data, compiled or otherwise), but complex types used to be deserialized with auto-generated Python code and are now deserialized with auto-generated AwkwardForth code. AwkwardForth is another interpreted language (to allow for runtime code generation), but a faster one than Python (arXiv:2102.13516 and arXiv:2303.02202).

All of the past focus on performance in Uproot addressed the bulk TTree contents, the part of the problem that scales as $\mathcal{O}(n)$, where $n$ is the number of entries in the TTrees. Everything else has always been pure Python, in some cases auto-generated, in other cases a static implementation. From what I have seen, this $\mathcal{O}(1)$ part, everything Uproot has to do from the moment the file is first opened, is where the remaining performance issues are.

I can see how a complete rewrite of Uproot in Rust would work, using PyO3 to expose the same Python interface. That kind of project would be in the style of flake8 → ruff, pip → uv, etc., and it would be a good idea if it were not such a massive undertaking (as well as making it more difficult to add small quality-of-life features: Rust code is harder to change than Python code).

It's harder to see how hotspot-fixes would work. Conda + libmamba is an example of that, in which only one part of conda, the expensive dependency resolution, was replaced. (Micromamba and pixi are different stories.) But there doesn't seem to be a good factorization point here: Python-Uproot will need Python classes to operate. Even if they can be interpreted much more quickly in Rust, they still have to be exposed as Python data to function.

Here's an idea for an isolated project that could be useful: the uproot.num_entries function does the absolute minimum needed to open a ROOT file containing a TTree and then report the number of entries in that TTree. It's something that we often need to do for large datasets, so it needs a fast-path. It's not tightly coupled to the rest of Uproot; it could be replaced by a Rust function without upsetting much. The implementation is here:

https://github.com/scikit-hep/uproot5/blob/90039eadb05a8df7420d42dfd05300fcd854623a/src/uproot/models/TTree.py#L944-L973

What do you think of using this as a starter project? You'll need to learn how to interpret TFile headers, TDirectory data, TKeys, and only a little TTree metadata (by examining Uproot, UnROOT.jl, groot (Go), and asking questions).
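
Just to give a sense of the very first step, a minimal sketch of reading the start of the TFile header could look like the following (standard library only; the struct and function names are purely illustrative, and it only decodes the first three fields, which are big-endian like everything in ROOT):

```rust
use std::fs::File;
use std::io::{self, Read};

/// Illustrative container for the first few TFile header fields.
struct FileHeader {
    version: i32, // fVersion; values >= 1_000_000 mean the file uses 64-bit seek pointers
    begin: i32,   // fBegin; byte offset of the first record in the file
}

/// Read just enough of the header to know the format version and where the
/// first record begins. All integers in a ROOT file are big-endian.
fn read_file_header(path: &str) -> io::Result<FileHeader> {
    let mut f = File::open(path)?;
    let mut buf = [0u8; 12];
    f.read_exact(&mut buf)?;

    // The first four bytes are the magic string "root".
    if &buf[0..4] != b"root" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a ROOT file"));
    }
    let version = i32::from_be_bytes(buf[4..8].try_into().unwrap());
    let begin = i32::from_be_bytes(buf[8..12].try_into().unwrap());
    Ok(FileHeader { version, begin })
}
```

From there, fBegin points at the first record in the file, and the rest of the num_entries fast-path is a matter of following TKeys down to the TTree metadata.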

Esword618 commented 1 month ago

@jpivarski I want to convert the parsed data into pola-rs format because I want to process and filter the data with Polars functionality. Sometimes the ROOT file is very big and can't be parsed all at once, so it could be processed with Rust's iterative parsing. This is my basic idea; what do you think, and do you have any suggestions?

cbourjau commented 1 month ago

alice-rs is built using nom parser combinators. It was a really fun project that I would have loved to push further, but I have since left the field. I would certainly recommend such a tool stack if you'd like to go down that route. However, be warned that parsing ROOT files can get very tricky (regardless of the language) if the data contained are custom, possibly generic, C++ classes. But hey: we are not doing this because it is easy but because we thought it would be easy ;)
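
For a flavor of what the combinator style looks like on binary data, here is a tiny illustrative parser (not code from alice-rs) for the first few header fields:

```rust
use nom::bytes::complete::tag;
use nom::number::complete::be_i32;
use nom::IResult;

/// Illustration only: parse the "root" magic bytes and the first two
/// big-endian integers of a ROOT file header with nom combinators.
fn root_header(input: &[u8]) -> IResult<&[u8], (i32, i32)> {
    let (input, _) = tag(&b"root"[..])(input)?; // magic bytes
    let (input, version) = be_i32(input)?;      // fVersion
    let (input, begin) = be_i32(input)?;        // fBegin
    Ok((input, (version, begin)))
}
```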

Esword618 commented 1 month ago

@cbourjau Okay, thank you for your help.

denehoffman commented 1 month ago

I've looked into this a bit after finding @cbourjau's work on alice-rs. There is another project called oxyroot by @m-dupont, which I have used with varying levels of success (it usually works fine). The major difficulty is of course the streamers: these are very difficult to implement the way uproot or UnROOT.jl does because Rust is not dynamically typed (among other reasons). It's also not easy to write libraries like this because, as far as I can tell, the actual format of a ROOT file is poorly documented. Even the docs (click the dropdown near the top that says "ROOT file data format specification") don't really tell you much, and they in fact include some errors (there is an undocumented two-byte gap between fNbytesInfo and fUUID, for example). The only way I've found to actually learn how to parse ROOT files is to either read the source code for TFiles, TTrees, streamers, etc. (it's pretty chaotic) or read someone else's code (like uproot), which lays it out in a clearer manner. If someone has a better resource for this, feel free to correct me, but it's a bit disappointing that the documentation for the file format used in 99% of particle physics analyses is so limited.

jpivarski commented 1 month ago

We should also mention go-hep's groot (a ROOT I/O implementation in Go). The code quality of this project is excellent, and it's where I learned enough about the ROOT format to write Uproot.

It's true that expressing data with TStreamers is inherently dynamic. The question of static types versus dynamic types is fundamental; it can't be brushed away with some tool. You either know enough about the data types at compile time to generate specialized code or you don't know until runtime. Uproot and UnROOT effectively do things like JIT-compilation (although Uproot's AwkwardForth is more of a fast interpreter than true JIT-compilation), and many earlier projects involved a phase in which you pointed a code-generator at a file and generated some code to compile. ROOT itself uses JIT-compilation, which is why ROOT I/O can't be packaged separately from LLVM.

Without dynamic TStreamers, it wouldn't be possible to encode arbitrary data types. That's essential for some applications, though end-stage analysis workflows have been moving toward simple data types like NanoAOD and PHYSLITE.

A ROOT I/O project has to define some boundaries. Perhaps it can focus on NanoAOD/PHYSLITE-like files exclusively—then it doesn't need arbitrary data types, just the versions of TTree, TBranch, etc. from the last few years. At the start of the Uproot project (2017), I compiled a set of ROOT versions up to 5 years old (so, starting in 2012):

and found that this only spanned a few versions of the TTree class:

So Uproot now hard-codes them, just to avoid the time needed to read TStreamers and generate code for the common cases.

A Rust-based implementation could target these TTree (and TBranch, ...) versions and just raise an error when an unrecognized class version is encountered. The rate at which these things are added is not high.
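
Sketched out, the dispatch could be as simple as the following (the version numbers and type names here are placeholders, not the real surveyed set):

```rust
/// Illustrative, pared-down TTree metadata; the real class has many more members.
#[derive(Debug, Default)]
struct TTreeMeta {
    entries: i64,
}

#[derive(Debug)]
enum ParseError {
    UnsupportedVersion(u16),
}

/// Dispatch on the class version read from the file, supporting only the
/// hard-coded versions and erroring out on anything unrecognized.
fn parse_ttree(version: u16, payload: &[u8]) -> Result<TTreeMeta, ParseError> {
    match version {
        // Placeholder version numbers; the real set would come from surveying
        // recent ROOT releases, as described above.
        19 | 20 => parse_ttree_recent(version, payload),
        v => Err(ParseError::UnsupportedVersion(v)),
    }
}

fn parse_ttree_recent(_version: u16, _payload: &[u8]) -> Result<TTreeMeta, ParseError> {
    // Field-by-field deserialization for the supported versions would go here.
    Ok(TTreeMeta::default())
}
```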

denehoffman commented 1 month ago

Hard-coding them seems like the way to go, I agree. I believe alice-rs had the interesting approach of using a separate script plus some fancy macros to generate Rust code from a ROOT file given as input, but it's a bit buggy and, as mentioned, not under active development anymore (though it's a great place to learn a bit!). I've been trying my hand at making my own parser; if it ever comes to fruition I'll let you all know about it haha

cbourjau commented 1 month ago

It depends on the use case, but anything parsed with a dynamic streamer is essentially an object without any member functions. After all, if you were to know enough about the layout to implement member functions, you could just as well parse the data. Thus, I found for myself the streamers to be at best useful as hints for the layout, but ultimately, you need to know something about the layout to do anything useful with the parsed data anyhow.

Also, I cannot recommend nom enough for writing binary parsing code in Rust! It was a joy to use!

alice-rs got a little off the rails with the later commits, I must admit :D I was trying some async stuff with the goal of running an analysis on ALICE Open Data in the browser using WASM, but it made everything rather complicated...

jpivarski commented 1 month ago

> It depends on the use case, but anything parsed with a dynamic streamer is essentially an object without any member functions.

Mostly. There's also a case in which you know about the class but you don't know a particular version of that class. For instance, suppose a new TTree class version comes out tomorrow, and it adds one more data member, fWhatever. We've written member functions to do interesting, useful things with all of the data members other than fWhatever but we need to generate new code to implement the new fWhatever-aware deserialization routine. Once deserialized, we should be able to continue using the member functions that make use of all of the other data members.

Uproot does this by splitting member functions and deserialization routines into "behaviors" and "models", which are essentially traits and concrete classes. I wonder if it would be possible to write dynamic deserialization routines in Rust that fill a dynamic structure such as a HashMap and then copy or move the deserialized data from that structure into a predefined class, taking only the recognized member data (i.e. ignoring fWhatever)? Maybe with nom?
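
Something along these lines is what I have in mind (all of the type names are made up; only fEntries and fTotBytes are real TTree member names):

```rust
use std::collections::HashMap;

/// Illustrative dynamic value produced by a streamer-driven reader.
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Double(f64),
    Str(String),
}

/// Illustrative static model that only knows about the members we care about.
#[derive(Debug, Default)]
struct TreeModel {
    entries: i64,
    tot_bytes: i64,
}

/// Move the recognized members out of the dynamic map and silently ignore the
/// rest (e.g. a hypothetical future member "fWhatever").
fn into_model(mut fields: HashMap<String, Value>) -> TreeModel {
    let mut model = TreeModel::default();
    if let Some(Value::Int(n)) = fields.remove("fEntries") {
        model.entries = n;
    }
    if let Some(Value::Int(n)) = fields.remove("fTotBytes") {
        model.tot_bytes = n;
    }
    // Whatever is left in `fields` (unrecognized members) is simply dropped.
    model
}
```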

denehoffman commented 1 month ago

> I wonder if it would be possible to write dynamic deserialization routines in Rust that fill a dynamic structure such as a HashMap and then copy or move the deserialized data from that structure into a predefined class, taking only the recognized member data (i.e. ignoring fWhatever)? Maybe with nom?

In Rust you can do something like this with a HashMap of Box<dyn Trait>, but abstracting over multiple types of behaviors is tough. In theory it can be accomplished with an enum containing those boxed traits, and that could be done separately for each main type with a map over versions, if that's what you're getting at. Or you could use supertraits to fake the inheritance structures in ROOT. The aversion to OOP in Rust makes it tough to represent structures that are easy in C++ 🥲
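
For concreteness, the supertrait version looks something like this (the type and trait names are entirely illustrative):

```rust
/// Supertraits faking the ROOT inheritance chain TNamed : TObject.
trait TObjectLike {
    fn class_name(&self) -> &str;
}

trait TNamedLike: TObjectLike {
    fn name(&self) -> &str;
    fn title(&self) -> &str;
}

/// A concrete model that a deserializer might fill in.
struct NamedModel {
    name: String,
    title: String,
}

impl TObjectLike for NamedModel {
    fn class_name(&self) -> &str {
        "TNamed"
    }
}

impl TNamedLike for NamedModel {
    fn name(&self) -> &str {
        &self.name
    }
    fn title(&self) -> &str {
        &self.title
    }
}

/// A heterogeneous collection of deserialized objects, used through the
/// supertrait interface (same spirit as a map of Box<dyn Trait> values).
fn print_all(objects: &[Box<dyn TNamedLike>]) {
    for obj in objects {
        println!("{} '{}': {}", obj.class_name(), obj.name(), obj.title());
    }
}
```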

jpivarski commented 1 month ago

The Models in Uproot are called "models" because they don't reproduce the OOP structure from C++ in Python, even though both are OOP languages. (It's because I didn't want the Python classes to have the same inheritance structure as the C++ classes. In earlier versions, it led to accidental inheritance of TH1 methods into TH2, for instance.)

In Uproot, the C++ superclasses of a modeled C++ class are in a Python list. It's a list to allow for multiple inheritance—distant ancestors are nested in these lists. Thus, we have a whole tree of method-resolution-order. That technique would port to Rust easily, though it means dynamically navigating a list for each access, rather than immediate access.
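
In Rust, that might look roughly like the following toy sketch (not a proposal for the real layout):

```rust
use std::collections::HashMap;

/// Toy version of a model that carries its own members plus a list of
/// base-class models; distant ancestors are nested inside the bases.
struct Model {
    members: HashMap<String, f64>,
    bases: Vec<Model>,
}

impl Model {
    /// Look in this class first, then depth-first through the superclasses,
    /// which is essentially a dynamic method-resolution-order walk.
    fn member(&self, name: &str) -> Option<f64> {
        if let Some(value) = self.members.get(name) {
            return Some(*value);
        }
        self.bases.iter().find_map(|base| base.member(name))
    }
}
```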

Since the Rust implementation is motivated by speed of metadata handling, all of these little traversals would add up. It's probably best to make the Rust implementation static, based on the few class-version combinations that one would find in NanoAOD/PHYSLITE files.


Remember that an uproot.num_entries implementation would be a limited-scope first project that would bring immediate value. Optimizing the time to get the number of TTree entries from a file is useful for setting partition boundaries in a Dask job or computing a quantity that would be a denominator or normalizing factor in some calculation. The fEntries member of TTree is near the beginning of its serialized form, before any of the class version-dependent differences appear, so this project could completely ignore TStreamers and dynamic classes.

Speaking of which, for large-scale data processing, we've been finding that it would be better if Uproot were split into two parts: one to crawl through the metadata and get TBasket positions and another to read, decompress, and interpret TBaskets in bulk. Tiled-Uproot is intended to put a database in between the two steps, so that a collection of ROOT files can be indexed for faster processing later.

The uproot.num_entries project described above is 90% of the way toward the first half of a split Uproot. The second half is not dependent on metadata handling (and @lgray has been exploring options to handle this half exclusively in GPUs).

Esword618 commented 1 month ago

Okay, thank you very much for your suggestions. My original intention was to use Rust to parse ROOT files, then wrap that as a Python library and combine it with PyTorch for machine learning. That's why I had this idea.