Closed tamasgal closed 4 years ago
Work on uproot 4 hasn't started, but a good way to get ready for it would be to use uproot as-is to make awkward0 arrays, then convert those to awkward1 with a conversion function that's almost finished: scikit-hep/awkward-1.0#135.
A bigger trouble is that the Awkward 1.0 deployment procedure is currently broken (completely revamped last weekend; not quite recovered yet), so you'll have to wait for the above PR, which is minutes away from being done, and the deployment procedure, which is a bit more open-ended. My intention is to get that fixed today. Or you can install from source, which doesn't rely on the deployment procedure working.
But you know, I do appreciate the offer and I'll point you to the Uproot 4 branch when I start working on it. According to the schedule, I should start working on it one month from now: April. March is for finishing Awkward and helping out with vector, the replacement for uproot-methods/TLorentzVector.
On Uproot, the necessary work will be replacing Awkward0 array generation with Awkward1 array generation, and Awkward arrays will become an "extras" dependency (not strictly required, but highly recommended). Thus, a "base" Uproot installation would only be able to serve NumPy arrays, just as a "base" installation can only decompress GZIP and LZMA (in Python 3), not LZ4 or ZSTD, since those packages are considered "extras." The recommended Uproot installation procedure would become
pip install uproot[all]
with non-[all] for special cases; for running it in limited environments. [all] would be equivalent to an installation via conda: all compression methods, Awkward, the requests library (HTTP), and XRootD (if possible) would be included. (If XRootD can't be distributed via pip, then it's not possible.)
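To illustrate the "extras" idea above: an optional package is typically imported only at the point where it is needed, and a missing package produces an error that says how to install it. This is a minimal sketch under assumptions, not Uproot's actual code; the helper name `require_extra` and the package name `definitely_not_installed_pkg` are hypothetical and used only for illustration.

```python
import importlib


def require_extra(module_name, extra="all"):
    """Import an optional dependency or explain how to install it.

    Hypothetical helper for illustration; not part of Uproot's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "{0} is required for this feature; install it with\n\n"
            "    pip install uproot[{1}]".format(module_name, extra)
        )


# A standard-library module is always importable, so this succeeds.
zlib = require_extra("zlib")

# A missing optional package yields an actionable message instead.
try:
    require_extra("definitely_not_installed_pkg")
except ImportError as err:
    message = str(err)
```

With this pattern, a "base" installation keeps working for the features it can serve and only fails, with a clear hint, when a feature that needs an extra is actually requested.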
I should probably do that surgery, but the Uproot 3 → 4 transition opens the possibility for other, minor compatibility-breaking changes. One thing is that Python 3 strings, rather than bytestrings, will be presented to the user everywhere (assuming utf-8 encoding with "surrogateescape," which doesn't fail on wrong encodings).
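For reference, this is how the "surrogateescape" error handler behaves in plain Python 3. It is a small sketch of the standard-library behavior only; the byte string is made up for illustration and says nothing about how Uproot 4 will apply it internally.

```python
# Bytes as they might appear in a file: valid UTF-8 followed by a stray byte.
raw = b"muon_pt\xff"

# Strict decoding fails on the invalid byte...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...whereas "surrogateescape" maps it to a lone surrogate code point.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler recovers the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")
```

Because the round trip is lossless, names with a wrong encoding can still be presented as Python 3 strings and written back unchanged.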
If you'd like to work on this or have other, minor interface changes, let's talk. Apart from the surgery of replacing Awkward0 with Awkward1, I'd like the main users of Uproot to contribute to the interface, since they know best what they want it to look like.
Thanks for the roadmap summary, sounds really promising.
I will have a look at the conversion function (https://github.com/scikit-hep/awkward-1.0/pull/135), but today I already successfully tossed those nasty object-type arrays into awkward1.Arrays and it works like a charm ;)
In [15]: import uproot
In [16]: f = uproot.open("tests/samples/aanet_v2.0.0.root")
In [17]: arr = f['E']['Evt']['trks']['trks.fitinf'].array()
In [18]: arr[:, 0]
Out[18]: <ChunkedArray [[0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] ... [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0]] at 0x00011a4a6390>
In [19]: awr = ak.Array(arr)
In [20]: awr[:, 0]
Out[20]: <Array [[0.00496, 0.00342, ... 1.84e+03, 54]] type='10 * var * float64'>
I am wondering, though, how uproot 4 will deal with lazyarrays in combination with awkward1?
...another question: do you plan to drop Python 2 support in uproot 4? I guess it's about time 😉
So far I am very happy with uproot as a low-level library. We managed to write a lightweight wrapper which is easily maintainable and offers a very natural, user-friendly way to access our ROOT data files. The split levels are still causing headaches for a particular branch in our dataset, but I guess I will postpone that until the transition to uproot 4 (https://github.com/scikit-hep/uproot/issues/433).
For lazyarrays: Awkward 1 needs ChunkedArrays and VirtualArrays to behave like Awkward 0. I've had a lot of conversations with @nsmith- about how to do that properly.
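As a rough picture of what the two pieces do together, here is a toy sketch in pure Python. The classes below only borrow the names ChunkedArray and VirtualArray; they are simplified stand-ins written for this illustration (integer indexing only) and are not Awkward's implementations. The `make_loader` function and the `reads` list are likewise hypothetical.

```python
class VirtualArray(object):
    """Holds a loader function and calls it only on first access."""

    def __init__(self, loader, length):
        self._loader = loader
        self._length = length
        self._data = None

    def __len__(self):
        return self._length  # known without reading any data

    @property
    def materialized(self):
        return self._data is not None

    def __getitem__(self, index):
        if self._data is None:
            self._data = self._loader()  # read once, then cache
        return self._data[index]


class ChunkedArray(object):
    """Presents several chunks as one logical sequence."""

    def __init__(self, chunks):
        self.chunks = chunks

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # Walk the chunk lengths; only the matching chunk is touched.
        for chunk in self.chunks:
            if index < len(chunk):
                return chunk[index]
            index -= len(chunk)


reads = []  # records which chunks were actually loaded


def make_loader(start, stop):
    def loader():
        reads.append((start, stop))
        return list(range(start, stop))
    return loader


array = ChunkedArray([
    VirtualArray(make_loader(0, 3), 3),
    VirtualArray(make_loader(3, 6), 3),
    VirtualArray(make_loader(6, 9), 3),
])

value = array[4]  # falls in the second chunk
```

After the last line, only the second chunk has been read; the other two remain unloaded, which is the behavior lazy arrays rely on.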
For Python 2 support: actually, with Uproot 4 not strictly depending on Awkward, we can relax its Python constraint to 2.6! I won't be advertising that, though. This would make it quietly work on all sorts of ancient systems, though we will only officially support Python 2.7 and recent versions of Python 3. The latter is driven by Awkward's dependence on NumPy 1.13.1, which in turn has a minimum Python version of 2.7. We can only guarantee feature-completeness in Python 3, but for a limited set of features, with good error messages if you try to go beyond that set, it won't break in old versions.
Btw. (sorry for the late answer) my question regarding Python 2 was more like: I'd definitely ditch Python 2 support in favour of less maintenance work. NumPy also had some serious memory leaks with e.g. recarrays which totally broke our pipelines (we heavily use them in our data structures) and they were fixed in later versions not supporting Python 2 (https://github.com/numpy/numpy/issues/13853).
I was expecting that uproot4 will only support 3.5+. Don't you think that with this major leap it would be a good idea to get rid of legacy dependencies or is this a project requirement (probably driven by the use-cases in HEP)? In our collaboration (which is more astroparticle than HEP) we successfully managed to get rid of Python 2 dependencies, but we are only a few hundred people and use Python mostly for high level analysis.
On the Python 3 side, we'll be picking a rather high minimum, perhaps 3.5, like you said. Early Python 3 was volatile and hard to support.
But supporting Python 2.7 and even 2.6 is just a matter of not using certain idioms. Similarly, there's very little that we need from modern NumPy versions: Awkward needed NEP13, which comes in NumPy 1.13.1, but Awkward 1 is a bigger dependency and so it will be optional for Uproot (some users don't have jagged arrays). For everything else, we could go all the way back to NumPy 1.8 or so. There are circumstances where people need to work in such old versions (weird circumstances, like a DAQ machine not connected to the network or Python running on an iPad), and I'm only considering it because it's so easy, with very little maintenance burden. It's because Uproot without Awkward depends on so little that the minimum Python and NumPy versions can be pushed back so far.
I would never recommend a user doing analysis to use Python 2, and old NumPy certainly could have issues like the memory leak that you mentioned. But this is the difference between application and library: in an application, use the latest versions to get the best software; in a library, depend on as little as you can to get the job done to avoid putting unnecessary constraints on applications. After all, somebody's going to want to open ROOT files on their iPad.
Yes I fully agree with you, thanks for sharing these thoughts.
At this very moment the most interesting feature, at least for us, is fancy indexing with nested data. We are trying our best to build wrapper classes around Uproot to make the user interface behave like awkward1.Arrays in case of these "annoying" nested lists (https://github.com/scikit-hep/awkward-array/issues/229), but we have to sacrifice laziness; at least until now we did not find a way to map lazyarrays of these ChunkedArrays without invoking a full readout of the data.
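To make the problem concrete, the following pure-Python sketch shows the behavior being asked for: an operation like `[:, 0]` that is applied per chunk, and only when that chunk is requested. All names here (`LazyMap`, `first_of_each`, `make_loader`, `loaded`) are hypothetical and exist only for this illustration; this is not how Uproot or Awkward implement it.

```python
def first_of_each(chunk):
    """Equivalent of chunk[:, 0] for a chunk of non-empty nested lists."""
    return [sublist[0] for sublist in chunk]


class LazyMap(object):
    """Applies func to a chunk only when that chunk is requested."""

    def __init__(self, loaders, func):
        self._loaders = loaders
        self._func = func
        self._cache = {}

    def chunk(self, i):
        if i not in self._cache:
            # Load and transform this chunk only; others stay on disk.
            self._cache[i] = self._func(self._loaders[i]())
        return self._cache[i]


loaded = []  # records which chunks were actually read


def make_loader(i, data):
    def loader():
        loaded.append(i)
        return data
    return loader


loaders = [
    make_loader(0, [[1.0, 2.0], [3.0]]),
    make_loader(1, [[4.0, 5.0, 6.0], [7.0, 8.0]]),
]

firsts = LazyMap(loaders, first_of_each)
result = firsts.chunk(1)  # reads and transforms the second chunk only
```

The design point is that the transformation is stored, not executed, until a chunk is accessed, so wrapping the data does not trigger a full readout.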
So, it remains unclear how this will be integrated into Uproot; we can't wait to try it out 😄
Looking forward to Uproot 4! (and of course Awkward 1 😉)
Nice!
Sorry for my ignorance, maybe this is already mentioned somewhere but I couldn't find it. Is there a dev-branch for uproot 4? I am dealing with some nested awkward arrays and went through the awkward-1 resources (also https://github.com/scikit-hep/awkward-1.0/blob/master/docs/demos/2020-01-22-numba-demo-EVALUATED.ipynb which is impressive) and it seems that this will solve all my issues automatically (see https://github.com/scikit-hep/awkward-array/issues/229).
Anyways, I would be happy to try uproot 4 (alpha) and maybe also contribute if possible.