scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

hadd on files written with uproot creates a file with broken branches #756

Open JohanWulff opened 2 years ago

JohanWulff commented 2 years ago

Each of the two attached .root files opens fine on its own:

In [1]: import uproot

In [2]: f = uproot.open("./FD1F1FC5-0A2F-6445-B49F-BE0DE70B41B9_MA.root")

In [3]: f['tout']['Muon_pt'].array()
Out[3]: <Array [] type='0 * float64'>

In [4]: g = uproot.open("./FDF4838A-7644-014B-B2CD-1B2747CC43C3_MA.root")

In [5]: g['tout']['Muon_pt'].array()
Out[5]: <Array [32.5, 37.4, 34.1] type='3 * float64'>

After running

hadd test.root FD1F1FC5-0A2F-6445-B49F-BE0DE70B41B9_MA.root FDF4838A-7644-014B-B2CD-1B2747CC43C3_MA.root

which reports "Error in <TBranch::AddBasket>: An out-of-order basket matches the entry number of an existing basket.", the TTree 'tout' in the resulting file is broken:

In [1]: import uproot

In [2]: f = uproot.open("./test.root")

In [3]: f['tout']['Muon_pt'].array()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [3], line 1
----> 1 f['tout']['Muon_pt'].array()

File ~/anaconda3/envs/shep/lib/python3.10/site-packages/uproot/behaviors/TBranch.py:2208, in TBranch.array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   2202             for (
   2203                 basket_num,
   2204                 range_or_basket,
   2205             ) in branch.entries_to_ranges_or_baskets(entry_start, entry_stop):
   2206                 ranges_or_baskets.append((branch, basket_num, range_or_basket))
-> 2208 _ranges_or_baskets_to_arrays(
   2209     self,
   2210     ranges_or_baskets,
   2211     branchid_interpretation,
   2212     entry_start,
   2213     entry_stop,
   2214     decompression_executor,
   2215     interpretation_executor,
   2216     library,
   2217     arrays,
   2218     False,
   2219 )
   2221 _fix_asgrouped(
   2222     arrays, expression_context, branchid_interpretation, library, None
   2223 )
   2225 if array_cache is not None:

File ~/anaconda3/envs/shep/lib/python3.10/site-packages/uproot/behaviors/TBranch.py:3493, in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays, update_ranges_or_baskets)
   3490     pass
   3492 elif isinstance(obj, tuple) and len(obj) == 3:
-> 3493     uproot.source.futures.delayed_raise(*obj)
   3495 else:
   3496     raise AssertionError(obj)

File ~/anaconda3/envs/shep/lib/python3.10/site-packages/uproot/source/futures.py:36, in delayed_raise(exception_class, exception_value, traceback)
     32 def delayed_raise(exception_class, exception_value, traceback):
     33     """
     34     Raise an exception from a background thread on the main thread.
     35     """
---> 36     raise exception_value.with_traceback(traceback)

File ~/anaconda3/envs/shep/lib/python3.10/site-packages/uproot/behaviors/TBranch.py:3463, in _ranges_or_baskets_to_arrays.<locals>.basket_to_array(basket)
   3460 basket = None
   3462 if len(basket_arrays) == branchid_num_baskets[branch.cache_key]:
-> 3463     arrays[branch.cache_key] = interpretation.final_array(
   3464         basket_arrays,
   3465         entry_start,
   3466         entry_stop,
   3467         branch.entry_offsets,
   3468         library,
   3469         branch,
   3470     )
   3471     # no longer needed, save memory
   3472     basket_arrays.clear()

File ~/anaconda3/envs/shep/lib/python3.10/site-packages/uproot/interpretation/numerical.py:88, in Numerical.final_array(self, basket_arrays, entry_start, entry_stop, entry_offsets, library, branch)
     86     local_stop = stop - start
     87     basket_array = basket_arrays[basket_num]
---> 88     output[: stop - entry_start] = basket_array[local_start:local_stop]
     90 elif start <= entry_stop <= stop:
     91     local_start = 0

ValueError: could not broadcast input array from shape (0,) into shape (1,)
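The final ValueError comes from a NumPy slice assignment, and it can be reproduced without any ROOT file. The sketch below only assumes that the output buffer expects one entry while the basket decoded to zero values; whether that is exactly what happens inside test.root is not confirmed here, and the variable names simply mirror those in the traceback.

```python
import numpy as np

# Assumption for illustration: the output buffer was sized for one entry,
# but the basket that should supply it decoded to zero values.
output = np.empty(1, dtype=np.float64)
basket_array = np.array([], dtype=np.float64)

try:
    # Same shape of assignment as in Numerical.final_array:
    # a length-1 destination slice receives a length-0 source slice.
    output[:1] = basket_array[0:1]
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)
```

A zero-length source cannot be broadcast into a one-element destination, so NumPy refuses the assignment rather than leaving the slot unfilled.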

FD1F1FC5-0A2F-6445-B49F-BE0DE70B41B9_MA.root.txt FDF4838A-7644-014B-B2CD-1B2747CC43C3_MA.root.txt

Moelf commented 2 years ago

duplicate of

What version of uproot are you on?

jpivarski commented 2 years ago

This is a further elaboration of the problem, or a different but similar problem, or something. It's following up on Gitter (@johanwulff is the same author).

JohanWulff commented 2 years ago

The uproot version is 4.3.7, so it should include the latest bugfix, which would also explain why the error message is different now.

jpivarski commented 8 months ago

Since this issue is about something wrong in the way Uproot writes files, we'd need the file-writing step to be part of the reproducer. I just tried it in the latest Uproot (5.3.2 == main) and there's no error:

>>> import uproot
>>> import numpy as np
>>> f = uproot.recreate("one.root")
>>> f["tree"] = {"branch": np.array([], dtype=np.float64)}
>>> g = uproot.recreate("two.root")
>>> g["tree"] = {"branch": np.array([32.5, 37.4, 34.1], dtype=np.float64)}
% hadd test.root one.root two.root 
hadd Target file: test.root
hadd compression setting for all output: 1
hadd Source file 1: one.root
hadd Source file 2: two.root
hadd Target path: test.root:/
>>> import uproot
>>> uproot.open("test.root")["tree"].arrays().show(type=True)
type: 3 * {
    branch: float64
}
[{branch: 32.5},
 {branch: 37.4},
 {branch: 34.1}]

Or... not? The file looks odd in ROOT 6.30/04:

>>> import ROOT
>>> f = ROOT.TFile("test.root")
>>> t = f.Get("tree")
>>> t.Scan()
************************
*    Row   * branch.br *
************************
*        0 *         0 *
*        1 *      37.4 *
*        2 *      34.1 *
************************
3

whereas the two files, individually, look okay:

>>> import ROOT
>>> f = ROOT.TFile("one.root")
>>> t = f.Get("tree")
>>> t.Scan()
************************
*    Row   * branch.br *
************************
************************
0
>>> import ROOT
>>> f = ROOT.TFile("two.root")
>>> t = f.Get("tree")
>>> t.Scan()
************************
*    Row   * branch.br *
************************
*        0 *      32.5 *
*        1 *      37.4 *
*        2 *      34.1 *
************************
3

This is not the original issue (a lot has changed since then; maybe something came along that fixed it), but it's a new one or a related one.


Digging a little deeper, we see the expected low-level data:

>>> import uproot
>>> branch = uproot.open("test.root")["tree"]["branch"]
>>> branch.num_baskets
2
>>> branch.basket(0).data
array([], dtype=uint8)
>>> branch.basket(1).data
array([ 64,  64,  64,   0,   0,   0,   0,   0,  64,  66, 179,  51,  51,
        51,  51,  51,  64,  65,  12, 204, 204, 204, 204, 205], dtype=uint8)
>>> branch.basket(1).data.view(">f8")
array([32.5, 37.4, 34.1], dtype='>f8')

For this simple data type (double per entry), the TBasket consists entirely of numeric data (after the TKey and basket header), no offsets or anything like that, so it should be a raw dump of the numbers, as we have here.
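As a cross-check that needs neither Uproot nor NumPy, the same 24 bytes can be decoded with Python's standard struct module. This is only a sketch: the byte values are copied from the branch.basket(1).data output above, and the variable names are made up for illustration.

```python
import struct

# Raw payload of the second TBasket, copied from the uint8 array above.
raw = bytes([
    64, 64, 64, 0, 0, 0, 0, 0,
    64, 66, 179, 51, 51, 51, 51, 51,
    64, 65, 12, 204, 204, 204, 204, 205,
])

# Each entry is one big-endian 8-byte double, with no offsets in between.
n_values = len(raw) // 8
values = list(struct.unpack(f">{n_values}d", raw))

print(n_values, values)
```

The ">" prefix selects big-endian byte order, which matches the ">f8" dtype used in the view call.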

It seems to be something about empty TBaskets, since a similar example in which the first file is non-empty produces a merged file that ROOT reads correctly:

Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot
>>> import numpy as np
>>> f = uproot.recreate("uno.root")
>>> f["tree"] = {"branch": np.array([3.14], dtype=np.float64)}
>>> g = uproot.recreate("dos.root")
>>> g["tree"] = {"branch": np.array([32.5, 37.4, 34.1], dtype=np.float64)}
>>> 
% hadd test2.root uno.root dos.root 
hadd Target file: test2.root
hadd compression setting for all output: 1
hadd Source file 1: uno.root
hadd Source file 2: dos.root
hadd Target path: test2.root:/
% python
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> f = ROOT.TFile("test2.root")
>>> t = f.Get("tree")
>>> t.Scan()
************************
*    Row   * branch.br *
************************
*        0 *      3.14 *
*        1 *      32.5 *
*        2 *      37.4 *
*        3 *      34.1 *
************************
4

So:

  1. If a file has an empty TBasket, hadd takes that empty TBasket as-is.
  2. When Uproot sees a file with an empty TBasket, it concatenates it as an empty list.
  3. When ROOT sees a file with an empty TBasket, it does something that overlays a zero on the first entry of the next basket.

Should empty TBaskets be allowed? Maybe they are not, and both hadd and ROOT (on read-back) assume that input files never contain one, whereas Uproot treats an empty TBasket as just an empty array to concatenate. That could be the cause of a mismatch in assumptions.

Is that the problem here? When there is no data, should Uproot skip writing a TBasket altogether instead of writing an empty one?
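To make that question easier to test on other files, here is a minimal sketch of a consistency check. The helper check_baskets is hypothetical (it is not part of Uproot or ROOT). It takes plain Python lists, which for a float64 branch could plausibly be built from branch.entry_offsets and len(branch.basket(i).data) // 8, though that mapping is an assumption.

```python
def check_baskets(entry_offsets, basket_lengths):
    """Report empty baskets and baskets whose size disagrees with the offsets.

    entry_offsets: cumulative entry numbers, one more item than baskets.
    basket_lengths: number of values actually stored in each basket.
    (Hypothetical helper, written only for illustration.)
    """
    if len(entry_offsets) != len(basket_lengths) + 1:
        return ["entry_offsets must have exactly one more item than basket_lengths"]

    problems = []
    for i, n_values in enumerate(basket_lengths):
        expected = entry_offsets[i + 1] - entry_offsets[i]
        if n_values == 0:
            problems.append(f"basket {i} is empty")
        if n_values != expected:
            problems.append(
                f"basket {i} holds {n_values} values but offsets claim {expected}"
            )
    return problems


# An empty first basket that the offsets describe consistently.
consistent_but_empty = check_baskets([0, 0, 3], [0, 3])

# Offsets that claim one entry in a basket that holds none.
inconsistent = check_baskets([0, 1, 3], [0, 3])

# No empty baskets at all, as in the uno.root + dos.root merge.
clean = check_baskets([0, 1, 4], [1, 3])

print(consistent_but_empty, inconsistent, clean)
```

A reader that simply concatenates basket contents would accept the first case unchanged, while a reader that trusts the offsets would look for an entry that is not there in the second case.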

jpivarski commented 8 months ago

@JohanWulff, did you produce these files in a way that is different from how I made one.root and two.root?