Closed gordonwatts closed 4 months ago
Here is a small ROOT file from the system; let's see how to figure out what compression it uses. sx_output.zip
import uproot

with uproot.open("sx_output.root") as f:
    tree = f["atlas_xaod_tree"]
    for branch in tree.keys():
        print(tree[branch].compression, branch)

    # read for long enough (given the small file) so it also appears in a profiler
    for _ in range(100):
        tree.arrays(tree.keys())
All branches in this file show up as ZLIB(1). As a sanity check I also ran the code above through pyinstrument, and the thing I see come up in the stack is _DecompressZLIB.decompress in uproot. Then, to be fully sure, I remembered we have a nice script written by @jpivarski as the ultimate source of truth:
import sys

import uproot

filename, treename, branchname_filter = sys.argv[1:4]

lookup = {
    uproot.ZLIB._2byte: uproot.ZLIB.name,
    uproot.LZMA._2byte: uproot.LZMA.name,
    uproot.LZ4._2byte: uproot.LZ4.name,
    uproot.ZSTD._2byte: uproot.ZSTD.name,
    b"CS": "ancient!",
}

with uproot.open(filename) as file:
    tree = file[treename]
    for branchname, branch in tree.items(filter_name=branchname_filter):
        print(f"{branchname = }")
        basket_seek = branch.member("fBasketSeek")[: branch._num_normal_baskets]
        for i, start in enumerate(basket_seek):
            if branch.basket_compressed_bytes(i) == branch.basket_uncompressed_bytes(i):
                print(f" {i} uncompressed")
            else:
                stop = start + uproot.reading._key_format_big.size
                key_cursor = uproot.source.cursor.Cursor(start)
                key_chunk = tree.file.source.chunk(start, stop)
                key = uproot.reading.ReadOnlyKey(
                    key_chunk, key_cursor, {}, tree.file, branch, read_strings=False
                )
                data_cursor = uproot.source.cursor.Cursor(start + key.fKeylen)
                data_chunk = tree.file.source.chunk(
                    data_cursor.index, data_cursor.index + 2
                )
                print(
                    f" {i} compressed with {lookup[data_chunk.raw_data.tobytes()]}"
                )
and running
python check-TBasket-compression.py sx_output.root atlas_xaod_tree "jet*"
confirms ZLIB compression.
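For anyone curious what the script is keying on: as far as I understand it, each compressed block in a ROOT file starts with a short header whose first two bytes tag the algorithm, and the lookup table above maps those tags to names. Below is a minimal, uproot-free sketch of the same idea. The function name `describe_block_header` and the 9-byte layout (2-byte tag, 1 method byte, then two 3-byte little-endian sizes) are my assumptions based on how uproot parses these headers, and the header in the usage note is synthetic rather than read from sx_output.root.

```python
# Two-byte tags found at the start of a ROOT compressed block. This mirrors the
# lookup table in the script above, written out as literal bytes.
MAGIC = {
    b"ZL": "ZLIB",
    b"XZ": "LZMA",
    b"L4": "LZ4",
    b"ZS": "ZSTD",
    b"CS": "ancient!",
}


def describe_block_header(header: bytes) -> dict:
    """Decode the first 9 bytes of a compressed block (assumed layout).

    Assumption: bytes 0-1 are the algorithm tag, byte 2 is a method/level byte,
    bytes 3-5 and 6-8 are the compressed and uncompressed sizes recorded in the
    header, each as a 3-byte little-endian integer.
    """
    if len(header) < 9:
        raise ValueError("need at least 9 bytes of block header")
    return {
        "algorithm": MAGIC.get(header[:2], "unknown"),
        "method_byte": header[2],
        "compressed_bytes": int.from_bytes(header[3:6], "little"),
        "uncompressed_bytes": int.from_bytes(header[6:9], "little"),
    }
```

Usage with a made-up header: `describe_block_header(b"ZL\x08" + (100).to_bytes(3, "little") + (250).to_bytes(3, "little"))` reports the algorithm as ZLIB with the two sizes decoded. In the real script the bytes come from `data_chunk.raw_data.tobytes()`, which only reads the first two.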
An amazingly thorough response, thanks @alexander-held !!!
Given that it looks like the compression algorithm has a large effect on performance (CPU, disk space), what algorithm is being used by ATLAS's EventLoop ntuple output code?