iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

What compression algorithm is being used by EventLoop's default ROOT output? #79

gordonwatts closed 4 months ago

gordonwatts commented 4 months ago

Given that the compression algorithm appears to have a large effect on performance (CPU, disk space), which algorithm is being used by ATLAS's EventLoop ntuple output code?

gordonwatts commented 4 months ago

Here is a small ROOT file from the system. Let's see how to figure out what the compression is here: sx_output.zip

alexander-held commented 4 months ago

import uproot

with uproot.open("sx_output.root") as f:
    tree = f["atlas_xaod_tree"]
    for branch in tree.keys():
        print(tree[branch].compression, branch)

    # read for long enough (given the small file) so it also appears in a profiler
    for _ in range(100):
        tree.arrays(tree.keys())

All branches in this file show up as ZLIB(1), i.e. ZLIB at compression level 1. As a sanity check, I also ran the code above through pyinstrument, and the decompression call that comes up in the stack is _DecompressZLIB.decompress in uproot. Then, to be fully sure, I remembered we have a nice script written by @jpivarski that serves as the ultimate source of truth:

import sys

import uproot

filename, treename, branchname_filter = sys.argv[1:4]

lookup = {
    uproot.ZLIB._2byte: uproot.ZLIB.name,
    uproot.LZMA._2byte: uproot.LZMA.name,
    uproot.LZ4._2byte: uproot.LZ4.name,
    uproot.ZSTD._2byte: uproot.ZSTD.name,
    b"CS": "ancient!",
}

with uproot.open(filename) as file:
    tree = file[treename]
    for branchname, branch in tree.items(filter_name=branchname_filter):
        print(f"{branchname = }")
        basket_seek = branch.member("fBasketSeek")[: branch._num_normal_baskets]
        for i, start in enumerate(basket_seek):
            if branch.basket_compressed_bytes(i) == branch.basket_uncompressed_bytes(i):
                print(f"    {i} uncompressed")
            else:
                stop = start + uproot.reading._key_format_big.size

                key_cursor = uproot.source.cursor.Cursor(start)
                key_chunk = tree.file.source.chunk(start, stop)
                key = uproot.reading.ReadOnlyKey(
                    key_chunk, key_cursor, {}, tree.file, branch, read_strings=False
                )

                data_cursor = uproot.source.cursor.Cursor(start + key.fKeylen)
                data_chunk = tree.file.source.chunk(
                    data_cursor.index, data_cursor.index + 2
                )

                print(
                    f"    {i} compressed with {lookup[data_chunk.raw_data.tobytes()]}"
                )

and running

python check-TBasket-compression.py sx_output.root atlas_xaod_tree "jet*"

confirms ZLIB compression.
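
For reference, the two bytes the script looks up are the start of a short header that precedes each compressed block in a ROOT file. Below is a minimal sketch of decoding that header by hand, assuming the 9-byte layout as I understand it from uproot's reader: a 2-byte algorithm tag, one method byte, then the compressed and uncompressed sizes as 3-byte little-endian integers. The names `parse_block_header` and `ALGORITHMS` are hypothetical helpers for illustration, not part of uproot, and the example bytes are synthetic rather than taken from sx_output.root.

```python
import struct

# Assumed layout of the 9-byte header at the start of a compressed block:
# 2-byte algorithm tag, 1 method byte, 3-byte compressed size,
# 3-byte uncompressed size (sizes are little-endian).
_HEADER = struct.Struct("2sB3s3s")

# Tags matching the keys of the lookup table in the script above.
ALGORITHMS = {
    b"ZL": "ZLIB",
    b"XZ": "LZMA",
    b"L4": "LZ4",
    b"ZS": "ZSTD",
    b"CS": "old ROOT algorithm",
}


def parse_block_header(raw: bytes) -> dict:
    """Decode the first 9 bytes of a compressed block (hypothetical helper)."""
    tag, method, csize, usize = _HEADER.unpack(raw[: _HEADER.size])
    return {
        # Fall back to a readable marker instead of raising on unknown tags.
        "algorithm": ALGORITHMS.get(tag, f"unknown ({tag!r})"),
        "method": method,
        "compressed_bytes": int.from_bytes(csize, "little"),
        "uncompressed_bytes": int.from_bytes(usize, "little"),
    }


if __name__ == "__main__":
    # Synthetic header, built by hand purely for illustration.
    example = (
        b"ZL"
        + bytes([8])
        + (1000).to_bytes(3, "little")
        + (4096).to_bytes(3, "little")
    )
    print(parse_block_header(example))
```

With the script above, the bytes to feed in would be the ones read at start + key.fKeylen, extended from 2 to 9 bytes. Using `.get` with a fallback means an unexpected tag is reported rather than raising a KeyError.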

gordonwatts commented 4 months ago

An amazingly thorough response, thanks @alexander-held!!!