scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.

Processing multiple root files #539

Open ico1036 opened 3 years ago

ico1036 commented 3 years ago

What is the most efficient way to deal with multiple ROOT files (~100 GB) in uproot3 and uproot4? I cannot find a tutorial about this.

I tried lazy arrays, but they take a lot of time.

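import glob

import awkward as ak  # assumed: `ak` below refers to Awkward Array
import uproot  # uproot3 API (ArrayCache, lazyarrays)

# PATH
dir_path = "/x4/cms/dylee/Delphes/data/root/signal/*/*.root"
file_list = glob.glob(dir_path)

# IO
cache = uproot.ArrayCache("2 GB")
events = uproot.lazyarrays(file_list, "Delphes", ["Electron*", "Muon*", "Photon*", "MissingET*"], cache=cache)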

# Define Particle arrays
Electron = ak.zip(
    {
        "PT": events["Electron.PT"],
        "Eta": events["Electron.Eta"],
        "Phi": events["Electron.Phi"],
        "T": events["Electron.T"],
        "Charge": events["Electron.Charge"],
    }
)

I also tried the iterator, but I'm not sure whether this loop-based method is efficient (https://github.com/JW-corp/J.W_Analysis/blob/main/Uproot/test/big_data.py).

Thanks.

jpivarski commented 3 years ago

Lazy arrays are good for interactive exploration, but the most efficient way to process multiple files with Uproot only is uproot.iterate (because it ensures that only a manageable amount of data is in memory at once).

I say "using Uproot only" because if you have a very large number of files, you'll want to distribute the job and run it in parallel. Uproot doesn't do that (as it's strictly an I/O library). Coffea Processors are a convenient way to do it in HEP.
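
For illustration, here is a minimal sketch of the batch-wise pattern described above. It assumes the Uproot 4 (or later) form of uproot.iterate. The helper names iter_batches and count_events, the "100 MB" step size, and the choice to count events are illustrative assumptions, not code from this thread. The tree name and branch filters are taken from the question.

```python
import glob


def iter_batches(file_pattern, tree_name="Delphes", step_size="100 MB"):
    """Yield one batch of events at a time from every file matching file_pattern.

    Hypothetical helper: assumes Uproot 4+, where uproot.iterate accepts a
    {path: tree_name} dict, a filter_name list of wildcards, and a step_size.
    """
    import uproot  # imported lazily so count_events below works without Uproot

    files = {path: tree_name for path in sorted(glob.glob(file_pattern))}
    yield from uproot.iterate(
        files,
        filter_name=["Electron*", "Muon*", "Photon*", "MissingET*"],
        step_size=step_size,  # illustrative value; tune to the available memory
    )


def count_events(batches):
    """Accumulate a running total; only the current batch needs to be in memory."""
    total = 0
    for batch in batches:
        total += len(batch)  # len() of a batch is its number of events
    return total
```

A call such as count_events(iter_batches("/x4/cms/dylee/Delphes/data/root/signal/*/*.root")) would then walk through all files batch by batch. In a real analysis the loop body would fill histograms or other small accumulators instead of just counting, so that memory use stays bounded by the batch size rather than by the total dataset size.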

ico1036 commented 3 years ago

Thank you very much! I tested this script and got the following results: