Closed: sznajder closed this issue 1 year ago.
Hello @sznajder. Thank you for the interest and apologies for the slow response.
Can you motivate your question? One does not typically chain the "raw" outputs of MC programs, hence the question. Would it not be better, and also more flexible, to retrieve the info you want from every single LHE file, say into awkward arrays (likely with vectors inside), and then merge at that level? Or save to ROOT files and chain there?
Just thinking/asking out loud.
Otherwise "anything" is possible. You could even have a go if interested.
Hi Eduardo, I'm interested in running a parton-level analysis using LHE files generated with Madgraph. As the jobs were generated in batches of 1K events, I have hundreds of files, and it would be nice to be able to chain them. One can always merge the LHE files before reading them with pylhe or, as you mentioned, merge the awkward arrays. Indeed, I'm currently merging the awkward arrays in my code. Cheers, Andre
On 27 Jun 2023, at 18:00, Eduardo Rodrigues @.***> wrote: […]
OK, understood.
Yes, merging LHE files seems tricky to me, but I would welcome comments from @matthewfeickert and @lukasheinrich.
Given the route you are following, which seems flexible to me, would you be interested in contributing a little notebook example inspired by what you are doing? I'm sure that would be useful to others. We're happy to help with getting a few test files into scikit-hep-testdata and with the PR itself. BTW, we do acknowledge contributors nicely, see https://github.com/scikit-hep/pylhe/blob/main/README.md#contributors :).
I've also run into this issue when trying to merge many LHE files into one sample for later analysis, and I've come upon two solutions that work with the current pylhe.
1. Since LHE is a simple, text-based format, you can probably merge LHE files by simply copying the `<event>...</event>` blocks from one file onto the end of another. You could do this manually or with some shell-scripting nonsense. The reason this is risky (and why I switched to option 2 below) is that it drops all of the init/header information from the LHE files you are copying from. At best this loses which random seed was used; at worst it loses information about different configurations.
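For illustration, here is a minimal, standard-library-only sketch of such a naive splice; `merge_event_blocks` is a hypothetical helper, not part of pylhe, and it deliberately exhibits the caveat above: only the `<event>` blocks of the extra files survive, their `<init>`/header info is dropped.

```python
import re

# Matches one complete <event>...</event> block, including newlines.
EVENT_RE = re.compile(r"<event>.*?</event>", re.DOTALL)


def merge_event_blocks(base_text, extra_texts):
    """Return base_text with all <event> blocks from extra_texts spliced
    in just before the closing </LesHouchesEvents> tag.

    WARNING: the <init>/header blocks of the extra files are discarded.
    """
    blocks = []
    for text in extra_texts:
        blocks.extend(EVENT_RE.findall(text))
    closing = "</LesHouchesEvents>"
    head, sep, _ = base_text.rpartition(closing)
    if not sep:
        raise ValueError("base text is not a complete LHE document")
    return head + "\n".join(blocks) + "\n" + closing
```

In practice one would stream the files rather than hold them in memory, but the structure of the splice is the same.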
2. Awkward arrays can be written to a compressed parquet file. This means one part of your analysis can merge the LHE files once, and all subsequent steps just read the merged array from the parquet file. Reading from parquet is much faster than parsing LHE, so this beats re-reading the individual LHE files and merging every time; you only need to redo the (slow) LHE read when you have new or changed files to include in the merged array. Generally, my code looks like
```python
# get an array for each file
unmerged_arrays = [pylhe.to_awkward(pylhe.read_lhe_with_attributes(f)) for f in list_of_input_files]
# merge arrays into a single mega-array
array = ak.concatenate(unmerged_arrays)
# store the merged array in a cache parquet file
ak.to_parquet(array, 'merged.parquet')
# any later analysis code can retrieve the array with ak.from_parquet('merged.parquet')
```
This pattern is so useful, and reading from parquet is so much faster than reading from LHE, that I actually keep a parquet cache file for each of the LHE files individually. The parquet file is written if it doesn't exist yet, or if the original LHE file it is based on has been modified since the cache was written. This still drops the header information, but it leaves the individual LHE files untouched, so you can still go back and look up the header/init information if needed.
```python
import os

import awkward as ak
import pylhe


def _parquet_cache(lhe_fp):
    """Determine the parquet cache file name by replacing the LHE extension
    (handling both '.lhe' and '.lhe.gz')."""
    return os.path.splitext(os.path.splitext(lhe_fp)[0])[0] + '.parquet'


def _from_pylhe(lhe_fp):
    """Read an LHE file into an awkward array in memory."""
    return pylhe.to_awkward(pylhe.read_lhe(lhe_fp))


def convert_to_parquet(lhe_fp):
    """Convert the input LHE file into a parquet file of the same name and
    location, but with the extension updated.

    Converting the LHE file to a parquet file is beneficial because the
    resulting parquet file is about the same size as the gzipped LHE file,
    but it offers about two orders of magnitude speed-up when reading the
    data back into an awkward array in memory.

    Parameters
    ----------
    lhe_fp : str
        path to LHE file to convert
    """
    ak.to_parquet(_from_pylhe(lhe_fp), _parquet_cache(lhe_fp))


def from_lhe(filepath, *, parquet_cache=True):
    """Load an awkward array of the events in the passed LHE file.

    Parameters
    ----------
    filepath : str
        Path to LHE file to load
    parquet_cache : bool, optional
        If true, use a parquet file alongside the LHE file to cache the
        parsing. The cache is refreshed whenever the LHE file's timestamp
        is newer than the parquet cache's. If false, never use a cache.
    """
    # the input file must exist
    if not os.path.exists(filepath):
        raise ValueError(f'Input LHE file {filepath} does not exist.')
    # leave early, without even thinking about a cache, if the user doesn't want one
    if not parquet_cache:
        return _from_pylhe(filepath)
    # if the cache doesn't exist, or its last modification time is earlier
    # than that of the original LHE file, (re)create the cache file
    cache_fp = _parquet_cache(filepath)
    if not os.path.exists(cache_fp) or os.path.getmtime(cache_fp) < os.path.getmtime(filepath):
        convert_to_parquet(filepath)
    # load the data from the cache
    return ak.from_parquet(cache_fp)
```
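The freshness test at the heart of `from_lhe` can be factored out on its own; here is a minimal sketch (`cache_is_stale` is a hypothetical helper name, not part of pylhe):

```python
import os


def cache_is_stale(cache_fp, src_fp):
    """True if the parquet cache needs (re)building: either it does not
    exist, or the source LHE file was modified after the cache was written."""
    return (not os.path.exists(cache_fp)
            or os.path.getmtime(cache_fp) < os.path.getmtime(src_fp))
```

With this pattern in place, chaining many files reduces to something like `ak.concatenate([from_lhe(f) for f in files])`, re-parsing only the files whose caches are stale.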
Hello @tomeichlersmith. Thank you for sharing 👍!
Yes, agreed on the drawback of solution 1. As for solution 2, I would make the same suggestion as above: others may find it useful, so would you be willing to write a notebook to add to the repo for future reference? You can even point to this discussion and present the notebook as one solution; others can be added later. A notebook on "Dealing with multiple LHE files" seems like a good idea.
Any other thoughts @matthewfeickert?
Is there a way to read a list of LHE files, like a TChain of ROOT files? If not, would it be possible to implement this feature in pylhe?