scikit-hep / pylhe

Lightweight Python interface to read Les Houches Event (LHE) files
https://pypi.org/project/pylhe/
Apache License 2.0

Read a list of LHE files with Pylhe? #203

Closed sznajder closed 1 year ago

sznajder commented 1 year ago

Is there a way to read a list of LHE files, like a TChain of ROOT files? If not, would it be possible to implement this feature in Pylhe?

eduardo-rodrigues commented 1 year ago

Hello @sznajder. Thank you for the interest and apologies for the slow response.

Can you motivate your question? One does not typically chain "raw" outputs of MC programs, hence the question. Would it not be better, and also more flexible, to retrieve the info you want from every single LHE file, say into awkward arrays (likely with vectors in them), and then merge at that level? Or save to ROOT files and chain there?

Just thinking/asking out loud.

Otherwise "anything" is possible. You could even have a go if interested.

sznajder commented 1 year ago

Hi Eduardo, I’m interested in running a parton-level analysis using LHE files generated with Madgraph. As the jobs were generated in batches of 1K events, I have hundreds of files and it would be nice to be able to chain them. One can always merge the LHE files before reading them with PyLHE or, as you mentioned, merge the awkward arrays. Indeed, I’m currently merging the awkward arrays in my code. Cheers, Andre

eduardo-rodrigues commented 1 year ago

OK, understood.

Yes, merging LHE files seems tricky to me, but I would welcome comments from @matthewfeickert and @lukasheinrich.

Given the route you are following, which seems flexible to me, would you be interested in contributing a little notebook example inspired by what you are doing? I'm sure that will be useful to others. We're happy to help out with getting a few test files into scikit-hep-testdata and with the PR itself. BTW, we do acknowledge contributors nicely, see https://github.com/scikit-hep/pylhe/blob/main/README.md#contributors :).

tomeichlersmith commented 1 year ago

I've also run into this issue when trying to merge many LHE files into one sample for later analysis, and I've come upon two solutions that work with the current pylhe.

1. Merge LHE Files Prior to Processing

Since LHE is a simple, text-based format, you can probably merge LHE files by simply copying the <event>...</event> blocks from one file onto the end of another. You could do this manually or with some shell-scripting nonsense. The reason this is risky, and why I switched to option (2) below, is that it drops all of the init/header information from the LHE files you are copying from. At best this loses information about which random seed was used; at worst it loses information about different configurations.
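
For concreteness, here is a minimal, hypothetical sketch of such a naive merge in Python (the naive_merge name and its arguments are made up for illustration; it assumes plain-text, non-gzipped LHE files and keeps only the first file's header/init block, which is exactly the information loss described above):

import re
from pathlib import Path

def naive_merge(lhe_files, out_path):
    """Keep the first file's header and append only the <event> blocks of the rest."""
    event_re = re.compile(r"<event>.*?</event>", re.DOTALL)
    first = Path(lhe_files[0]).read_text()
    with open(out_path, "w") as out:
        # everything up to the closing root tag: header/init plus the first file's events
        out.write(first[: first.rindex("</LesHouchesEvents>")])
        for fp in lhe_files[1:]:
            # append the bare event blocks; their header/init information is dropped
            for m in event_re.finditer(Path(fp).read_text()):
                out.write(m.group(0) + "\n")
        out.write("</LesHouchesEvents>\n")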

2. Merge the awkward arrays and then write a cache file

Awkward arrays can be written to a compressed parquet file. This means one part of your analysis can merge the LHE files, and subsequent steps just read the merged array from the parquet file. Reading from parquet is much faster than reading from LHE, so this will be faster than re-reading the individual LHE files and merging them every time. You only need to perform the (slow) re-read from LHE if there are different or new files you want included in the merged array. Generally, my code looks like

import awkward as ak
import pylhe

# get an array for each file (list_of_input_files is a list of LHE file paths)
unmerged_arrays = [pylhe.to_awkward(pylhe.read_lhe_with_attributes(f)) for f in list_of_input_files]
# merge the arrays into a single mega-array
array = ak.concatenate(unmerged_arrays)
# store the merged array in a cache parquet file
ak.to_parquet(array, 'merged.parquet')
# any analysis code below can retrieve the array using ak.from_parquet('merged.parquet')

This is such a useful pattern, and reading from parquet is so much faster than reading from LHE, that I actually have a parquet cache file for each of the LHE files individually. The parquet file is written if it doesn't exist yet or if the original LHE file it is based on has been modified since the cache was written. This still drops the header information, but it leaves the individual LHE files untouched, so you can still go back and look up the header/init information if needed.

import pylhe
import awkward as ak
import os

def _parquet_cache(lhe_fp) :
    """Determine the parquet cache file name by replacing the LHE extension"""
    return os.path.splitext(os.path.splitext(lhe_fp)[0])[0]+'.parquet'

def _from_pylhe(lhe_fp) :
    """read an LHE file into an awkward array in memory"""
    return pylhe.to_awkward(pylhe.read_lhe(lhe_fp))

def convert_to_parquet(lhe_fp) :
    """Convert the input LHE file into a parquet file of the same name and location
    but with the extension updated

    Converting the LHE file to a parquet file is beneficial because the resulting
    parquet file is about the same size as the gzipped LHE file but it offers about
    2 orders of magnitude speed up when reading the data back into an awkward array
    in memory.

    Parameters
    ----------
    lhe_fp : str
        path to LHE file to convert
    """

    ak.to_parquet(_from_pylhe(lhe_fp), _parquet_cache(lhe_fp))

def from_lhe(filepath, *, parquet_cache = True) :
    """Load an awkward array of the events in the passed LHE file

    Parameters
    ----------
    filepath : str
        Path to LHE file to load
    parquet_cache : bool, optional
        If true, use a parquet file alongside the LHE file to cache the parsing.
        This caching makes sure to update the cache if the LHE file timestamp is
        newer than the parquet cache timestamp. If false, never use a cache.
    """

    # need the file to exist
    if not os.path.exists(filepath) :
        raise ValueError(f'Input LHE file {filepath} does not exist.')

    # leave early without even thinking about cache if user doesn't want it
    if not parquet_cache :
        return _from_pylhe(filepath)

    # if cache doesn't exist or its last modification time is earlier than
    # the last modification time of the original LHE file, we need to create
    # the cache file
    cache_fp = _parquet_cache(filepath)
    if not os.path.exists(cache_fp) or os.path.getmtime(cache_fp) < os.path.getmtime(filepath) :
        convert_to_parquet(filepath)

    # load the data from the cache
    return ak.from_parquet(cache_fp)
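
With a helper like from_lhe above, chaining many LHE files then reduces to concatenating the per-file (cached) arrays - a usage sketch, where list_of_input_files stands in for your own list of LHE paths:

# build one merged array; only missing or stale parquet caches trigger a (slow) LHE re-parse
merged = ak.concatenate([from_lhe(fp) for fp in list_of_input_files])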

eduardo-rodrigues commented 1 year ago

Hello @tomeichlersmith. Thank you for sharing 👍!

Yes, agreed on the drawback of solution 1. As for solution 2, I would make the same suggestion as above - since others may find it useful, would you be willing to write a notebook to add to the repo for future reference? You can even point to this discussion and present the notebook as one solution; others can be added. Having a notebook on "Dealing with multiple LHE files" seems good.

Any other thoughts @matthewfeickert?