Following Jim's instructions, I was able to merge some parquet files that couldn't normally be merged via an ak.from_parquet -> ak.to_parquet sequence due to memory limitations:
In [8]: folder = "EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD"
In [9]: files = [f"{folder}/{x}" for x in os.listdir(folder) if x.endswith(".parquet")][:3]
In [10]: files
Out[10]:
['EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part000.parquet',
'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part001.parquet',
'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part002.parquet']
In [11]: folder
Out[11]: 'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD'
In [12]: def generate():
    ...:     for f in files:
    ...:         array = ak.from_parquet(f)
    ...:         yield array
    ...:         del array
    ...:
In [13]: ak.to_parquet_row_groups(generate(), f"{folder}.parquet")
Out[13]:
<pyarrow._parquet.FileMetaData object at 0x7fe93ec623b0>
created_by: parquet-cpp-arrow version 13.0.0
num_columns: 272
num_rows: 57038
num_row_groups: 3
format_version: 2.6
serialized_size: 0
I'm just posting this here because perhaps something along the lines of this logic could be implemented in hepconvert to make parquet merging a bit smarter.