scikit-hep / hepconvert

BSD 3-Clause "New" or "Revised" License

Smarter parquet merging to avoid memory issues #106

Open ikrommyd opened 1 month ago

ikrommyd commented 1 month ago

Hello,

Following Jim's instructions, I was able to merge some parquet files that couldn't be merged by a plain ak.from_parquet -> ak.to_parquet sequence due to memory limitations:

In [8]: folder = "EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD"

In [9]: files = [f"{folder}/{x}" for x in os.listdir(folder) if x.endswith(".parquet")][:3]

In [10]: files
Out[10]:
['EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part000.parquet',
 'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part001.parquet',
 'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD/NTuples-part002.parquet']

In [11]: folder
Out[11]: 'EGamma0_Run2024F-PromptEGMNano-v1_NANOAOD'

In [12]: def generate():
    ...:     for f in files:
    ...:         array = ak.from_parquet(f)
    ...:         yield array
    ...:         del array
    ...:

In [13]: ak.to_parquet_row_groups(generate(), f"{folder}.parquet")
Out[13]:
<pyarrow._parquet.FileMetaData object at 0x7fe93ec623b0>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 272
  num_rows: 57038
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 0

I'm just posting this here because something along these lines could perhaps be implemented in hepconvert to make its parquet merging a bit smarter.