legend-exp / pygama

Python package for data processing and analysis
https://pygama.readthedocs.io
GNU General Public License v3.0

DataLoader performance is bad? #521

Open patgo25 opened 10 months ago

patgo25 commented 10 months ago

At the analysis workshop, it was argued that the slowness of the DataLoader was due only to I/O speed. When I loaded the data with "low-level routines", I never felt it was that slow. To test this quantitatively, I wrote a quick script comparing low-level loading with DataLoader loading.

The script is:

import lgdo.lh5_store as store
from legendmeta import LegendMetadata
import re
import glob
import numpy as np
import time
from pygama.flow import DataLoader

def low_level_load(period="p03",run="r000",ndets=10):
    start = time.time()
    meta_path = "/data2/public/prodenv/prod-blind/ref/v01.06/inputs"
    f_hit = sorted(glob.glob("/data2/public/prodenv/prod-blind/ref/v01.06/generated/tier/hit/phy/"+period+"/"+run+"/*.lh5"))
    f_tcm = [e.replace("hit", "tcm") for e in f_hit]
    dt_files = time.time() - start
    print(f"time to located files: \t {dt_files:.3f} s")

    lmeta = LegendMetadata(path=meta_path)
    chmap = lmeta.channelmap(re.search(r"\d{8}T\d{6}Z", f_hit[0]).group(0))
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    # load the TCM to define events: the per-file arrays are concatenated, so
    # split them wherever cumulative_length resets (i.e. at file boundaries)
    nda = store.load_nda(f_tcm, ["array_id", "array_idx", "cumulative_length"], "hardware_tcm_1/")
    clt = nda["cumulative_length"]
    split = clt[np.diff(clt, append=[clt[-1]]) < 0]
    ids = np.split(nda["array_id"], np.cumsum(split))
    idx = np.split(nda["array_idx"], np.cumsum(split))
    dt_tcm = time.time() - start - dt_files - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    for ch in geds:
        # per-file hit-table indices for this channel, taken from the TCM
        idx_ch = [idx[i][ids[i] == ch] for i in range(len(idx))]
        nda = store.load_nda(f_hit, ["cuspEmax_ctc_cal", "AoE_Classifier"], f"ch{ch}/hit/", idx_list=idx_ch)
        mask = nda["cuspEmax_ctc_cal"] > 25
        nda["AoE_Classifier"][mask]  # apply the energy cut (result discarded, timing only)

    dt_end = time.time() - start - dt_files - dt_meta - dt_tcm
    print(f"time to load data: \t {dt_end:.3f} s")

def data_loader_load(period="p03",run="r000",ndets=10):
    start = time.time()
    prodenv = "/data2/public/prodenv"
    dl = DataLoader(f"{prodenv}/prod-blind/ref/v01.06[setups/l200/dataloader]")
    file_query = f"period == '{period}' and run == '{run}' and datatype == 'phy'"
    dl.reset()
    dl.data_dir = prodenv
    dl.set_files(file_query)
    dt_files = time.time() - start
    print(f"time to located files: \t {dt_files:.3f} s")

    first_key = dl.get_file_list().iloc[0].timestamp
    lmeta = LegendMetadata(f"{prodenv}/prod-blind/ref/v01.06/inputs")
    chmap = lmeta.channelmap(on=first_key)  # get the channel map
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dl.set_datastreams(geds, "ch")
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    dl.set_cuts({"hit": "trapEmax_ctc_cal > 25"})
    dl.set_output(columns=["AoE_Classifier"], fmt="lgdo.Table")
    geds_el = dl.build_entry_list(tcm_level="tcm")
    dt_tcm = time.time() - start - dt_files - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    dl.load(geds_el)
    dt_end = time.time() - start - dt_files - dt_meta - dt_tcm
    print(f"time to load data: \t {dt_end:.3f} s")

if __name__ == "__main__":
    print("Try low level:")
    low_level_load()
    print("Try Data loader:")
    data_loader_load()

First, I would ask whether you can spot any unfair treatment of one routine or the other (a small cross-check sketch follows the list below). Both routines should:

  1. Use data from period 03 run 000 for 10 detectors
  2. Set a cut on the energy > 25 keV
  3. Load the AoE values for the 10 geds where the cut is valid according to the TCM
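
To make the fairness check concrete, here is a minimal sketch (not part of the benchmark, and using the same lgdo.lh5_store.load_nda call as the script above) that counts how many hits pass the energy cut per detector; either routine's output can be compared against these reference counts. count_passing_hits and idx_by_ch are hypothetical names introduced only for this illustration.

import numpy as np
import lgdo.lh5_store as store

def count_passing_hits(f_hit, geds, idx_by_ch, emin=25):
    # idx_by_ch: {rawid: per-file index lists}, built from the TCM exactly as
    # idx_ch is built inside low_level_load (hypothetical helper input)
    counts = {}
    for ch in geds:
        nda = store.load_nda(f_hit, ["cuspEmax_ctc_cal"], f"ch{ch}/hit/", idx_list=idx_by_ch[ch])
        counts[ch] = int(np.sum(nda["cuspEmax_ctc_cal"] > emin))
    return counts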

Running the script on LNGS gives:

> python speed_test.py 
Try low level:
time to locate files:    0.003 s
time to load meta:       0.931 s
time to load tcm:        1.605 s
time to load data:       10.845 s
Try Data loader:
time to locate files:    0.918 s
time to load meta:       0.763 s
time to load tcm:        172.463 s
time to load data:       7.972 s
jasondet commented 10 months ago

what is going on with the tcm loading!

okay thanks for testing, will check this asap.

SamuelBorden commented 10 months ago

It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and https://github.com/h5py/h5py/issues/1597

We're going to try the same workaround in read_object in order to speed up calls when an idx is provided.
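
For reference, here is a minimal sketch of the workaround discussed in those threads (this is not the actual read_object implementation): read the contiguous slab that bounds the requested indices in a single h5py call, then apply the scattered selection in numpy. The names read_indexed, filename, dataset and idx are placeholders.

import numpy as np
import h5py

def read_indexed(filename, dataset, idx):
    idx = np.asarray(idx)
    with h5py.File(filename, "r") as f:
        dset = f[dataset]
        lo, hi = int(idx.min()), int(idx.max()) + 1
        buf = dset[lo:hi]  # one contiguous HDF5 read
    return buf[idx - lo]   # non-contiguous selection done in memory

The trade-off is extra memory for the bounding slab, which is usually acceptable when the requested indices are dense within a file.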

gipert commented 10 months ago

Where do we do indexed reading in the data loader other than in .load(), when an entry list is used?

jasondet commented 9 months ago

The idx read appears to have been a red herring. I found a factor of ~3 speed-up for the TCM read step (really: build_entry_list) in data_loader and another factor of ~2 in lgdo.table.get_dataframe. However, the overall read is still a factor of ~3 slower than Patrick's low-level read:

Singularity> python kraus_speed_test.py
Try low level:
time to locate files:    0.003 s
time to load meta:       1.050 s
time to load tcm:        1.632 s
time to load data:       9.426 s
Try Data loader:
time to locate files:    0.726 s
time to load meta:       0.891 s
time to load tcm:        26.584 s
time to load data:       7.684 s

There is a lot of complexity in data_loader related to handling multi-level cuts and other generalizations, but I hope it can be sped up further with some refactoring. It looks like there is still a lot going on in the inner loops.
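
As an illustration of the kind of inner-loop refactoring that might help (this is not the DataLoader internals, just the pattern Patrick's low-level script already uses): replace a per-row Python loop over TCM entries with a single vectorized numpy selection. The function names are made up for this sketch.

import numpy as np

def entries_for_channel_loop(array_id, array_idx, ch):
    out = []
    for i in range(len(array_id)):  # slow: one Python iteration per TCM row
        if array_id[i] == ch:
            out.append(array_idx[i])
    return np.array(out)

def entries_for_channel_vectorized(array_id, array_idx, ch):
    return array_idx[array_id == ch]  # same result in a single numpy operation

On millions of TCM rows the vectorized form is typically one to two orders of magnitude faster than the explicit loop.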

gipert commented 9 months ago

Nice! Performance of the LGDO conversion methods is also a topic for https://github.com/legend-exp/legend-pydataobj/pull/30 by @MoritzNeuberger.

Did anyone try profiling the code to spot straightforward bottlenecks?
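
One way to do that, as a sketch: run the slow step under cProfile and sort by cumulative time. Here dl is assumed to be a DataLoader configured exactly as in the benchmark script above.

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
geds_el = dl.build_entry_list(tcm_level="tcm")  # the slow step from the benchmark
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)  # top 20 calls by cumulative time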