hristo-vrigazov / mmap.ninja

Memory mapped numpy arrays of varying shapes
MIT License

Parallel batch collection is significantly slower #18

Open jzluo opened 3 months ago

jzluo commented 3 months ago

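The new RaggedMmap.from_indexable is significantly slower than the serial RaggedMmap.from_generator on the same data, and I am not sure why: about 2.59 s/it with n_jobs=4 versus about 63.87 it/s serially.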

n_jobs=4:

$ python lead_mmap.py
  0%|                   | 96/4774058 [04:12<3438:31:15,  2.59s/it]

serial from generator:

$ python lead_mmap.py
1617it [00:25, 63.87it/s]
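To put the two progress bars side by side, here is a rough back-of-the-envelope calculation. It assumes both bars count the same unit (one sample per iteration) and that the serial run covers the same 4774058 items; neither is confirmed by the output above, so treat the result as an estimate only.

```python
# Rates copied from the two progress bars in this report.
parallel_sec_per_item = 2.59   # n_jobs=4, from_indexable
serial_items_per_sec = 63.87   # serial, from_generator
total_items = 4774058          # total shown by the parallel bar (assumed shared)

serial_sec_per_item = 1.0 / serial_items_per_sec
slowdown = parallel_sec_per_item / serial_sec_per_item

# Projected wall-clock time for the whole dataset at each rate.
serial_hours = total_items * serial_sec_per_item / 3600
parallel_hours = total_items * parallel_sec_per_item / 3600

print(f"slowdown: ~{slowdown:.0f}x")
print(f"serial estimate: ~{serial_hours:.1f} h")
print(f"parallel estimate: ~{parallel_hours:.0f} h")
```

Under those assumptions the parallel path is roughly 165 times slower per item, which is about 21 hours serially versus more than 3400 hours in parallel.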

parallel setup:

class DataLoader:
    def __init__(self, paths, raw_leads=True, wavetype="median"):
        self.paths = paths
        self.raw_leads = raw_leads
        self.wavetype = wavetype

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        try:
            ecg_array = read_ecg_leads(path, raw_leads=self.raw_leads, wavetype=self.wavetype)
        except Exception as e:
            # Log the failing path, then re-raise: there is no array to return.
            logging.error(f"Error processing {path}: {e}")
            raise

        return ecg_array

dataloader = DataLoader(ecg_paths, raw_leads=True, wavetype="median")

memmap = RaggedMmap.from_indexable(
    '/home/jon/projects/ecg/ecg_db/ecg_mmaps/median_raw',
    dataloader,
    n_jobs=4,
    verbose=True,
    batch_size=24,
)

serial:

def lead_array_generator(ecg_paths, raw_leads=True, wavetype="median"):
    for ecg_path in ecg_paths:
        try:
            leads = read_ecg_leads(ecg_path, raw_leads, wavetype)
        except Exception as e:
            # Log and skip unreadable files instead of retrying forever.
            logging.error(f"Error processing {ecg_path}: {e}")
            continue
        yield leads

memmap = RaggedMmap.from_generator(
    '/home/jon/projects/ecg/ecg_db/ecg_mmaps/median_raw',
    sample_generator=lead_array_generator(ecg_paths, raw_leads=True, wavetype="median"),
    batch_size=24,
    verbose=True,
)
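
One way to narrow this down might be to time the data source on its own, without writing any memory map, so the cost of reading samples can be separated from the cost of the parallel machinery. The sketch below is only a diagnostic idea, not part of mmap_ninja: `measure_throughput` and `iter_indexable` are hypothetical helper names, and `fake_samples` is stand-in data to be replaced with the real DataLoader or generator.

```python
import itertools
import time


def measure_throughput(samples, limit=100):
    """Consume up to `limit` samples and return (count, items_per_second)."""
    start = time.perf_counter()
    count = sum(1 for _ in itertools.islice(samples, limit))
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def iter_indexable(indexable):
    """Walk an object that only defines __len__ and __getitem__, in order."""
    for idx in range(len(indexable)):
        yield indexable[idx]


# Stand-in data (hypothetical); swap in DataLoader(...) or lead_array_generator(...).
fake_samples = [[0.0] * 8 for _ in range(1000)]
count, rate = measure_throughput(iter_indexable(fake_samples), limit=200)
print(f"{count} samples at {rate:.1f} items/s")
```

If the indexable and the generator read at a similar rate here, the gap is more likely to come from how batches are dispatched to workers than from reading the files.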

hristo-vrigazov commented 3 months ago

Thanks for reporting this. I will investigate it as soon as I can.