Open jzluo opened 3 months ago
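The new .from_indexable is significantly slower than the serial .from_generator path for some reason: about 2.59 s per item with n_jobs=4, versus about 64 items per second from the generator.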
n_jobs=4:

    $ python lead_mmap.py
    0%| | 96/4774058 [04:12<3438:31:15, 2.59s/it]
serial from generator:

    $ python lead_mmap.py
    1617it [00:25, 63.87it/s]
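The two progress bars report rates in different units (seconds per item versus items per second), so a quick conversion helps show the size of the gap. The sketch below uses only the numbers printed above. It assumes the serial run would process the same 4,774,058 items as the parallel one, which the generator's progress bar does not confirm.

```python
# Rates copied from the two tqdm progress bars in this report.
parallel_s_per_it = 2.59      # .from_indexable, n_jobs=4
serial_it_per_s = 63.87       # .from_generator
total_items = 4_774_058       # total shown by the parallel run (assumed for both)

# Convert the parallel rate to items per second so the two are comparable.
parallel_it_per_s = 1.0 / parallel_s_per_it

# How many times slower the parallel path is per item.
slowdown = serial_it_per_s / parallel_it_per_s

# Projected wall-clock time for the full dataset, in hours.
serial_eta_h = total_items / serial_it_per_s / 3600
parallel_eta_h = total_items * parallel_s_per_it / 3600

print(f"parallel: {parallel_it_per_s:.3f} it/s, serial: {serial_it_per_s:.2f} it/s")
print(f"slowdown: about {slowdown:.0f}x")
print(f"projected time: serial {serial_eta_h:.1f} h, parallel {parallel_eta_h:.0f} h")
```

Under those assumptions the parallel path comes out roughly two orders of magnitude slower per item, which matches the multi-thousand-hour estimate in the first progress bar.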
parallel setup:

    import logging

    # read_ecg_leads, ecg_paths and RaggedMmap come from elsewhere in the script (not shown).

    class DataLoader:
        def __init__(self, paths, raw_leads=True, wavetype="median"):
            self.paths = paths
            self.raw_leads = raw_leads
            self.wavetype = wavetype

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            path = self.paths[idx]
            try:
                ecg_array = read_ecg_leads(path, raw_leads=self.raw_leads, wavetype=self.wavetype)
            except Exception as e:
                # Log the failing path, then re-raise: there is no array to return.
                logging.error(f"Error processing {path}: {e}")
                raise
            return ecg_array


    dataloader = DataLoader(ecg_paths, raw_leads=True, wavetype="median")
    memmap = RaggedMmap.from_indexable(
        '/home/jon/projects/ecg/ecg_db/ecg_mmaps/median_raw',
        dataloader,
        n_jobs=4,
        verbose=True,
        batch_size=24,
    )
serial:

    def lead_array_generator(ecg_paths, raw_leads=True, wavetype="median"):
        for ecg_path in ecg_paths:
            # Retry the same path until it reads successfully.
            while True:
                try:
                    yield read_ecg_leads(ecg_path, raw_leads, wavetype)
                    break
                except Exception as e:
                    logging.error(f"Error processing {ecg_path}: {e}")


    memmap = RaggedMmap.from_generator(
        '/home/jon/projects/ecg/ecg_db/ecg_mmaps/median_raw',
        sample_generator=lead_array_generator(ecg_paths, raw_leads=True, wavetype="median"),
        batch_size=24,
        verbose=True,
    )
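One way to narrow this down might be to time the two access patterns on their own, without RaggedMmap involved. If indexed access and generator access run at similar rates, the gap more likely comes from how the parallel path dispatches work than from the loader itself. The sketch below is only an illustration: `fake_read`, `IndexableLoader`, `path_generator` and `items_per_second` are hypothetical stand-ins, and `fake_read` does not touch any real ECG files.

```python
import time


def fake_read(path):
    # Hypothetical stand-in for read_ecg_leads: returns a small list per path.
    return [len(path)] * 8


class IndexableLoader:
    # Same shape as the DataLoader in the report: __len__ plus __getitem__.
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return fake_read(self.paths[idx])


def path_generator(paths):
    # Same shape as lead_array_generator, minus the retry loop.
    for path in paths:
        yield fake_read(path)


def items_per_second(iterable):
    # Consume the iterable and report how many items per second it produced.
    start = time.perf_counter()
    count = sum(1 for _ in iterable)
    elapsed = time.perf_counter() - start
    return count, count / max(elapsed, 1e-9)


paths = [f"ecg_{i}.xml" for i in range(1000)]
loader = IndexableLoader(paths)

n_indexed, rate_indexed = items_per_second(loader[i] for i in range(len(loader)))
n_generated, rate_generated = items_per_second(path_generator(paths))

print(f"indexable: {rate_indexed:.0f} it/s over {n_indexed} items")
print(f"generator: {rate_generated:.0f} it/s over {n_generated} items")
```

Swapping `fake_read` for the real reader on a few hundred paths would give a serial baseline for both access patterns to compare against the n_jobs=4 run.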
Thanks for reporting this. I will investigate it as soon as I can.