mne-tools / mne-bids-pipeline

Automatically process entire electrophysiological datasets using MNE-Python.
https://mne.tools/mne-bids-pipeline/
BSD 3-Clause "New" or "Revised" License

Slowdown with many subjects due to quadratic cost of the get_runs function #715

Closed agramfort closed 1 year ago

agramfort commented 1 year ago

@apmellot ran the pipeline on a dataset with 1400 subjects and it was very slow to get started. The quality step takes more than a day on the NFS disks... Profiling shows the bottleneck is in the get_runs function: it calls get_runs_all_subjects for every subject, so we appear to have quadratic complexity here.

It seems we need to change the logic or use more caching.
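A minimal sketch of the call pattern described above (function bodies are hypothetical stand-ins, not the pipeline's actual code): if get_runs rebuilds the full subject-to-runs mapping on every call, processing N subjects costs O(N^2).

```python
# Hypothetical sketch of the quadratic pattern: get_runs is called once per
# subject, and each call re-scans every subject.
def get_runs_all_subjects(subjects):
    # stand-in for one full pass over all subjects (O(N) lookups)
    return {s: ["run-01"] for s in subjects}

def get_runs(subjects, subject):
    # rebuilds the entire mapping on every call instead of reusing it
    return get_runs_all_subjects(subjects)[subject]

subjects = [f"sub-{i:04d}" for i in range(1400)]
runs = [get_runs(subjects, s) for s in subjects]  # ~1400 * 1400 lookups
```

Caching the result of get_runs_all_subjects (or computing it once up front) would bring this back to O(N).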

agramfort commented 1 year ago

Adding @functools.lru_cache(maxsize=None) to get_runs_all_subjects raises TypeError: unhashable type: 'types.SimpleNamespace'.

Indeed, types.SimpleNamespace is not hashable, just like a dict. What's annoying is that joblib.Memory manages to handle this easily.
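A minimal reproduction (the function body is a hypothetical stand-in): lru_cache stores results in a dict keyed by the call arguments, so every argument must be hashable, and SimpleNamespace is not.

```python
import functools
from types import SimpleNamespace

@functools.lru_cache(maxsize=None)
def get_runs_all_subjects(cfg):
    return cfg.subjects  # stand-in for the real work

cfg = SimpleNamespace(subjects=("01", "02"))
try:
    get_runs_all_subjects(cfg)
except TypeError as err:
    print(err)  # unhashable type: 'types.SimpleNamespace'
```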

@larsoner do you have any opinion on this?

larsoner commented 1 year ago

We could subclass SimpleNamespace using object_hash pretty easily, but a less hacky solution would be to not pass cfg directly, and instead pass just the cfg.subjects, cfg.exclude_subjects, etc. values that the function needs internally. The easy way to do that is to have get_runs_all_subjects call a _get_runs_all_subjects that takes those cfg.whatever values, and lru_cache that helper. That way we don't have to change any of our calls to get_runs_all_subjects, but we still end up with a function _get_runs_all_subjects with hashable inputs.