BUG: Time and memory inefficient concating in pandas on every case.

pwoznicki / AutoRadiomics

The easiest tool for experimenting with radiomics features.

Apache License 2.0

36 stars 9 forks source link

BUG: Time and memory inefficient concating in pandas on every case. #4

Closed laqua-stack closed 2 years ago

laqua-stack commented 2 years ago

In the feature extraction, we concat a pd.DataFrame for every case. AFAIK this construction of a pd.DataFrame leads to a new memory allocation (and copying) every time, which is highly memory inefficient. Especially, when parallelized on many CPUs, combined with the already memory intensive forking in joblib this can lead to OOM-Events (and is slow of course). Wouldn't it be more convenient to return only the feature set, that is currently processed. https://github.com/pwoznicki/AutoRadiomics/blob/e475893c566de057d742f32da5cb9ece23a44eb0/autorad/feature_extraction/extractor.py#L109-L115 These are subsequently collected in results anyways: https://github.com/pwoznicki/AutoRadiomics/blob/e475893c566de057d742f32da5cb9ece23a44eb0/autorad/feature_extraction/extractor.py#L135-L144

pwoznicki commented 2 years ago

Hi! I completely agree with you, that concatenation was unnecessary and in fact I believe .get_features_for_single_case() should rather return a simpler dict instead of a pandas.Series.

I worked on it a bit and have right now merged the changes into the main. Take a look and let me know in case something can be further improved. I'm sorry in advance that the API changed, i.e. num_threads is now n_jobs and is specified in the constructor, .run() replaced extract_features().

laqua-stack commented 2 years ago

Solved in #5