sensein / senselab

senselab is a Python package that simplifies building pipelines for analyzing biometric data (e.g., speech, voice, video).
http://sensein.group/senselab/
Apache License 2.0

Parallelize feats extraction with opensmile #181

Closed fabiocat93 closed 1 week ago

fabiocat93 commented 2 weeks ago

This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:

  • [x] Registering a custom serializer for opensmile.Smile objects
  • [x] Implementing a Pydra workflow for opensmile audio processing
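The fan-out shape of the workflow can be sketched independently of pydra. The following is a hypothetical illustration using `concurrent.futures` with a dummy `extract_features` function, not senselab's actual API (in the real pipeline, each worker would run opensmile on one audio file):

```python
from concurrent.futures import ProcessPoolExecutor

def extract_features(path: str) -> dict:
    # Hypothetical stand-in: in senselab this step would run opensmile
    # (e.g. Smile.process_file) on a single audio file inside the worker.
    return {"file": path, "n_features": 88}

def extract_all(paths: list, workers: int = 4) -> list:
    # Fan the per-file extraction out across worker processes. Everything
    # crossing the process boundary must be picklable, which is why the
    # opensmile.Smile object needs a custom serializer.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, paths))
```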

I have addressed your comments here and parallelized feature extraction with opensmile. I followed @wilke0818's suggestion to create a custom serializer for the opensmile.Smile object. I remembered trying this some time ago without success, but this time I had more time to study opensmile's documentation and got it to work. The issue was that opensmile.Smile includes a reference to its underlying process, and the serializer doesn't like that. By removing that reference, everything seems to work fine.
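The serializer idea can be sketched with the standard library's `copyreg` and a stand-in class (`Extractor` and `_reduce_extractor` are hypothetical names; opensmile itself is not imported here). The reducer keeps only the constructor arguments and drops the unpicklable process handle, which gets recreated on the other side:

```python
import copyreg
import pickle
import threading

class Extractor:
    """Stand-in for opensmile.Smile: config plus an unpicklable process handle."""
    def __init__(self, feature_set: str = "eGeMAPSv02"):
        self.feature_set = feature_set
        self._process = threading.Lock()  # unpicklable, like Smile's process reference

def _reduce_extractor(obj: Extractor):
    # Serialize only what is needed to rebuild the object; the process
    # reference is dropped and recreated by __init__ during unpickling.
    return (Extractor, (obj.feature_set,))

copyreg.pickle(Extractor, _reduce_extractor)

clone = pickle.loads(pickle.dumps(Extractor("ComParE_2016")))
print(clone.feature_set)  # -> ComParE_2016
```

Without the registered reducer, `pickle.dumps` would fail on the lock attribute, which mirrors the problem the real Smile object had.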

  • [ ] Improving the Pydra workflow for parselmouth audio processing (maybe by making parselmouth.Sound objects picklable)

@satra I followed your suggestion here to use cloudpickle to make parselmouth.Sound picklable, but unfortunately it didn't work out (it still raises TypeError: cannot pickle 'parselmouth.Sound' object). As an experiment, I created a wrapper around parselmouth.Sound (see here), but I honestly don't like this solution because

In case you want to try any alternative solutions or have ideas, please let me know.
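The wrapper idea can be sketched like this (`FakeSound` and `SoundWrapper` are hypothetical stand-ins, assuming a Sound can be rebuilt from its samples and sampling frequency): the wrapper pickles only the plain data via `__getstate__`/`__setstate__`, sidestepping the unpicklable native object.

```python
import pickle

class FakeSound:
    """Stand-in for parselmouth.Sound; __reduce__ raising TypeError mimics
    the real object's refusal to pickle."""
    def __init__(self, values, sampling_frequency):
        self.values = values
        self.sampling_frequency = sampling_frequency
    def __reduce__(self):
        raise TypeError("cannot pickle 'parselmouth.Sound' object")

class SoundWrapper:
    """Hypothetical wrapper: pickle only the raw samples and sampling rate,
    then rebuild the Sound on the other side of the process boundary."""
    def __init__(self, sound):
        self.sound = sound
    def __getstate__(self):
        return {"values": self.sound.values,
                "sampling_frequency": self.sound.sampling_frequency}
    def __setstate__(self, state):
        self.sound = FakeSound(state["values"], state["sampling_frequency"])

w = pickle.loads(pickle.dumps(SoundWrapper(FakeSound([0.0, 0.1], 16000))))
print(w.sound.sampling_frequency)  # -> 16000
```

The obvious downside (and perhaps why the author disliked it) is that every call site must unwrap the object, and any Sound state beyond the raw samples is lost in transit.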

satra commented 2 weeks ago

thanks @fabiocat93 for these enhancements and attempts. i think the parselmouth one is good enough for now, no need to try to make it more pickleable.

efficient parallelization is going to be a combined function of dataset diversity (number of samples x duration of samples), the types of features we will be extracting, and the resources (hardware, job scheduler, etc.) needed.

with the b2ai dataset i ran into many of these considerations (without even considering gpu options). so let's merge something like this in, and when we do the code review let's consider possible options for efficiency. also let's get feedback as people use this.

codecov-commenter commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.

Project coverage is 63.98%. Comparing base (113721a) to head (4593c61). Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/senselab/__init__.py | 60.00% | 2 Missing :warning: |
| ...rc/senselab/audio/tasks/features_extraction/api.py | 0.00% | 1 Missing :warning: |
| ...selab/audio/tasks/features_extraction/opensmile.py | 95.23% | 1 Missing :warning: |
| ...health_measurements/extract_health_measurements.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   60.24%   63.98%   +3.74%
==========================================
  Files         113      116       +3
  Lines        4017     4101      +84
==========================================
+ Hits         2420     2624     +204
+ Misses       1597     1477     -120
```

:umbrella: View full report in Codecov by Sentry.

satra commented 2 weeks ago

could you perhaps merge the other PR that i had (without a release) and then release it with this?

satra commented 1 week ago

@fabiocat93 - upgrade to latest pydra release to try. and do post what the issues are with cf.

satra commented 1 week ago

defaulting to serial makes sense

fabiocat93 commented 1 week ago

> @fabiocat93 - upgrade to latest pydra release to try.

Done.

> and do post what the issues are with cf.

While testing pydra with plugin="cf" and passing some torch.tensor objects as parameters to tasks, I encountered an issue where the workflow would hang forever. After troubleshooting with @wilke0818, we identified a workaround that (at least temporarily) resolves the problem:

```python
from multiprocessing import set_start_method

set_start_method("spawn", force=True)
```

satra commented 1 week ago

yes, i should have told you that (that's what i debugged over the weekend on linux). on macos spawn is the default; on linux it's fork. spawn will become the default across systems from python 3.14 onwards.

satra commented 1 week ago

see here:

https://github.com/sensein/b2aiprep/blob/1cc589789d54595ac4b767a7f0bfb9654268c8b0/src/b2aiprep/prepare/prepare.py#L252

btw, there were some weird issues where it would not work if placed in cli.py under `if __name__ == '__main__'`
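A minimal sketch of why placement matters (hypothetical names, plain multiprocessing rather than pydra): under "spawn", the child process re-imports the parent module, so any unguarded top-level code runs again in the child. Keeping the `set_start_method` call behind the `__main__` guard avoids that re-execution.

```python
import multiprocessing as mp

def child(q):
    q.put("hello from child")

def main():
    # Force "spawn" before any worker is created. Because spawn re-imports
    # this module in the child, this call must not sit at unguarded module
    # top level (e.g. in a cli.py), or it runs again in every child.
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    msg = q.get()
    p.join()
    return msg

if __name__ == "__main__":
    print(main())
```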

fabiocat93 commented 1 week ago

do you think we can merge now? @satra

satra commented 1 week ago

thank you @fabiocat93