Closed fabiocat93 closed 1 week ago
This PR introduces parallelization for feature extraction processes using opensmile and praat_parselmouth to improve performance on large datasets. Key changes include:
- [x] Registering a custom serializer for opensmile.Smile objects
- [x] Implementing a Pydra workflow for opensmile audio processing
I have addressed your comments here and parallelized feature extraction with opensmile. I followed @wilke0818's suggestion to create a custom serializer for the opensmile.Smile object. I remember trying this some time ago without success, but this time I had more time to study opensmile's documentation and got it to work. The issue was that opensmile.Smile holds a reference to the process, which the serializer cannot handle. With that reference removed, everything seems to work fine.
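For readers who want the general idea, here is a minimal sketch of serializing an object by its configuration only, dropping an unpicklable handle. It uses the standard-library `copyreg` module, which is not necessarily the mechanism this PR uses. `FeatureExtractor`, `_handle`, `_rebuild` and `_reduce` are hypothetical names, and the `threading.Lock` merely stands in for the process reference mentioned above.

```python
import copyreg
import pickle
import threading


class FeatureExtractor:
    """Hypothetical stand-in for an object such as opensmile.Smile."""

    def __init__(self, feature_set: str):
        self.feature_set = feature_set
        # A lock cannot be pickled; it plays the role of the process reference.
        self._handle = threading.Lock()


def _rebuild(feature_set: str) -> FeatureExtractor:
    # Re-create the object (with a fresh handle) from its plain configuration.
    return FeatureExtractor(feature_set)


def _reduce(obj: FeatureExtractor):
    # Serialize only the configuration and leave the handle out.
    return _rebuild, (obj.feature_set,)


# Register the reducer so pickle uses it for every FeatureExtractor instance.
copyreg.pickle(FeatureExtractor, _reduce)

# "eGeMAPSv02" is used here purely as a string label.
original = FeatureExtractor("eGeMAPSv02")
clone = pickle.loads(pickle.dumps(original))
```

Without the registered reducer, `pickle.dumps(original)` would fail on the lock. With it, the clone carries the same configuration and its own new handle.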
- [ ] Improving the Pydra workflow for parselmouth audio processing (maybe by making parselmouth.Sound objects picklable)
@satra I followed your suggestion here to use cloudpickle to make parselmouth.Sound picklable, but unfortunately it didn't work (it still says TypeError: cannot pickle 'parselmouth.Sound' object). As an experiment, I created a wrapper around parselmouth.Sound (see here), but I honestly don't like this solution because
If you want to try alternative solutions or have ideas, please let me know.
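A wrapper of the kind described above could look roughly like the following sketch. It is not the wrapper linked in this PR. `NativeSound` is a hypothetical stand-in for a type that refuses pickling, and `PicklableSound` is a hypothetical wrapper that pickles only the raw samples and sampling rate, then rebuilds the native object on demand.

```python
import pickle
from typing import List, Optional


class NativeSound:
    """Hypothetical stand-in for an extension type that refuses pickling."""

    def __init__(self, samples: List[float], sampling_rate: int):
        self.samples = samples
        self.sampling_rate = sampling_rate

    def __reduce_ex__(self, protocol):
        raise TypeError("cannot pickle 'NativeSound' object")

    def duration(self) -> float:
        return len(self.samples) / self.sampling_rate


class PicklableSound:
    """Hypothetical wrapper: pickles plain data, rebuilds the native object lazily."""

    def __init__(self, samples: List[float], sampling_rate: int):
        self.samples = list(samples)
        self.sampling_rate = sampling_rate
        self._native: Optional[NativeSound] = None

    @property
    def native(self) -> NativeSound:
        # Build the unpicklable object only when it is actually needed.
        if self._native is None:
            self._native = NativeSound(self.samples, self.sampling_rate)
        return self._native

    def __getstate__(self):
        # Exclude the cached native object from the pickled state.
        return {"samples": self.samples, "sampling_rate": self.sampling_rate}

    def __setstate__(self, state):
        self.samples = state["samples"]
        self.sampling_rate = state["sampling_rate"]
        self._native = None


wrapped = PicklableSound([0.0, 0.5, -0.5, 0.0], sampling_rate=2)
_ = wrapped.native  # populate the cache to show it is skipped when pickling
restored = pickle.loads(pickle.dumps(wrapped))
```

The trade-off is that the samples are copied on every round trip and the native object is rebuilt in each worker, which may matter for long recordings.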
thanks @fabiocat93 for these enhancements and attempts. i think the parselmouth one is good enough for now, no need to try to make it more pickleable.
efficient parallelization is going to be a combined function of dataset diversity (number of samples x duration of sample), the types of features we will be extracting, and the resources (hardware, job scheduler, etc.) available.
with the b2ai dataset i ran into many of these considerations (without even considering gpu options). so let's merge something like this in, and when we do the code review let's consider possible options for efficiency. also let's get feedback as people use this.
Attention: Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.
Project coverage is 63.98%. Comparing base (113721a) to head (4593c61). Report is 37 commits behind head on main.
could you perhaps merge the other PR that i had (without a release) and then release it with this?
@fabiocat93 - upgrade to latest pydra release to try. and do post what the issues are with cf.
defaulting to serial
makes sense
> @fabiocat93 - upgrade to latest pydra release to try.
Done.
> and do post what the issues are with cf.
While testing pydra with plugin="cf" and passing some torch.Tensor objects as parameters to tasks, I encountered an issue where the workflow would hang forever. After troubleshooting with @wilke0818, we identified a workaround that (at least temporarily) resolves the problem:
from multiprocessing import set_start_method
set_start_method("spawn", force=True)
yes, i should have told you that (that's what i debugged over the weekend on linux). on macos spawn is the default; on linux it's fork, and the default moves away from fork from 3.14 onwards (to forkserver on linux).
see here:
btw, there were some weird indications that it would not work if placed in cli.py under if __name__ == '__main__'
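The workaround above can be sketched as a small standalone script. This assumes the cf plugin relies on `concurrent.futures` process pools, and `configure_start_method` and `main` are hypothetical names. The start method is set inside the entry function rather than directly under the `__main__` guard. One possible explanation for the guard not working (an assumption, not confirmed in this thread) is that a console-script entry point imports the module and calls its function, so the guard body never runs.

```python
import math
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def configure_start_method() -> str:
    # force=True overrides a start method that was already chosen.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


def main() -> None:
    # Configure before any worker process is created.
    configure_start_method()
    with ProcessPoolExecutor(max_workers=2) as pool:
        # math.sqrt is importable in the spawned workers without extra setup.
        print(list(pool.map(math.sqrt, [1.0, 4.0, 9.0])))


if __name__ == "__main__":
    main()
```

With spawn, workers start from a fresh interpreter instead of inheriting the parent's state through fork, which is usually slower to start but avoids inheriting locks and threads held by the parent.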
do you think we can merge now? @satra
thank you @fabiocat93