jiaaro / pydub

Manipulate audio with a simple and easy high level interface
http://pydub.com
MIT License
8.98k stars 1.05k forks

[Feature] Speed up the detect_silence & detect_nonsilent methods #595

Open ahkarami opened 3 years ago

ahkarami commented 3 years ago

Thanks for the great repo. The silence.detect_silence & silence.detect_nonsilent methods seem to be slow. On my system, when I applied these methods to a long song (e.g., a 100-minute song), they used only one CPU core and the entire process took ~4 minutes. Can we make these methods parallel (e.g., via multiprocessing) to speed them up? Thanks
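One possible direction (a sketch of my own, not an existing pydub feature): split the audio into chunks that overlap by min_silence_len, so no silent run straddles a chunk boundary, scan each chunk in a worker process, and shift results back to absolute positions. The helper below only computes the chunk boundaries; `chunk_bounds` is a hypothetical name.

```python
# Hypothetical chunking helper for parallel silence detection; not part
# of pydub. Each (start_ms, end_ms) window could be sliced out of an
# AudioSegment and scanned in its own process (e.g. with
# concurrent.futures.ProcessPoolExecutor), then results offset by start.
# Overlapping by min_silence_len keeps boundary-straddling silences intact.

def chunk_bounds(total_ms, n_workers, overlap_ms):
    """Overlapping (start_ms, end_ms) windows covering [0, total_ms)."""
    step = total_ms // n_workers
    return [(i * step, min(total_ms, (i + 1) * step + overlap_ms))
            for i in range(n_workers)]

# A 100-minute file split across 4 workers with a 1 s overlap:
print(chunk_bounds(100 * 60 * 1000, 4, 1000))
```

Deduplicating detections that fall inside the overlap regions is left out of the sketch; that merge step is what a real implementation would need to get right.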

frnco commented 3 years ago

On this subject, I find it a bit strange how much faster silence detection is in FFMPEG compared to pydub. It seems obvious that something like FFMPEG, which is system-specific AND ridiculously popular, will have many advantages over something like pydub: a very large userbase usually increases not only the number of contributors but also their skill level, which raises the chances of optimizations being implemented. But even accounting for all that, I find it strange that, given an audio file (mine was around 33 minutes and 20 seconds long), FFMPEG takes less than 1.5 seconds to detect some 480 silent segments, while pydub takes almost 450 seconds to split the same file into 388 chunks. I included some code below (BTW I'm using the time function from Fish Shell):

~ $ time ffmpeg -i "/home/frnco/Test.wav" -af silencedetect=n=-20dB:d=1 -f null - 2>&1 > /dev/null | grep 'silence_end' | wc -l
480

________________________________________________________
Executed in    1.38 secs    fish           external
   usr time    1.28 secs    0.00 millis    1.28 secs
   sys time    0.13 secs    3.88 millis    0.12 secs

~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import split_on_silence; chunks = split_on_silence(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40, keep_silence = True); print(len(chunks))'
388

________________________________________________________
Executed in  447.68 secs    fish           external
   usr time  443.50 secs  898.00 micros  443.50 secs
   sys time    1.82 secs  166.00 micros    1.82 secs

I realize there are many differences between the results of each command, and that the silence threshold is a bit off. But judging from the outputs, the splitting step is the biggest difference, and for my use case it's not hard to switch between the two approaches, so I switched to pure FFMPEG temporarily. Still, I believe a difference this big can't be explained only by the lack of actual audio splitting and hard-to-find optimizations; there are probably also some optimizations that would be easy to implement.

I did take a look at FFMPEG's code to see if I could spot anything obvious, but I have VERY little experience with C and haven't worked with it in a long, long time, so pretty much all I could learn from skimming was that the number of classes and steps involved is too big to understand at a glance. (Apparently FFMPEG works in a fairly linear way, using a static av_always_inline function to detect silent frames while keeping track of how many silent frames have been found in a row, and printing the info once that run meets the desired silence duration.)

Pydub, on the other hand, I can understand a lot better (although I'm more of a Ruby guy, I use Python for quite a few things and am far more familiar with it than with C). I noticed that splitting relies on detecting nonsilent segments, which in turn relies on actual silence detection, so I tested both functions and got results similar to the splitting run above, suggesting the detect_silence function is responsible for most of the processing time. The results are below:

~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import detect_silence; chunks = detect_silence(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40); print(len(chunks))'
388

________________________________________________________
Executed in  452.02 secs    fish           external
   usr time  446.88 secs    0.00 millis  446.88 secs
   sys time    2.09 secs    1.93 millis    2.09 secs

~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import detect_nonsilent; chunks = detect_nonsilent(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40); print(len(chunks))'
388

________________________________________________________
Executed in  447.61 secs    fish           external
   usr time  444.83 secs    0.00 millis  444.83 secs
   sys time    1.09 secs    1.40 millis    1.09 secs

Looking into that function, I noticed it does a few interesting things, such as detecting gaps within silences, which I believe FFMPEG does not (though I may be wrong about that). What caught my attention, however, was that, if I'm not mistaken, pydub starts by converting the audio into a list of segments with the target silence duration and tests each of those to find silences. If that's right, it would explain why silence detection in pydub takes so long on longer audio: when looking for 1000 milliseconds of silence in a 10-second file, FFMPEG runs 10 thousand checks, each on 1 ms of audio, while pydub produces a 1-second segment every 1 ms up to the 9-second mark, i.e., a list of 9 thousand one-second segments, each of which is then tested. That's 9 thousand checks, but each against a full second of audio (which, if processed by something that works like FFMPEG, means 1 thousand tests for each of pydub's 9000 one-second segments, totaling 9,000,000 tests).
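To make the arithmetic above concrete, here is a tiny sketch (my own illustration, not pydub source) counting the windows an overlapping-window scan would test:

```python
# Illustration of why the overlapping-window approach described above is
# expensive: a 1000 ms window advanced 1 ms at a time over 10 s of audio
# yields ~9000 windows, each covering a full second of samples.

def window_starts(audio_ms, window_ms, step_ms=1):
    """Start offsets of every window an overlapping scan would test."""
    return list(range(0, audio_ms - window_ms + 1, step_ms))

starts = window_starts(10_000, 1_000)
print(len(starts))          # 9001 windows (the "9 thousand segments" above)
print(len(starts) * 1_000)  # ~9 million window-milliseconds examined in total
```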

Of course I may be horribly wrong and may have completely misunderstood the code, in which case the last paragraph is off base. But that doesn't change the fact that something strange is most likely going on here, and if my reasoning does point at something relevant, it might be worth refactoring the silence.py file to borrow some ideas from FFMPEG: most importantly, testing smaller segments, testing each segment once and only once, and building the other features on top of that (i.e., checking whether a non-silent segment is actually a blip and then re-classifying it as a gap within a silence rather than an actual non-silent segment).
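If the reasoning above holds, a one-pass detector in the spirit of FFMPEG's silencedetect might look like the sketch below. This is pure Python of my own, not pydub API: per-millisecond loudness values stand in for the actual audio analysis, and each frame is classified exactly once while a running counter tracks consecutive silent frames.

```python
# Hypothetical single-pass silence detector (FFMPEG-style run tracking).
# `levels_db` would come from the audio (one dBFS reading per ms); here
# it is just a list of floats so the logic stands on its own.

def detect_silence_single_pass(levels_db, min_silence_len=1000,
                               silence_thresh=-40.0):
    """Return [start_ms, end_ms] ranges of silence in one pass over input."""
    ranges = []
    run_start = None            # start of the current silent run, if any
    for i, level in enumerate(levels_db):
        if level < silence_thresh:
            if run_start is None:
                run_start = i   # a silent run begins
        else:
            # the run just ended; keep it only if it was long enough
            if run_start is not None and i - run_start >= min_silence_len:
                ranges.append([run_start, i])
            run_start = None
    # handle a silent run that extends to the end of the audio
    if run_start is not None and len(levels_db) - run_start >= min_silence_len:
        ranges.append([run_start, len(levels_db)])
    return ranges

levels = [-60.0] * 1500 + [-10.0] * 500 + [-60.0] * 2000
print(detect_silence_single_pass(levels))  # [[0, 1500], [2000, 4000]]
```

Blip detection could then be layered on top by post-processing the runs (e.g., merging two silences separated by a sub-threshold-length loud burst) rather than by re-scanning the audio.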

I realize some features, such as blip detection, may be easier to implement with the current approach. But I found pydub quite easy to work with, and it made figuring out how to do what I wanted a lot easier, so even though I dropped it in favor of FFMPEG for now, the least I could do was take a little time to read the code and put together some decent feedback. I know PRs are a better way to contribute, and I may at some point take on the task of optimizing pydub's silence detection so I can use it across multiple platforms. Since I won't be able to do that in the near future, though, I wanted to at least point out how big the gap to FFMPEG is and which parts of the code look like the most probable causes of the difference in performance. Hope this helps, even if only a little.

milahu commented 2 years ago

I use seek_step=100 to make pydub.silence.split_on_silence faster. The default value is 1 (= 1 millisecond).