Open ahkarami opened 3 years ago
On this subject, I find it a bit odd how much faster silence detection is in FFMPEG compared to pydub. It seems obvious that something like FFMPEG, which is system-specific AND ridiculously popular, will have many advantages over something like pydub: a very large userbase usually brings not only more contributors but more skilled ones, which raises the chances of optimizations being implemented. But even after considering all that, I find it strange that, given an audio file of roughly 33 minutes and 20 seconds, FFMPEG takes less than 1.5 seconds to detect some 480 silent segments, while pydub takes almost 450 seconds to split the file into 388 chunks. I included some code below (BTW, I'm using the `time` function from Fish Shell):
```
~ $ time ffmpeg -i "/home/frnco/Test.wav" -af silencedetect=n=-20dB:d=1 -f null - 2>&1 > /dev/null | grep 'silence_end' | wc -l
480
________________________________________________________
Executed in    1.38 secs      fish         external
   usr time    1.28 secs    0.00 millis    1.28 secs
   sys time    0.13 secs    3.88 millis    0.12 secs
```
```
~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import split_on_silence; chunks = split_on_silence(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40, keep_silence = True); print(len(chunks))'
388
________________________________________________________
Executed in  447.68 secs      fish         external
   usr time  443.50 secs    898.00 micros  443.50 secs
   sys time    1.82 secs    166.00 micros    1.82 secs
```
I realize there are many differences between the results of each command, and that the silence threshold is a bit off, but looking at the outputs, the splitting step is by far the biggest difference. For my use case it's not hard to switch between the two approaches, so I just switched to pure FFMPEG temporarily. Still, I believe a difference this big is not explained only by the lack of actual audio splitting and by harder-to-figure-out optimizations; there is probably also at least one optimization that would be relatively easy to implement.
I did take a look at FFMPEG's code to see if I could notice anything obvious, but I have VERY little experience with C and haven't worked with it in a long, long time, so pretty much all I could learn from skimming is that the number of classes and steps involved is too big to figure out that way. (Apparently FFMPEG works in a pretty linear way, using a `static av_always_inline` function to detect silent segments while keeping track of how many such silent segments have been found in a row, and using an `if` to print the info once that sequence reaches the desired silence duration.)
Pydub, on the other hand, I can understand a lot better (although I'm more of a Ruby guy, I do use Python for quite a few things and I'm a lot more familiar with it than with C). I noticed that splitting relies on detecting nonsilent segments, which in turn relies on actual silence detection, so I tested both functions and got results similar to the splitting I had attempted previously, suggesting the `detect_silence` function is responsible for most of the processing time. The results are below:
```
~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import detect_silence; chunks = detect_silence(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40); print(len(chunks))'
388
________________________________________________________
Executed in  452.02 secs      fish         external
   usr time  446.88 secs    0.00 millis   446.88 secs
   sys time    2.09 secs    1.93 millis     2.09 secs
```
```
~ $ time python -c 'import os; from pydub import AudioSegment; from pydub.silence import detect_nonsilent; chunks = detect_nonsilent(AudioSegment.from_file(os.path.join(os.getcwd(), "Test.wav")), min_silence_len = 1000, silence_thresh = -40); print(len(chunks))'
388
________________________________________________________
Executed in  447.61 secs      fish         external
   usr time  444.83 secs    0.00 millis   444.83 secs
   sys time    1.09 secs    1.40 millis     1.09 secs
```
Upon looking into that function I noticed it does a few interesting things, such as detecting gaps within the silence, which I believe FFMPEG does not do (though I may be wrong about that). What really caught my attention, though, is that, if I'm not mistaken, pydub starts by converting the audio into a list of slices, each with the target silence duration, and then tests each of those slices to figure out where the silences are. If that's right, I believe it explains why silence detection in pydub takes so much time on longer audio. When looking for one thousand milliseconds of silence in a 10-second audio, FFMPEG will run roughly 10 thousand checks, each on 1 ms of audio (one every 1 ms). Pydub, on the other hand, will produce a one-second slice at every millisecond until it reaches the 9-second mark, i.e. a list of 9 thousand one-second slices, each of which then gets tested. That is 9 thousand checks, but each running against a full second of audio (which, if processed by something that works in a manner similar to FFMPEG, means 1 thousand tests for each of pydub's 9,000 one-second slices, totaling 9,000,000 tests).
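To make the cost concrete, here is a minimal sketch of the windowed loop as I read it in `silence.py`. It is a simplification (the real code converts the threshold differently and merges consecutive starts into ranges), so take it as an illustration of the work involved, not as the actual implementation:

```python
# Simplified sketch of how I read pydub's detect_silence (not the actual code):
# a full min_silence_len-long slice is built and RMS-checked at every
# millisecond, so each sample ends up being touched ~min_silence_len times.
def detect_silence_sketch(seg, min_silence_len=1000, silence_thresh=-40):
    silence_starts = []
    last_start = len(seg) - min_silence_len      # len(seg) is in milliseconds
    for i in range(last_start + 1):              # one slice per millisecond
        window = seg[i:i + min_silence_len]      # a 1-second AudioSegment slice
        if window.dBFS < silence_thresh:         # RMS check over the whole window
            silence_starts.append(i)
    return silence_starts                        # would still need merging into ranges
```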
Of course, I may be horribly wrong and may have completely misunderstood the code, in which case the last paragraph is off the mark, but that doesn't change the fact that there is most likely some weirdness going on here. If my reasoning does point to something relevant, it might be a good idea to consider refactoring the `silence.py` file to include some of the ideas from FFMPEG: most importantly, testing smaller segments, testing each segment once and only once, and using that as a foundation for the other features (e.g., checking whether a non-silent segment is actually a blip and then re-classifying it as a gap inside a silence instead of an actual non-silent segment). A rough sketch of that idea follows below.
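For what it's worth, this is a hypothetical single-pass variant of what I mean by "testing each segment once and only once": score short frames a single time and look for runs of quiet frames whose combined length reaches `min_silence_len`, which is roughly how I understand FFMPEG's `silencedetect` to operate. The function name and the `frame_ms` parameter are my own, not part of pydub:

```python
# Hypothetical single-pass variant (function name and frame_ms are my own,
# not pydub API): each short frame is RMS-checked exactly once, and a silence
# is reported when a run of quiet frames reaches min_silence_len.
def detect_silence_single_pass(seg, min_silence_len=1000, silence_thresh=-40, frame_ms=10):
    silences = []
    run_start = None
    for i in range(0, len(seg), frame_ms):
        frame = seg[i:i + frame_ms]              # each frame examined exactly once
        if frame.dBFS < silence_thresh:
            if run_start is None:
                run_start = i                    # a quiet run begins
        else:
            if run_start is not None and i - run_start >= min_silence_len:
                silences.append([run_start, i])
            run_start = None                     # run broken by a loud frame
    if run_start is not None and len(seg) - run_start >= min_silence_len:
        silences.append([run_start, len(seg)])   # trailing silence
    return silences
```

Blip detection could then be layered on top by merging two silences separated by a sufficiently short non-silent run, rather than being baked into the per-slice check.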
I realize some features such as blip detection may be easier to implement on top of the current approach, but I found pydub quite easy to work with and it made figuring out how to do what I wanted a lot easier, so even though I dropped it in favor of FFMPEG for now, I felt the least I could do was take a little time to check out the code and put some decent feedback together. I do realize PRs are a better way to contribute, and I may at some point take on the task of optimizing pydub's silence detection so I can use it across multiple platforms. Since I won't be able to do that in the near future, though, I thought I should at least point out how big the difference is compared to FFMPEG and which parts of the code look to me like the most probable causes of it. Hope this helps, even if only a little.
I use `seek_step = 100` to make `pydub.silence.split_on_silence` faster; the default value is 1 (= 1 millisecond).
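For reference, a minimal example of how that parameter is passed (the file name is just a placeholder):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("Test.wav")  # placeholder path
# seek_step=100 checks a window every 100 ms instead of every 1 ms,
# roughly a 100x reduction in the number of slices examined,
# at the cost of 100 ms granularity in the detected boundaries.
chunks = split_on_silence(audio,
                          min_silence_len=1000,
                          silence_thresh=-40,
                          keep_silence=True,
                          seek_step=100)
```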
Thanks for the great repo. The `silence.detect_silence` & `silence.detect_nonsilent` methods seem to be slow. On my system, when I applied these methods to a long song (e.g., a 100-minute song), they used just one CPU core and the entire process took ~4 minutes. Can we make these methods parallel (e.g., via multiprocessing) to speed them up? Thanks