kaegi / alass

"Automatic Language-Agnostic Subtitle Synchronization"
GNU General Public License v3.0

Sync not working against video directly but working perfectly against subs generated from pyTranscriber #11

Closed: Dnkhatri closed this issue 4 years ago

Dnkhatri commented 4 years ago

Syncing against the video directly does not work, but syncing against subtitles generated with pyTranscriber works perfectly. Because pyTranscriber generates transcriptions, loud noises and music are ignored, so the syncing accuracy is a lot better. https://github.com/raryelcostasouza/pyTranscriber
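For readers who want to reproduce the two setups compared above, here is a minimal sketch in Python. It assumes the alass binary is on the PATH under the name `alass` (depending on how it was installed, the name may differ, for example `alass-cli`), and every file name is hypothetical. alass takes the reference (a video or a correctly timed subtitle), then the incorrectly timed subtitle, then the output path.

```python
import subprocess


def alass_command(reference, incorrect_sub, output_sub, binary="alass"):
    """Build the alass argument list: reference, subtitle to fix, output path.

    `binary` is an assumption; adjust it to the name of your alass executable.
    """
    return [binary, reference, incorrect_sub, output_sub]


# Setup 1: sync directly against the video, so alass runs its own VAD on the audio.
video_cmd = alass_command("episode01.mkv", "episode01.srt", "episode01.synced.srt")

# Setup 2: sync against a subtitle file transcribed beforehand (e.g. with pyTranscriber).
transcript_cmd = alass_command(
    "episode01.transcribed.srt", "episode01.srt", "episode01.synced.srt"
)


def run(cmd):
    """Execute a prepared command and raise if alass exits with an error."""
    subprocess.run(cmd, check=True)
```

Calling `run(video_cmd)` or `run(transcript_cmd)` would then perform the actual sync; only the reference argument differs between the two setups.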

kaegi commented 4 years ago

Yes, the current WebRTC "voice-activity-detection" module behaves more like a "noise detector" (its results are very poor in general, but often good enough for the alignment algorithm). Other VAD architectures already exist that promise a much lower classification error on high-noise audio data.

A central goal for alass is to use a fast, offline and language-agnostic VAD module. Swapping out the current VAD for a better one that still fulfills these criteria would be a welcome change. I have not tried any alternatives myself.

pyTranscriber apparently depends on the Google Cloud Speech Server, which is very risky for the long-term reliability of the program.

FYI: the simple and highly optimized WebRTC VAD already takes about 2 seconds for a 2-hour movie.

apommel commented 4 years ago

I tried BingLingGroup/autosub. It does seem to give good results, but the speed is not comparable at all (their process takes around 20 minutes for one hour of content), even though they seem to be working on improving it. It is also not offline.

Dnkhatri commented 4 years ago

> I tried BingLingGroup/autosub. It does seem to give good results, but the speed is not comparable at all (their process takes around 20 minutes for one hour of content), even though they seem to be working on improving it. It is also not offline.

I think the bottleneck might be your internet speed. I am able to transcribe about ten 40-50 minute episodes in 20-25 minutes with pyTranscriber. When I ran 5 instances of pyTranscriber with batches of 10 episodes each, they all finished in under 30 minutes. That said, after playing around with the alass strictness settings, I need transcription a lot less. I usually need it only when a show has loud noises or background songs that the VAD detects incorrectly.
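To put the speed figures quoted in this thread side by side, here is a rough back-of-the-envelope calculation in Python. It only restates the numbers reported above as a "real-time factor" (content duration divided by processing time). The helper name `realtime_factor` is made up for illustration, and the reported figures are anecdotal, not benchmarks.

```python
def realtime_factor(content_seconds, processing_seconds):
    """How many seconds of content are processed per second of wall-clock time."""
    return content_seconds / processing_seconds


MIN = 60
HOUR = 60 * MIN

# WebRTC VAD: about 2 seconds for a 2-hour movie.
webrtc_vad = realtime_factor(2 * HOUR, 2)

# autosub: around 20 minutes for one hour of content.
autosub = realtime_factor(1 * HOUR, 20 * MIN)

# pyTranscriber, single instance: ten 40-50 minute episodes in 20-25 minutes.
# Slowest case: shortest episodes, longest run; fastest case: the reverse.
pytranscriber_low = realtime_factor(10 * 40 * MIN, 25 * MIN)
pytranscriber_high = realtime_factor(10 * 50 * MIN, 20 * MIN)

print(f"WebRTC VAD:    {webrtc_vad:.0f}x real time")
print(f"autosub:       {autosub:.0f}x real time")
print(f"pyTranscriber: {pytranscriber_low:.0f}x to {pytranscriber_high:.0f}x real time")
```

On these figures the WebRTC VAD comes out around 3600x real time, against roughly 3x for autosub and 16x to 25x for a single pyTranscriber instance, which matches the point made earlier in the thread that the online transcription tools are orders of magnitude slower than the built-in VAD.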