otsaloma / gaupol

Editor for text-based subtitle files
https://otsaloma.io/gaupol/
GNU General Public License v3.0

generate new subtitles with speech recognition #194

Closed: milahu closed this issue 2 years ago

milahu commented 2 years ago

i recently hacked together a srtgen python script to generate "pretty good" subtitles with google cloud speech

"pretty good" as in, the timing is pretty accurate for all words

postprocessing is needed to group words into sentences but that could be improved by analyzing video keyframes
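A minimal sketch of that postprocessing step, assuming the recognizer returns one start and end time per word. It is not the srtgen script itself: the `Word` class, the function names, and the `max_gap` and `max_chars` thresholds are made up for illustration. It splits at pauses and at a length limit only, it does not look at video keyframes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container for one recognized word with its timing.
    text: str
    start: float  # seconds
    end: float    # seconds


def group_words(words, max_gap=0.6, max_chars=42):
    """Split words into cues at long pauses or when a cue gets too long.

    max_gap and max_chars are illustrative defaults, not tuned values.
    """
    cues, current = [], []
    for word in words:
        if current:
            gap = word.start - current[-1].end
            length = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            if gap > max_gap or length > max_chars:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(cues):
    """Render cues (lists of Word) as SRT text."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Splitting on pauses is the simplest heuristic that needs nothing but the word timings. A keyframe-based split would need the video as a second input.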

could be generalized with the python module speech_recognition which supports many different services (but probably most services don't return timestamps for every word)

would be a nice feature for gaupol, no?

or stick to unix philosophy? do one thing, and do it right

otsaloma commented 2 years ago

It's beginning to become clear that the state of the art in this matter is in various command line programs with very heavy dependencies and either poor or non-existent packaging, which basically means I'm very unlikely to integrate any of this into Gaupol. Thus, closing this issue.

From https://github.com/otsaloma/gaupol/issues/17

And in addition to the above, it seems that the situation is moving fast with all kinds of new tools that someone makes over the course of one weekend and then stops maintaining. There's no winner in sight, neither as methodology nor as implementation.

And something that depends on Google APIs is quite out of the question as it's very inconvenient for users to get started with and very volatile on account of changes in API, pricing etc.

The only thing that might make sense done in Gaupol is to add a documentation page linking to these various tools. Here's a couple I have run into and tried: https://github.com/otsaloma?tab=stars&q=subtitle

milahu commented 2 years ago

> Here's a couple I have run into and tried: https://github.com/otsaloma?tab=stars&q=subtitle

AutoSub looks good. offline speech recognition has good sides (free, private) and bad sides (cpu/gpu usage, lower quality)

> something that depends on Google APIs is quite out of the question

i expect that google speech will give the best results, my only concern is privacy

cost is about $1 per hour of audio (meh)

maybe i will use a hybrid solution: use google only when deepspeech fails
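A rough sketch of such a hybrid, under stated assumptions. The thread does not say what "fails" means, so here it is taken to mean that the offline engine raises an error, returns nothing, or reports a confidence below a threshold. `transcribe_hybrid`, the two callables, and `min_confidence` are hypothetical names. No real DeepSpeech or Google API is called; the engines are passed in as plain functions.

```python
from typing import Callable, Optional, Tuple

# (transcript, confidence in [0, 1]); None means the engine produced nothing.
Result = Optional[Tuple[str, float]]


def transcribe_hybrid(
    audio: bytes,
    offline: Callable[[bytes], Result],
    cloud: Callable[[bytes], Result],
    min_confidence: float = 0.8,  # illustrative threshold, not a tuned value
) -> Tuple[str, str]:
    """Return (transcript, engine_used), preferring the offline engine."""
    try:
        result = offline(audio)
    except Exception:
        # Treat any offline error as a failure and fall through to the cloud.
        result = None
    if result is not None and result[1] >= min_confidence:
        return result[0], "offline"
    # Only now is audio sent to the paid, non-private service.
    result = cloud(audio)
    if result is None:
        raise RuntimeError("both engines failed")
    return result[0], "cloud"
```

Run per audio segment rather than per file, this would send only the hard segments to the cloud service, which limits both the cost and the amount of audio that leaves the machine.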

with AutoSub, closing as wontfix:

> stick to unix philosophy? do one thing, and do it right