readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0
2.48k stars 222 forks

Automated text fragmentation based on audio analysis #69

Open readbeyond opened 8 years ago

readbeyond commented 8 years ago

See also https://groups.google.com/forum/#!topic/aeneas-forced-alignment/LNKiAw1Zj14

pettarin commented 8 years ago

Working on it

readbeyond commented 7 years ago

Too big, too radical. It should go in a major release, if we find a sponsor to fund the work.

RyanEdwardHall commented 7 years ago

I'm thinking about ways to implement this without doing too much work, and I'm considering the following approach. Do a first pass with each word as its own fragment. Then iterate over the output and build up longer fragments by adding the duration of each word until I see that I've hit some threshold - maybe I want fragments that are at least 5 seconds long, or maybe there seems to be a long period of silence before the next word. I'm assuming based on this issue that there are currently no other ways to create fragments with aeneas?
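The merging pass described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not something aeneas provides: the `Fragment` class, the `merge_words` function, and the `min_duration` and `max_gap` parameters are hypothetical names, and the word timings are assumed to have already been loaded from a word-level sync map into (begin, end, text) records. The silence check is only useful if the word-level output leaves gaps between consecutive words; if the fragments are contiguous, the gap is always zero and only the duration threshold has any effect.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Fragment:
    """Hypothetical record for one aligned fragment (times in seconds)."""
    begin: float
    end: float
    text: str


def merge_words(words: Iterable[Fragment],
                min_duration: float = 5.0,
                max_gap: float = 0.5) -> List[Fragment]:
    """Greedily merge word-level fragments into longer ones.

    A new fragment is started when the current one already lasts at least
    `min_duration` seconds, or when the silence before the next word is at
    least `max_gap` seconds. Both thresholds are illustrative defaults.
    """
    merged: List[Fragment] = []
    current = None
    for word in words:
        if current is None:
            current = Fragment(word.begin, word.end, word.text)
            continue
        gap = word.begin - current.end
        long_enough = (current.end - current.begin) >= min_duration
        if long_enough or gap >= max_gap:
            # Close the current fragment and start a new one at this word.
            merged.append(current)
            current = Fragment(word.begin, word.end, word.text)
        else:
            # Extend the current fragment to cover this word.
            current.end = word.end
            current.text += " " + word.text
    if current is not None:
        merged.append(current)
    return merged
```

Because the pass is greedy, a fragment can exceed `min_duration` by the length of its last word, and nothing here looks at punctuation or sentence boundaries, so caption-quality breaks would likely need extra rules on top.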

readbeyond commented 7 years ago

On 09/06/2017 09:31 PM, Ryan Hall wrote:

> I'm thinking about ways to implement this without doing too much work, and I'm considering the following approach. Do a first pass with each word as its own fragment. Then iterate over the output and build up longer fragments by adding the duration of each word until I see that I've hit some threshold - maybe I want fragments that are at least 5 seconds long,

Hi,

This does not work well for languages such as Italian, where phrase length varies widely: you can have very "segmented" breathing, but also long strings of adjectives and nouns pronounced together without interruption.

But it might be a reasonable approach for other languages --- mind, I am not an expert on this topic.

> or maybe there seems to be a long period of silence before the next word.

This works better, yes, especially if the speaker has disciplined breathing.

> I'm assuming based on this issue that there are currently no other ways to create fragments with aeneas?

Assuming "with aeneas" means "with the tools/features that are computed/implemented in aeneas" --- i.e. MFCCs and VAD and so on --- then probably no.

But again, let me remark that many other approaches have been investigated in the speech processing literature, and most of them yield better results (also at word level) because they apply more sophisticated math to better linguistic/phonetic models. Of course, the drawback is that they work only if said models are available or can be built reasonably fast and cheaply.

All this to say: assuming you are interested in English (in contrast with "exotic" languages like Icelandic) and in aligning at word level, why not use one of the many other aligners out there? (See: https://github.com/pettarin/forced-alignment-tools ) This is a genuine question: I am interested in finding out why people want to improve aeneas, especially at word level.

Thank you,

AP

RyanEdwardHall commented 7 years ago

My use case is generating closed captions for video content. Aeneas has great documentation, is actively developed, and it looks like it can be used for commercial purposes (unlike the projects based off HTK). I'm primarily looking at English but the ability to add more languages in the future is appealing. Really appreciate the detailed response and you've given me lots to think about!

readbeyond commented 7 years ago

OK, thanks for taking the time to let me know, much appreciated.

Alberto Pettarin