Audio synthesis step required?

DanielSWolf commented 6 years ago

Hi,

I need forced alignment for a project with a different technology stack and an incompatible license (C++, MIT license). So I'm thinking of borrowing your approach and implementing a simplified version in C++.

I read your HOWITWORKS file, which sounds very sensible to me. But there is one thing I'm not sure about: The dialog I want to align isn't plain text, but already a sequence of IPA phones. So I was wondering whether the step of synthesizing speech from the dialog, then extracting MFCC is really necessary in my case. What if I extracted one or more sets of MFCC from each IPA phone in advance? Then to get an MFCC representation of the dialog, I'd simply concatenate the MFCC sets of the individual IPA phones.

Do you think this approach is feasible and would yield good results? I don't have any practical experience with MFCC yet, so I may be missing some crucial aspect here.

readbeyond commented 6 years ago

On 01/04/2018 09:08 PM, Daniel Wolf wrote:

Hi,

I need forced alignment for a project with a different technology stack and an incompatible license (C++, MIT license). So I'm thinking of borrowing your approach and implementing a simplified version in C++.

I read your HOWITWORKS file, which sounds very sensible to me. But there is one thing I'm not sure about: The dialog I want to align isn't plain text, but already a sequence of IPA phones. So I was wondering whether the step of synthesizing speech from the dialog, then extracting MFCC is really necessary in my case.

Hi,

probably not.

What if I extracted one or more sets of MFCC from each IPA phone in advance? Then to get an MFCC representation of the dialog, I'd simply concatenate the MFCC sets of the individual IPA phones.

It might work. However, note that the MFCCs of the same (IPA) phoneme are not necessarily the same, even for the same speaker. The way a phoneme is realized in audio (i.e., pronounced) depends on its context; many models in the literature take into account a window around the current phoneme. Still, the MFCCs for a given phone are probably much closer to each other than to the MFCCs of other phones, so for this SR/FA task the differences might not matter.
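A minimal sketch of the lookup-and-concatenate idea, written in Python for brevity. The table `PHONE_MFCC`, its toy two-dimensional values, and the function name `concat_phone_mfcc` are all made up for illustration; real MFCC vectors would typically have around a dozen coefficients per frame and would come from whatever source you settle on.

```python
from typing import Dict, List, Tuple

Frame = List[float]  # one MFCC vector (toy dimensionality here)

# Hypothetical precomputed table: phone -> template frames.
# The numbers are placeholders, not real MFCC values.
PHONE_MFCC: Dict[str, List[Frame]] = {
    "h": [[0.1, 0.2]],
    "ə": [[0.5, 0.4], [0.6, 0.4]],
    "l": [[0.9, 0.1]],
    "oʊ": [[0.3, 0.8], [0.2, 0.9], [0.2, 0.7]],
}


def concat_phone_mfcc(
    phones: List[str], table: Dict[str, List[Frame]]
) -> Tuple[List[Frame], List[Tuple[str, int, int]]]:
    """Concatenate per-phone templates into one frame sequence.

    Also returns, for each phone, the half-open frame range
    [start, end) it occupies in the concatenated sequence, so that
    an alignment can later be mapped back to individual phones.
    """
    frames: List[Frame] = []
    spans: List[Tuple[str, int, int]] = []
    for phone in phones:
        template = table[phone]  # raises KeyError for unknown phones
        start = len(frames)
        frames.extend(template)
        spans.append((phone, start, len(frames)))
    return frames, spans


frames, spans = concat_phone_mfcc(["h", "ə", "l", "oʊ"], PHONE_MFCC)
```

Keeping the per-phone frame ranges alongside the concatenated sequence is what lets you turn a frame-level alignment back into phone timings.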

Still, I am not sure how you plan to get the MFCCs for each phoneme: via a TTS engine (and therefore, what's the point of avoiding the TTS part? Just speed?) or via an available database.

Do you think this approach is feasible and would yield good results? I don't have any practical experience with MFCC yet, so I may be missing some crucial aspect here.

In general, I would also recommend having a look at HMM/GMM tools like Kaldi; they will probably give you better results than aeneas. You can find a list of open source forced aligners here: https://github.com/pettarin/forced-alignment-tools

DanielSWolf commented 6 years ago

Thanks for your helpful answers!

You can find a list of open source forced aligners here: ...

Actually, that was the very site that led me to your project! :smile:

I'm afraid, however, that none of these fit my requirements. :disappointed: Most libraries only support English or a very small number of languages, but I'd like to support a large number of languages. Also, my code is under the MIT license, but most of the listed libraries have incompatible licenses.

That only leaves CMU Sphinx. I'm already using pocketsphinx, which is working fine for English dialog. But it requires a different language model and acoustic model for each language, and these files are rather big. So I'm looking for a more generic, language-independent solution.

Still, I am not sure how you plan to get the MFCCs for each phoneme: via a TTS engine (and therefore, what's the point of avoiding the TTS part? Just speed?) or via an available database.

Speed isn't the issue. The thing is that if I create the MFCCs at runtime, the TTS engine I use needs to be written in C/C++, needs to explicitly support all relevant languages, needs to use an MIT-compatible license, and shouldn't require large (> ~20 MB) resource files. If, however, I pre-compute the MFCCs, I don't have to care about any of this. I can use any approach I like to get the MFCCs once, and the actual code only grows by a small lookup table.

That being said, do you have any tips on how to get MFCCs covering all IPA phones?

However note that the MFCCs of the same (IPA) phoneme are not necessarily the same even for the same speaker. The way they are realized in audio (i.e., pronounced) depends on their context, many models in the literature take into account a window around the current phoneme.

That's what I was afraid of. Do you know of any papers that might illuminate that specific aspect?

readbeyond commented 6 years ago

On 01/06/2018 10:40 PM, Daniel Wolf wrote:

That being said, do you have any tips on how to get MFCCs covering all IPA phones?

espeak(-ng) lets you input an IPA string. You might want to pass a single IPA character at a time, get the corresponding audio file, extract the MFCCs and store them.

You probably want to do the above for pairs or triples of IPA characters, instead of single IPA characters, for (possibly) better results.
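A rough sketch of how a lookup keyed by pairs or triples could fall back to single phones when a context was never precomputed. Everything here is an assumption made for illustration: the key layout, the names `context_keys` and `pick_template`, and the convention that each entry stores only the frames of the phone being looked up (how to cut those frames out of a synthesized pair or triple is left open).

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]
Frame = List[float]


def context_keys(phones: List[str], i: int) -> List[Key]:
    """Candidate lookup keys for phones[i], most specific first."""
    keys: List[Key] = []
    if 0 < i < len(phones) - 1:
        # Triple: left neighbour, current phone, right neighbour.
        keys.append((phones[i - 1], phones[i], phones[i + 1]))
    if i > 0:
        # Pair: left neighbour, current phone.
        keys.append((phones[i - 1], phones[i]))
    # Context-free fallback: the phone on its own.
    keys.append((phones[i],))
    return keys


def pick_template(
    phones: List[str], i: int, table: Dict[Key, List[Frame]]
) -> Tuple[Key, List[Frame]]:
    """Return the most specific key present in the table, with its frames."""
    for key in context_keys(phones, i):
        if key in table:
            return key, table[key]
    raise KeyError(f"no template for phone {phones[i]!r}")


# Toy table with placeholder values: one triple plus single-phone fallbacks.
toy_table: Dict[Key, List[Frame]] = {
    ("a", "b", "c"): [[0.2, 0.3]],
    ("a",): [[0.1, 0.1]],
    ("b",): [[0.5, 0.5]],
    ("c",): [[0.9, 0.9]],
}
toy_phones = ["a", "b", "c"]
chosen = [pick_template(toy_phones, i, toy_table)[0] for i in range(len(toy_phones))]
```

With a fallback like this, the table can start with single phones only and grow with pairs or triples where they turn out to help.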

This process is very rough but it might be sufficient for your needs (not sure).
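One way to judge whether such rough templates are good enough is to align the concatenated template frames against the MFCCs of the real recording and inspect the result. The sketch below assumes a plain dynamic time warping (DTW) alignment with Euclidean frame distance and no band constraint; the function names and the one-dimensional toy data are made up for illustration, and a real implementation would need something faster than this O(n·m) pure-Python loop.

```python
import math
from typing import List, Tuple

Frame = List[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(ref: List[Frame], obs: List[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total DTW cost and the warping path.

    The path is a list of (ref_index, obs_index) pairs, in order.
    """
    n, m = len(ref), len(obs)
    inf = float("inf")
    # cost[i][j]: best cost of aligning ref[:i] with obs[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], obs[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy data: the "recording" holds the first and last template frame twice as long.
ref = [[0.0], [1.0], [2.0]]
obs = [[0.0], [0.0], [1.0], [2.0], [2.0]]
total, path = dtw_path(ref, obs)
```

Each template frame's first and last partner in `path` gives the range of recording frames it was matched to, which is what you would convert into timings.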

However note that the MFCCs of the same (IPA) phoneme are not necessarily the same even for the same speaker. The way they are realized in audio (i.e., pronounced) depends on their context, many models in the literature take into account a window around the current phoneme.

That's what I was afraid of. Do you know of any papers that might illuminate that specific aspect?

Gales and Young, The application of HMMs in speech recognition

Trentin and Gori, A survey of hybrid ANN/HMM models for automatic speech recognition

Taylor, Text to speech synthesis

(you can find PDFs of the above by googling)

DanielSWolf commented 6 years ago

Thanks for your advice. You've been very helpful!

readbeyond commented 6 years ago

You are welcome.

AP

On 01/12/2018 08:56 AM, Daniel Wolf wrote:

Closed #192 https://github.com/readbeyond/aeneas/issues/192.
