prosodylab / Prosodylab-Aligner

Python interface for forced audio alignment using HTK and SoX
http://prosodylab.org/tools/aligner/
MIT License

Ideal Corpus #81

Closed miracle2k closed 4 years ago

miracle2k commented 4 years ago

I am trying to train a model for a new language. I have:

  • a set of audio files and transcriptions I want to align, of many different speakers, in many different situations, with varying recording qualities, speed of speech and so on. Let's call this the target material.
  • a set of already aligned audio files of a limited set of speakers, in consistent studio quality. Let's call this the speech corpus.

Since the process is self-learning, I struggle to understand the trade-offs on the "training/model axis". That is:

  • should I run the training on my target material, forgoing the speech corpus? After all, this is the actual material I want to align, not those other speakers in the corpus.
  • should I limit the size of the training corpus in some way? I understand at some point training may take too long, but is that the only consideration that would cause me to say: beyond this amount of data, training is done and we use the model (as opposed to using the training process itself for alignment)?
  • indeed, what is a good corpus size?
  • should I expect better results using the high-quality speech corpus for training?

If there are papers or other resources that help me understand better what causes better/worse results in the alignment process, I'd be grateful to find them.

kylebgorman commented 4 years ago

I am trying to train a model for a new language. I have:

  • a set of audio files and transcriptions I want to align, of many different speakers, in many different situations, with varying recording qualities, speed of speech and so on. Let's call this the target material.
  • a set of already aligned audio files of a limited set of speakers, in consistent studio quality. Let's call this the speech corpus.

This tool isn't set up to use pre-aligned data. Other tools may be, but I don't know of any evidence that manually aligned data is really all that useful for training aligners.

Since the process is self-learning, I struggle to understand the trade-offs on the "training/model axis". That is:

  • should I run the training on my target material, forgoing the speech corpus? After all, this is the actual material I want to align, not those other speakers in the corpus.

Yeah I think that's the first thing I'd recommend to try.

Second thing I'd recommend is to also throw in the already aligned data (but ignore the alignments). It probably won't hurt.

Third thing I'd recommend is to do some training with an "omnibus" multispeaker corpus and then split the model into speaker-dependent ones (so you'd train it a few more iterations on just one speaker). That isn't automated by this tool but it should be straightforward to do, especially using the Python interface (rather than a purely command-line interface).
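The per-speaker bookkeeping for that third option is simple to script. A minimal sketch in Python, assuming (purely as a hypothetical naming convention, adapt to your own) that each utterance filename encodes the speaker ID as the prefix before the first underscore:

```python
from collections import defaultdict
from pathlib import Path


def split_by_speaker(wav_paths):
    """Group utterance files by speaker.

    Assumes the speaker ID is the filename prefix before the first
    underscore (e.g. "s1_greeting.wav" -> speaker "s1"); this is an
    assumption about the naming scheme, not something the aligner enforces.
    """
    by_speaker = defaultdict(list)
    for p in map(Path, wav_paths):
        speaker = p.stem.split("_", 1)[0]
        by_speaker[speaker].append(p)
    return dict(by_speaker)
```

You would then continue training the omnibus model for a few more iterations on each per-speaker subset to produce the speaker-dependent models.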

You can also hold out some of your pre-aligned data to measure alignment quality. A standard metric would be to count the number of boundaries within some x ms (50 ms?) of the gold data and then divide by the total number of boundaries; call that boundary accuracy. That's an "NLP"-ish way to evaluate; speech people would probably instead compute 100 x (1 - boundary accuracy) and call it "boundary error rate".
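Both metrics fit in a few lines. A minimal sketch, assuming the gold and hypothesized boundaries are given in seconds and already paired by index (a real evaluation would first have to match up the two boundary sequences):

```python
def boundary_accuracy(gold, hyp, tolerance=0.05):
    """Fraction of hypothesized boundaries that fall within `tolerance`
    seconds (default 50 ms) of the corresponding gold boundary."""
    assert len(gold) == len(hyp), "boundary sequences must be paired"
    hits = sum(abs(g - h) <= tolerance for g, h in zip(gold, hyp))
    return hits / len(gold)


def boundary_error_rate(gold, hyp, tolerance=0.05):
    """Speech-style metric: 100 x (1 - boundary accuracy)."""
    return 100 * (1 - boundary_accuracy(gold, hyp, tolerance))
```

For example, gold boundaries [0.10, 0.50, 1.00, 1.40] against hypothesized [0.12, 0.60, 1.01, 1.43] give a boundary accuracy of 0.75 (the 0.60 boundary is off by 100 ms) and a boundary error rate of 25.0.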

  • should I limit the size of the training corpus in some way? I understand at some point training may take too long, but is that the only consideration that would cause me to say: beyond this amount of data, training is done and we use the model (as opposed to using the training process itself for alignment)?

I think so, yeah. Either it takes too long, or, occasionally, you can hit the point where the underlying HTK library doesn't have enough numerical precision to handle that much data and errors out. (There's a complex way to hack around this using minibatches but we haven't implemented it here yet. The Montreal Aligner tool may have something.)

  • indeed, what is a good corpus size?
  • should I expect better results using the high-quality speech corpus for training?

If there are papers or other resources that help me understand better what causes better/worse results in the alignment process, I'd be grateful to find them.

There are, of course, tons of studies comparing different aligner packages, treating them as black boxes, but off the top of my head I don't know of anything that looks directly at this question. But I don't really work in this area anymore.

miracle2k commented 4 years ago

Thank you for the extensive response Kyle, that was helpful.