Use existing alignment (in form of text grids) to generate acoustic models?

This is a common request. A decent number of people have told me they already have some phone-level alignments and want to use them for training. (So that the tedious work will be worth something?) But I don't know of any work suggesting that bootstrapping without phoneme boundaries is worse than initializing with them, despite the fact that this would be relatively easy to test in HTK or any other toolkit. If anyone can give me that evidence, I will reconsider, but without it I would actively discourage anyone from doing manual phoneme segmentation, except perhaps for evaluation purposes.

It would be possible to run an experiment in which you first use the aligner to build bootstrapped models, and then compare them to models estimated directly within HTK (just read the manual and use the binaries) from the phoneme alignments. If the latter are better, then we should reconsider.
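For what it's worth, here is a minimal sketch of how the two sets of models could be compared, assuming you hold out some manually segmented utterances and score each model's boundaries against them. The function names, the 20 ms tolerance, and the assumption that boundaries are already paired one-to-one (same phone sequence) are mine for illustration; none of this is part of the aligner or HTK.

```python
from statistics import mean, median


def boundary_errors(reference, hypothesis):
    """Absolute differences (in seconds) between paired boundary times."""
    if len(reference) != len(hypothesis):
        raise ValueError("boundary lists must be the same length")
    return [abs(r - h) for r, h in zip(reference, hypothesis)]


def summarize(reference, hypothesis, tolerance=0.020):
    """Summarize boundary displacement against held-out manual boundaries.

    `tolerance` (seconds) is an assumed threshold, not a standard value.
    """
    errors = boundary_errors(reference, hypothesis)
    return {
        "mean": mean(errors),
        "median": median(errors),
        # Fraction of boundaries placed within `tolerance` of the reference.
        "within_tolerance": sum(e <= tolerance for e in errors) / len(errors),
    }


# Made-up boundary times (seconds) for one utterance, for illustration only.
manual = [0.10, 0.25, 0.40, 0.62]
bootstrapped = [0.11, 0.24, 0.43, 0.62]
print(summarize(manual, bootstrapped))
```

You would run `summarize` once per model on the same held-out set and compare the numbers; if the models initialized from manual alignments are not clearly better, that is the evidence in question.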

I know that precise boundary detection is important for many linguistic problems, but I don't think training on manual alignments will help much with that. The estimation procedure doesn't assign any particular significance to boundaries and is not optimizing for boundary placement. If segment boundaries are extrinsically important, the best thing to do would be to treat the aligner's boundaries as a first-pass hypothesis and then use a custom-built tool to detect the "true" boundary. (This is similar to what Morgan did with VOT, for instance. We could probably come up with a much more general linguistic boundary detector with the same properties, were this important to a project.)
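To make the "first-pass hypothesis, then refine" idea concrete, here is a toy sketch. It is not Morgan's VOT method and not something the aligner provides: it just searches a small window around the aligner's boundary for the largest jump in short-time energy. The function names, the 30 ms window, and the 5 ms frame are all assumptions chosen for illustration, and a real detector would use features suited to the boundary type.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def refine_boundary(samples, sample_rate, first_pass, window=0.030, frame_dur=0.005):
    """Move a first-pass boundary (seconds) to the largest nearby energy jump."""
    frame_len = max(1, int(round(sample_rate * frame_dur)))
    step = frame_len / sample_rate  # actual frame duration after rounding
    energies = frame_energies(samples, frame_len)
    # Candidate k is the edge between frame k-1 and frame k, at time k * step.
    lo = max(1, int((first_pass - window) / step))
    hi = min(len(energies) - 1, int((first_pass + window) / step))
    if lo > hi:
        return first_pass  # nothing to search; keep the aligner's boundary
    best = max(range(lo, hi + 1), key=lambda k: abs(energies[k] - energies[k - 1]))
    return best * step


# Synthetic signal: 100 ms of silence, then 100 ms of constant amplitude.
rate = 1000
signal = [0.0] * 100 + [1.0] * 100
print(refine_boundary(signal, rate, first_pass=0.085))
```

The point of the design is that the aligner only has to get close; the second pass is free to use whatever acoustic cue actually defines the boundary you care about.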

