I keep forgetting which aligner does what, so here it is:

`Wav2Vec2ForFrameClassification` is just w2v2 with a linear layer head.

`charsiu_predictive_aligner` takes the argmax of the logits of the `Wav2Vec2ForFrameClassification` model; a sketch of that decoding is below.
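A minimal sketch of that frame-wise decoding, assuming `logits` is a `(frames, num_phones)` tensor from `Wav2Vec2ForFrameClassification` and a fixed ~20 ms frame rate; `predictive_align`, `id2phone`, and `frame_sec` are illustrative names, not the charsiu API:

```python
import torch

def predictive_align(logits: torch.Tensor, id2phone: dict, frame_sec: float = 0.02):
    """Greedy frame-wise decoding: pick the highest-scoring phone per frame,
    then merge runs of identical frames into (start, end, phone) spans."""
    phone_ids = logits.argmax(dim=-1)  # (frames,)
    spans, start = [], 0
    for t in range(1, len(phone_ids) + 1):
        # Close the current span when the label changes or the input ends.
        if t == len(phone_ids) or phone_ids[t] != phone_ids[start]:
            spans.append((start * frame_sec, t * frame_sec,
                          id2phone[int(phone_ids[start])]))
            start = t
    return spans
```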
`charsiu_forced_aligner` does g2p on a given transcript, then uses that sequence of phones to index into the logits of `Wav2Vec2ForFrameClassification` along the `phone_id` axis. DTW can then be run on the resulting phones-vs-time tensor to find the alignment (sketched below). https://github.com/lingjzhu/charsiu/blob/13a69f2a22ca0c0962b75cc693399b0ae23a12c9/src/utils.py#L304-L305
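To make the indexing-plus-DTW step concrete, here is a hand-rolled sketch; the linked utils.py uses a DTW library, so this two-move DTW (stay on a phone, or advance to the next one) is just an illustration, and it assumes at least as many frames as phones:

```python
import numpy as np
import torch

def forced_align(logits: torch.Tensor, phone_ids: list):
    """Monotonically align frames to a known phone sequence.

    logits:    (frames, num_phones) output of Wav2Vec2ForFrameClassification.
    phone_ids: g2p output for the transcript, in order.
    Returns the phone_ids index each frame is aligned to.
    """
    # Index along the phone axis: cost of emitting phone j at frame t.
    log_probs = torch.log_softmax(logits, dim=-1)
    cost = (-log_probs[:, phone_ids]).detach().cpu().numpy()  # (T, P)
    T, P = cost.shape
    # Accumulate with two moves: stay on the same phone, or advance by one.
    acc = np.full((T, P), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(1, T):
        for p in range(P):
            prev = acc[t - 1, p]
            if p > 0:
                prev = min(prev, acc[t - 1, p - 1])
            acc[t, p] = cost[t, p] + prev
    # Backtrack from the last phone at the last frame.
    path, p = [], P - 1
    for t in range(T - 1, -1, -1):
        path.append(p)
        if t > 0 and p > 0 and acc[t - 1, p - 1] <= acc[t - 1, p]:
            p -= 1
    return path[::-1]  # frame -> index into phone_ids
```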
`charsiu_attention_aligner` uses `Wav2Vec2ForAttentionAlignment`, which uses w2v2 to encode speech and a BERT to encode phonemes, plus something really over-engineered on top. DTW is the correct way to normalize the output of w2v2, and it seems that `Wav2Vec2ForAttentionAlignment` only exists because DTW was overlooked. Should this be deprecated?
`charsiu_chain_forced_aligner` does w2v2-c2c to get phonemes, then `Wav2Vec2ForAttentionAlignment` followed by DTW. Perhaps this should be replaced by `charsiu_forced_aligner`, with the phonemes obtained from w2v2-c2c; a rough sketch of that chain is below.
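A hypothetical sketch of that proposal, reusing `forced_align` from above. The greedy CTC-style decode (argmax, collapse repeats, drop blanks) and `blank_id` are stand-ins for whatever w2v2-c2c actually does, not the charsiu recognizer API:

```python
import torch

def chain_forced_align(recognizer_logits: torch.Tensor,
                       aligner_logits: torch.Tensor,
                       blank_id: int = 0):
    """Recognize a phone sequence, then force-align the audio to it.

    recognizer_logits: (frames, num_phones) from the phone recognizer.
    aligner_logits:    (frames, num_phones) from Wav2Vec2ForFrameClassification.
    """
    ids = recognizer_logits.argmax(dim=-1).tolist()
    phone_ids = [p for i, p in enumerate(ids)
                 if p != blank_id and (i == 0 or p != ids[i - 1])]
    # Same forced-alignment step as charsiu_forced_aligner, minus g2p.
    return forced_align(aligner_logits, phone_ids)
```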