I keep forgetting which aligner does what, so here it is:

`Wav2Vec2ForFrameClassification` is just w2v2 with a linear layer head.

`charsiu_predictive_aligner` takes the argmax of the logits of the `Wav2Vec2ForFrameClassification` model; a sketch of that decoding is below.
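A minimal sketch of that frame-wise decoding, assuming `logits` is a `(frames, num_phones)` tensor from `Wav2Vec2ForFrameClassification` and a fixed ~20 ms frame rate; `predictive_align`, `id2phone`, and `frame_sec` are illustrative names, not the charsiu API:

```python
import torch

def predictive_align(logits: torch.Tensor, id2phone: dict, frame_sec: float = 0.02):
    """Greedy frame-wise decoding: pick the highest-scoring phone per frame,
    then merge runs of identical frames into (start, end, phone) spans."""
    phone_ids = logits.argmax(dim=-1)  # (frames,)
    spans, start = [], 0
    for t in range(1, len(phone_ids) + 1):
        # Close the current span when the label changes or the input ends.
        if t == len(phone_ids) or phone_ids[t] != phone_ids[start]:
            spans.append((start * frame_sec, t * frame_sec,
                          id2phone[int(phone_ids[start])]))
            start = t
    return spans
```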
`charsiu_forced_aligner` does g2p on a given transcript, then uses that sequence of phones to index into the logits of `Wav2Vec2ForFrameClassification` along the `phone_id` axis. DTW can then be run on the resulting phones-vs-time tensor to find the alignment (sketched below). https://github.com/lingjzhu/charsiu/blob/13a69f2a22ca0c0962b75cc693399b0ae23a12c9/src/utils.py#L304-L305
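To make the indexing-plus-DTW step concrete, here is a hand-rolled sketch; the linked utils.py uses a DTW library, so this two-move DTW (stay on a phone, or advance to the next one) is just an illustration, and it assumes at least as many frames as phones:

```python
import numpy as np
import torch

def forced_align(logits: torch.Tensor, phone_ids: list):
    """Monotonically align frames to a known phone sequence.

    logits:    (frames, num_phones) output of Wav2Vec2ForFrameClassification.
    phone_ids: g2p output for the transcript, in order.
    Returns the phone_ids index each frame is aligned to.
    """
    # Index along the phone axis: cost of emitting phone j at frame t.
    log_probs = torch.log_softmax(logits, dim=-1)
    cost = (-log_probs[:, phone_ids]).detach().cpu().numpy()  # (T, P)
    T, P = cost.shape
    # Accumulate with two moves: stay on the same phone, or advance by one.
    acc = np.full((T, P), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(1, T):
        for p in range(P):
            prev = acc[t - 1, p]
            if p > 0:
                prev = min(prev, acc[t - 1, p - 1])
            acc[t, p] = cost[t, p] + prev
    # Backtrack from the last phone at the last frame.
    path, p = [], P - 1
    for t in range(T - 1, -1, -1):
        path.append(p)
        if t > 0 and p > 0 and acc[t - 1, p - 1] <= acc[t - 1, p]:
            p -= 1
    return path[::-1]  # frame -> index into phone_ids
```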
`charsiu_attention_aligner` uses `Wav2Vec2ForAttentionAlignment`, which uses w2v2 to encode speech and a BERT to encode phonemes, plus something really over-engineered on top. DTW is the correct way to normalize the output of w2v2, and it seems that `Wav2Vec2ForAttentionAlignment` only exists because DTW was overlooked. Should this be deprecated?
`charsiu_chain_forced_aligner` does w2v2-c2c to get phonemes, then `Wav2Vec2ForAttentionAlignment` followed by DTW. Perhaps this should be replaced by `charsiu_forced_aligner`, with the phonemes obtained from w2v2-c2c; a rough sketch of that chain is below.
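A hypothetical sketch of that proposal, reusing `forced_align` from above. The greedy CTC-style decode (argmax, collapse repeats, drop blanks) and `blank_id` are stand-ins for whatever w2v2-c2c actually does, not the charsiu recognizer API:

```python
import torch

def chain_forced_align(recognizer_logits: torch.Tensor,
                       aligner_logits: torch.Tensor,
                       blank_id: int = 0):
    """Recognize a phone sequence, then force-align the audio to it.

    recognizer_logits: (frames, num_phones) from the phone recognizer.
    aligner_logits:    (frames, num_phones) from Wav2Vec2ForFrameClassification.
    """
    ids = recognizer_logits.argmax(dim=-1).tolist()
    phone_ids = [p for i, p in enumerate(ids)
                 if p != blank_id and (i == 0 or p != ids[i - 1])]
    # Same forced-alignment step as charsiu_forced_aligner, minus g2p.
    return forced_align(aligner_logits, phone_ids)
```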