teinhonglo commented 12 months ago

Firstly, I want to express my admiration for the exceptional work accomplished here!

Recently, I've been running into an issue while using the ESPnet2 Branchformer model. Although I followed the instructions here, the alignment results are poor. This happened when I trained the model with phone-level transcriptions.

To understand this issue further, I experimented with two different token types, shown below. Both models reach an accuracy of 95+%.

BPE-level tokens:

Phone-level tokens:

I would appreciate your guidance and insights to help me resolve these alignment issues.

Thank you in advance. Tien-Hong

lumaku commented 12 months ago

Hey Tien-Hong, thanks for writing this issue. Glad to see that this algorithm is useful for you!

I assume that your screenshots include the alignments with the corresponding token scores? Inspecting these scores for the BPE-level tokens:

di55 has a score of 0.00, while nga55 has a score of -4.85, and xiag2 has a score of -8.3677. These probabilities are quite bad.
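To put these scores in perspective, they can be converted back to linear probabilities. This is only a minimal sketch: it assumes the scores are natural-log probabilities, which the thread does not state explicitly, and the helper name `to_probability` is made up for illustration.

```python
import math

# Scores quoted above. Assumption: they are natural-log probabilities.
scores = {"di55": 0.00, "nga55": -4.85, "xiag2": -8.3677}


def to_probability(log_score: float) -> float:
    """Convert a natural-log score back to a linear probability."""
    return math.exp(log_score)


for token, score in scores.items():
    print(f"{token}: log score {score:.4f} -> probability {to_probability(score):.6f}")
```

Under that assumption, nga55 comes out below 1% and xiag2 far below that, which is why these tokens look badly aligned.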

Inspection of the Phone-level tokens:

Here, the token probabilities are mostly -0.000, which is unusually good (although this can happen with Transformer models) and may indicate a numerical problem.
The timing seems to be shifted by ~300 ms.

I recommend re-checking the following parameters:

The duration of the tokens seems to be unusually long. Maybe the timing variables need to be adapted (also, check the correct sample rate).
Subsampling: CTC accuracy depends on the ratio of tokens to CTC frames. I had good results with 3 frames for each token on average (to get blank tokens classified in-between). If you directly switched from BPE to phones, you may still need to adapt the subsampling ratio.
Check the performance of your CTC network: The alignments are only as good as the CTC output of the network itself. What was the CTC weight parameter during your training? If you decode CTC-only on your test set, how good/bad is the ASR performance compared to hybrid CTC/attention decoding?
Transformer models usually lose accuracy at the beginning and at the end of the aligned audio; adding suitable padding to the audio file may help.
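The timing, subsampling, and padding checks above can be sketched roughly as follows. This is a plain-Python illustration, not ESPnet or ctc-segmentation code: the function names are hypothetical, and the hop length (160 samples) and subsampling factor (4) are assumed example values that should be replaced with the ones from your own training config.

```python
def frame_shift_seconds(sample_rate: int, hop_length: int, subsampling: int) -> float:
    """Duration covered by one CTC output frame, in seconds."""
    return hop_length * subsampling / sample_rate


def frames_per_token(num_samples: int, num_tokens: int,
                     hop_length: int, subsampling: int) -> float:
    """Approximate average number of CTC frames available per token."""
    num_frames = num_samples // (hop_length * subsampling)
    return num_frames / num_tokens


def pad_with_silence(samples, sample_rate: int, pad_seconds: float):
    """Add silence on both sides; return the padded samples and the
    offset (in seconds) to subtract from the resulting timings."""
    pad = [0.0] * int(round(pad_seconds * sample_rate))
    return pad + list(samples) + pad, len(pad) / sample_rate


# Assumed example values, not taken from the configs in this thread.
sample_rate = 16000   # Hz, as reported for the audio
hop_length = 160      # samples per feature frame (assumption)
subsampling = 4       # encoder subsampling factor (assumption)
audio = [0.0] * (10 * sample_rate)  # stand-in for a 10 s utterance
num_tokens = 60       # stand-in for the number of phone tokens

print("seconds per CTC frame:", frame_shift_seconds(sample_rate, hop_length, subsampling))
print("CTC frames per token:", frames_per_token(len(audio), num_tokens, hop_length, subsampling))

padded, offset = pad_with_silence(audio, sample_rate, pad_seconds=0.5)
print("padded length:", len(padded), "samples; subtract", offset, "s from the timings")
```

If the frames-per-token value falls well below the roughly 3 frames mentioned above, a lower subsampling factor may be worth trying. If padding is added, the aligned timestamps need to be shifted back by the returned offset.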

Other, model-related issues may also cause such misalignments. Give me a few days to find the time to investigate Branchformer alignments with an English language model.

teinhonglo commented 12 months ago

Thank you for your kind response.

I have evaluated our trained model's CTC performance and verified the audio's sampling rate. The sampling rate of the audio is 16000.

All configs I used are listed:

The CTC-ATT and CTC-only decoding performance of the model is listed below (the CTC-only results carry the ctc suffix). I also tried a Conformer-type encoder, but unfortunately the alignment results remained as poor as with the Branchformer-type encoder.

BPE-level tokens

exp/asr_train_asr_branchformer_raw_en_bpe735_sp

WER

dataset	Snt	Wrd	Corr	Sub	Del	Ins	Err	S.Err
decode_asr_branchformer_asr_model_valid.acc.ave/test	2187	37456	94.77	4.97	0.27	0.07	5.30	43.35
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test	2187	37456	92.31	7.14	0.55	0.25	7.94	53.54

Phone-level tokens

exp/asr_train_asr_branchformer_raw_en_word_sp

WER

dataset	Snt	Wrd	Corr	Sub	Del	Ins	Err	S.Err
decode_asr_branchformer_asr_model_valid.acc.ave/test	2187	37456	94.82	5.01	0.17	0.07	5.25	42.25
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test	2187	37456	94.52	5.29	0.18	0.08	5.56	44.67

Do you have any further suggestions?