lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0
Obtaining unusual alignment results while using the ESPnet2 Branchformer model. #30

Open teinhonglo opened 12 months ago

teinhonglo commented 12 months ago

Firstly, I want to express my admiration for the exceptional work accomplished here!

Recently, I've been facing an issue while using the ESPnet2 Branchformer model. Despite following the instructions here, I get poor alignment results. This happens when the model is trained with phone-level transcriptions.

To understand the issue further, I experimented with two different token types; the details are given below. The accuracy of both models is 95+%.

BPE-level tokens:

image

Phone-level tokens:

image

I would appreciate your guidance and insights to help me resolve these alignment issues.

Thank you in advance. Tien-Hong

lumaku commented 12 months ago

Hey Tien-Hong, thanks for writing this issue. Glad to see that this algorithm is useful for you!

I assume that your screenshots show the alignments with the corresponding token scores? Inspecting these scores for the BPE-level tokens:

Inspection of the Phone-level tokens:

I recommend re-checking the following parameters:

Other issues that may be model-related can also cause such misalignments. Give me a few days to find the time to investigate Branchformer alignments with an English language model.
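As background for the token scores discussed above: one common way to summarize the confidence of an aligned segment is the lowest mean log-probability over any window of consecutive frames, so that a short, badly aligned stretch pulls the whole score down. The following is a minimal sketch under that assumption. The function name `min_mean_score`, the window length of 30 frames, and the example values are illustrative; they are not taken from this thread and may not match the library's implementation.

```python
def min_mean_score(log_probs, window=30):
    """Return the lowest mean log-probability over any `window` consecutive frames.

    `log_probs` is a list of per-frame log-probabilities (values <= 0) for one
    aligned segment. Segments shorter than the window are averaged as a whole.
    """
    if not log_probs:
        return 0.0
    if len(log_probs) <= window:
        return sum(log_probs) / len(log_probs)
    worst = 0.0
    # Slide the window over the segment and keep the worst (lowest) mean.
    for start in range(len(log_probs) - window + 1):
        mean = sum(log_probs[start:start + window]) / window
        worst = min(worst, mean)
    return worst


# Hypothetical segment: 40 confident frames followed by 30 poorly aligned ones.
well_aligned = [-0.1] * 40
poorly_aligned = [-5.0] * 30
print(min_mean_score(well_aligned))
print(min_mean_score(well_aligned + poorly_aligned))
```

With a score of this kind, a segment whose value is far below that of its neighbours is a candidate for a misalignment, even if most of its frames are aligned well.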

teinhonglo commented 12 months ago

Thank you for your kind response.

I have evaluated our trained model's CTC performance and verified the audio's sampling rate, which is 16000 Hz.
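The sampling rate matters because alignment timestamps are derived from CTC frame indices, and the duration of one frame depends on it. A minimal sketch of that arithmetic follows, assuming a feature hop length of 160 samples and an encoder subsampling factor of 4. Both values are hypothetical placeholders and must be replaced with the values from the actual model config; only the 16000 Hz rate comes from this thread.

```python
def frame_duration_seconds(sample_rate_hz, hop_length_samples, subsampling_factor):
    """Duration in seconds covered by one CTC output frame."""
    return hop_length_samples * subsampling_factor / sample_rate_hz


def frame_to_seconds(frame_index, frame_duration):
    """Convert a CTC frame index to a timestamp in seconds."""
    return frame_index * frame_duration


SAMPLE_RATE_HZ = 16000      # sampling rate reported in this thread
HOP_LENGTH_SAMPLES = 160    # hypothetical: take this from the frontend config
SUBSAMPLING_FACTOR = 4      # hypothetical: take this from the encoder config

correct = frame_duration_seconds(SAMPLE_RATE_HZ, HOP_LENGTH_SAMPLES, SUBSAMPLING_FACTOR)
# The same frame index mapped with a wrongly assumed 8000 Hz rate.
mismatched = frame_duration_seconds(8000, HOP_LENGTH_SAMPLES, SUBSAMPLING_FACTOR)

print(frame_to_seconds(250, correct))
print(frame_to_seconds(250, mismatched))
```

If any of the three values used for the time conversion differs from what the model was trained with, every timestamp is scaled by the same factor, which shows up as alignments that drift further from the audio the later they occur.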

All configs I used are listed:

The model's performance with both CTC-ATT decoding and CTC decoding (the rows with the ctc suffix) is listed below. Additionally, I tried a Conformer encoder, but unfortunately the alignment results remained as poor as with the Branchformer encoder.

BPE-level tokens

exp/asr_train_asr_branchformer_raw_en_bpe735_sp

WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_asr_branchformer_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.77 | 4.97 | 0.27 | 0.07 | 5.30 | 43.35 |
| decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test | 2187 | 37456 | 92.31 | 7.14 | 0.55 | 0.25 | 7.94 | 53.54 |

Phone-level tokens

exp/asr_train_asr_branchformer_raw_en_word_sp

WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_asr_branchformer_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.82 | 5.01 | 0.17 | 0.07 | 5.25 | 42.25 |
| decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.52 | 5.29 | 0.18 | 0.08 | 5.56 | 44.67 |
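For reading the result rows above: in this scoring format the `Err` column is the sum of the `Sub`, `Del`, and `Ins` percentages, up to rounding. The short sketch below re-checks that for the four rows and computes how much worse CTC-only decoding is than CTC-ATT decoding for each token type. The dictionary keys and the helper name are illustrative; the numbers are copied from the rows above.

```python
# (Sub, Del, Ins, Err) percentages copied from the result rows above.
RESULTS = {
    ("bpe", "ctc_att"): (4.97, 0.27, 0.07, 5.30),
    ("bpe", "ctc"): (7.14, 0.55, 0.25, 7.94),
    ("phone", "ctc_att"): (5.01, 0.17, 0.07, 5.25),
    ("phone", "ctc"): (5.29, 0.18, 0.08, 5.56),
}


def err_matches(sub, dele, ins, err, tolerance=0.02):
    """Check Err == Sub + Del + Ins, allowing for two-decimal rounding."""
    return abs((sub + dele + ins) - err) <= tolerance


for key, (sub, dele, ins, err) in RESULTS.items():
    print(key, "consistent" if err_matches(sub, dele, ins, err) else "inconsistent")

# Absolute increase in Err when decoding with CTC only.
ctc_gap = {
    tokens: RESULTS[(tokens, "ctc")][3] - RESULTS[(tokens, "ctc_att")][3]
    for tokens in ("bpe", "phone")
}
for tokens, gap in ctc_gap.items():
    print(tokens, round(gap, 2))
```

The CTC-only error stays within a few percentage points of the joint decoding result for both token types, so by this measure the CTC output of neither model looks degenerate.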

Do you have any further suggestions?

Thank you in advance. Tien-Hong