Open teinhonglo opened 12 months ago
Hey Tien-Hong, thanks for writing this issue. Glad to see that this algorithm is useful for you!
I assume that your screenshots include the alignments with the corresponding token scores? Inspecting these scores for the BPE-level tokens:
Inspecting the phone-level tokens:
I recommend re-checking the following parameters:
Such misalignments may also have model-related causes. Give me a few days to find the time to investigate Branchformer alignments with an English language model.
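One parameter worth double-checking is the time resolution of the CTC output indices, since a mismatch there shifts every segment boundary. A minimal sketch of the sanity check, assuming a 10 ms frontend frame shift and 4x encoder subsampling (both values are hypothetical here; take them from your own ESPnet2 config):

```python
# Sanity check for the time covered by one CTC output index.
# Assumed values -- replace with the ones from your ESPnet2 config:
frame_shift = 0.01  # 10 ms frontend hop (hypothetical)
subsampling = 4     # Conv2d subsampling of the encoder (hypothetical)

# Each CTC output frame then spans frame_shift * subsampling seconds,
# which is the value the segmentation expects as its index duration.
index_duration = frame_shift * subsampling
print(f"index_duration = {index_duration:.3f} s per CTC frame")
```

If the configured index duration disagrees with this product, alignments stretch or compress proportionally across the whole utterance.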
Thank you for your kind response.
I have evaluated our trained model's CTC performance and verified the audio's sampling rate: the audio is sampled at 16000 Hz.
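For reference, the sampling-rate check can be done with the stdlib `wave` module alone. A self-contained sketch (it synthesizes a 16 kHz file in memory; in practice, point `wave.open()` at your own WAV path instead):

```python
import io
import wave

# Build a tiny 16 kHz mono WAV in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)  # 10 ms of silence

# Read it back and inspect the declared sample rate.
buf.seek(0)
with wave.open(buf, "rb") as w:
    print("sample rate:", w.getframerate())  # prints 16000 for this file
```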
All configs I used are listed:
Below are both the CTC-ATT and pure CTC decoding results of the model (rows with the ctc suffix are CTC decoding). Additionally, I tried a Conformer-type encoder, but unfortunately the alignment results were just as poor as with the Branchformer-type encoder:
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_branchformer_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.77 | 4.97 | 0.27 | 0.07 | 5.30 | 43.35 |
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test | 2187 | 37456 | 92.31 | 7.14 | 0.55 | 0.25 | 7.94 | 53.54 |
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_branchformer_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.82 | 5.01 | 0.17 | 0.07 | 5.25 | 42.25 |
decode_asr_branchformer_ctc_asr_model_valid.acc.ave/test | 2187 | 37456 | 94.52 | 5.29 | 0.18 | 0.08 | 5.56 | 44.67 |
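As a quick sanity check on the scoring output above, each row's overall error rate should be the sum of its substitution, deletion, and insertion rates, up to per-column rounding. A short script over the reported numbers:

```python
# (Sub, Del, Ins, Err) in percent, copied from the tables above.
rows = [
    (4.97, 0.27, 0.07, 5.30),
    (7.14, 0.55, 0.25, 7.94),
    (5.01, 0.17, 0.07, 5.25),
    (5.29, 0.18, 0.08, 5.56),
]
for sub, dele, ins, err in rows:
    # Allow 0.02 slack because each column is rounded independently.
    assert abs(err - (sub + dele + ins)) <= 0.02, (sub, dele, ins, err)
print("error-rate columns are internally consistent")
```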
Do you have any further suggestions?
Thank you in advance. Tien-Hong
Firstly, I want to express my admiration for the exceptional work accomplished here!
Recently, I've been facing an issue while using the ESPnet2 Branchformer model. Despite following the instructions here, I got poor alignment results. The issue occurred when I trained the model with phone-level transcriptions.
To investigate this further, I experimented with two different token types, the details of which are as follows. The accuracy of both models is above 95%.
BPE-level tokens:
Phone-level tokens:
I would appreciate your guidance and insights to help me resolve these alignment issues.
Thank you in advance. Tien-Hong