k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Extract framewise alignment information by the pretrained model #188

Open TianyuCao opened 2 years ago

TianyuCao commented 2 years ago

Hi,

I am new to icefall. I would like to extract framewise alignment information like what is shown in #39 with the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09. I tried to follow the README in egs/librispeech/ASR/conformer_ctc/README.md. However, when I ran egs/librispeech/ASR/conformer_ctc/ali.py as "./conformer_ctc/ali.py --exp-dir ./conformer_ctc/exp --lang-dir ./data/lang_bpe_500 --epoch 20 --avg 10 --max-duration 300 --dataset train-clean-100 --out-dir data/ali", I found that there are no checkpoint files (e.g., in conformer_ctc/exp) uploaded for the pretrained model to average.

I wonder whether I missed something and where I can find an example of extracting framewise alignment information with the pretrained model to get results similar to those shown in #39. Many thanks for your help in advance!

csukuangfj commented 2 years ago

Could you first follow the README.md in https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 to download the pre-trained model?

The pre-trained model is called pretrained.pt. You can create a symlink to it in conformer_ctc/exp/epoch-999.pt and use --epoch 999 --avg 1 when invoking conformer_ctc/ali.py.
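If it helps, here is a minimal Python sketch of that step (the download path below is an assumption; adjust it to wherever you cloned the Hugging Face repo, or simply use ln -s as in the README):

# Sketch: point conformer_ctc/exp/epoch-999.pt at the downloaded pretrained.pt.
# The download location is hypothetical; adjust it to your setup.
import os

pretrained = os.path.abspath(
    "icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/exp/pretrained.pt"
)
os.makedirs("conformer_ctc/exp", exist_ok=True)
os.symlink(pretrained, "conformer_ctc/exp/epoch-999.pt")

After that, invoke conformer_ctc/ali.py with --epoch 999 --avg 1 as described above.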

TianyuCao commented 2 years ago

Thank you for your clarifications! I can now obtain three files, aux_labels_test-clean.h5, labels_test-clean.h5 and cuts_test-clean, by using --epoch 999 --avg 1 when invoking conformer_ctc/ali.py. However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0, 0, 58, 0, 0, 72, 0, 0, 0, 0, 161, 0, 0, 340, 0, 0, 0, 207, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 187, 0, 150, 0, 8, 0, 0, 0, 0, 42, 0, 0, 0, 0, 74, 0, 0, 0, 0, 66, 0, 0, 0, 0, 0, 0, 0, 263, 0, 0, 0, 0, 0, 29, 0, 0, 0, 78, 0, 0, 38, 0, 29, 0, 0, 0, 209, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 167, 0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 0, 0, 0, 0, 0, 10, 0, 0, 0, 4, 0, 0, 0, 139, 0, 13, 0, 0, 275, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 137, 0, 0, 0, 92, 0, 0, 0, 0, 4, 0, 0, 0, 0, 59, 0, 3, 0, 48, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 110, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 42, 0, 0, 0, 17, 0, 0, 29, 0, 0, 0, 62, 0, 0, 0, 0, 0, 127, 0, 0, 58, 0, 8, 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

After checking against tokens.txt, I could recover the transcript. I am just wondering what the time interval between two consecutive elements in this list is, so that I can calculate the time corresponding to each word in the transcript. Many thanks for your help in advance!

csukuangfj commented 2 years ago

However, when trying to read the data from aux_labels_test-clean.h5 for the test audio in #39, e.g., librispeech/LibriSpeech/test-clean/8224/274384/8224-274384-0008.flac, I just get something like this:

[4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286, 0, 0, 0, 0, 0, 298, 0, 0, 0, 0, 276, 0, 12, 0, 0, 5, 0, 0, 28, 12, 0, 27, 0, 0, 0, 0, 209, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 210, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 134, 0,

These numbers are the corresponding token IDs for the output frames. To get the results of https://github.com/k2-fsa/icefall/pull/39, you have to do a few extra things.

(1) Note the subsampling factor of the model is 4, so output frames 0, 1, 2 correspond to input frames 0, 4, 8. You have to use interpolation to get the alignments for input frames 1, 2, 3, 5, 6, 7, etc.

(2) The default frame shift is 10 ms, so you can convert an input frame index to time in seconds by multiplying it by 0.01; equivalently, multiply an output frame index by 0.04.

(3) You have to use tokens.txt to map those integer token IDs to the corresponding symbols.
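Putting the three points together, a minimal sketch (assuming the alignment list printed above, that tokens.txt has lines of the form "<symbol> <id>", and that ID 0 is the blank) could look like this:

# Sketch: turn a framewise token-ID alignment into (time, symbol) pairs.
# Assumptions: subsampling factor 4, frame shift 10 ms, blank ID 0.
subsampling_factor = 4
frame_shift = 0.01  # seconds

# Load the id -> symbol mapping from tokens.txt.
id2sym = {}
with open("data/lang_bpe_500/tokens.txt") as f:
    for line in f:
        sym, idx = line.split()
        id2sym[int(idx)] = sym

ali = [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 286]  # truncated example from above

for output_frame, token_id in enumerate(ali):
    if token_id == 0:  # skip blank frames
        continue
    time = output_frame * subsampling_factor * frame_shift  # 0.04 s per output frame
    print(f"{time:.2f}s\t{id2sym[token_id]}")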

Just wondering what the time interval between two consecutive elements in this list is, to calculate the time corresponding to each word in the transcript

The time interval between two consecutive output frames is 0.04 s. As we are using wordpieces, and the first wordpiece of a word starts with the marker ▁, you can use this information to find the starting frame of a word. Unfortunately, it is not easy to find the ending frame of a word.
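For example, continuing the sketch above (ali and id2sym as defined there), one could collect word start times like this:

# Sketch: a word starts wherever a non-blank wordpiece begins with the "▁" marker.
word_starts = []
for output_frame, token_id in enumerate(ali):
    if token_id == 0:  # skip blanks
        continue
    sym = id2sym[token_id]
    if sym.startswith("\u2581"):  # "▁", the word-boundary marker of the BPE model
        word_starts.append((output_frame * 0.04, sym))

print(word_starts)  # e.g. [(0.0, '▁THE'), ...] for the utterance above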

TianyuCao commented 2 years ago

Thank you very much for your detailed explanations. I have obtained almost the same results as #39, except that for the first wordpiece ▁THE (token ID 4), using 0 * 0.04 = 0 s means the first word "THE" starts immediately in this audio, which does not match the alignment information from https://github.com/CorentinJ/librispeech-alignments (0.500 s) or the result shown in #39 (0.48 s).

['▁THE', '▁GOOD', '▁NA', 'TURE', 'TURE', 'D', 'D', 'D', '▁A', '▁A', 'U', 'D', 'D', 'I', 'I', 'ENCE', '▁IN', '▁P', 'ITY', '▁TO', '▁FA', 'LL', 'LL', 'EN', '▁MA', 'J', 'J', 'EST', 'Y', 'Y', '▁SH', 'OW', 'ED', 'ED', '▁FOR', '▁ON', 'CE', 'CE', '▁GREAT', 'ER', '▁DE', 'F', 'F', 'ER', 'ER', 'ENCE', '▁TO', '▁THE', '▁K', 'ING', 'ING', '▁THAN', '▁TO', '▁THE', '▁MI', 'N', 'IST', 'ER', '▁AND', '▁SU', 'NG', 'NG', '▁THE', '▁P', '▁P', 'S', 'S', 'AL', 'AL', 'M', 'M', '▁WHICH', '▁THE', '▁FOR', 'M', 'ER', '▁HAD', '▁CA', 'LL', 'LL', 'ED', 'ED', '▁FOR'] [0.0, 0.64, 0.88, 1.08, 1.12, 1.16, 1.2, 1.24, 1.28, 1.32, 1.4000000000000001, 1.44, 1.48, 1.52, 1.56, 1.72, 2.0, 2.24, 2.44, 2.64, 2.88, 3.0, 3.04, 3.12, 3.3200000000000003, 3.44, 3.48, 3.6, 3.72, 3.7600000000000002, 4.68, 4.76, 4.84, 4.88, 5.04, 5.24, 5.44, 5.48, 5.76, 6.0, 6.16, 6.28, 6.32, 6.36, 6.4, 6.5200000000000005, 6.72, 6.88, 7.04, 7.16, 7.2, 7.88, 8.120000000000001, 8.28, 8.44, 8.52, 8.64, 8.76, 9.64, 9.92, 10.08, 10.120000000000001, 10.28, 10.48, 10.52, 10.56, 10.6, 10.64, 10.68, 10.72, 10.76, 11.120000000000001, 11.4, 11.56, 11.72, 11.84, 12.0, 12.24, 12.36, 12.4, 12.44, 12.48, 12.6]

Is there any chance you could explain how you determine the starting frame of the first word in the general case?

danpovey commented 2 years ago

The alignment is never going to be exact in any end-to-end setup, especially one, like a transformer, that consumes unlimited left/right context.

Jianjie-Shi commented 2 years ago

Hi guys,

I also ran into the same problem when doing alignment myself. Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately, e.g., 0.48 s compared with the ground truth 0.5 s, while the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 gives a worse result, e.g., 0 s compared with 0.5 s?

It seems that both the model in #17 and the pretrained model from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09 are CTC models. What differences between these two models could lead to this result?

danpovey commented 2 years ago

Too-powerful models can give poor alignments as they transform the data too much. Often the best alignments are from GMM systems.

csukuangfj commented 2 years ago

Could I ask why the model from #17 used in #39 can determine the starting frame of the first word "the" accurately

Could you try the pre-trained model from the following repo? https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500

That model has a higher WER on test-clean than the one from https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09, i.e., 2.56 vs 2.42.

They have basically the same model configuration, i.e., you can load the pre-trained model with the same code without modifications. I just tried it on the first utterance of test-clean using the master branch; the following shows the debug output:

(Pdb) p supervisions
{'text': ["NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE"], 'sequence_idx': tensor([0], dtype=torch.int32), 'start_frame': tensor([0], dtype=torch.int32), 'num_frames': tensor([487], dtype=torch.int32), 'cut': [MonoCut(id='260-123440-0011-1193-0',
start=0, duration=4.87, channel=0, supervisions=[SupervisionSegment(id='260-123440-0011', recording_id='260-123440-0011', start=0.0, duration=4.87, channel=0, text="NO I'VE MADE UP MY MIND ABOUT IT IF I'M MABEL I'LL STAY DOWN HERE", language='English', speaker='260',
gender=None, custom=None, alignment=None)], features=Features(type='fbank', num_frames=487, num_features=80, frame_shift=0.01, sampling_rate=16000, start=0, duration=4.87, storage_type='lilcom_hdf5', storage_path='data/fbank/feats_test-clean/feats-5.h5', storage_key='575aacae-38c5-45ec-9db9-0e3085e490be', recording_id=None, channels=0), recording=Recording(id='260-123440-0011', sources=[AudioSource(type='file', channels=[0], source='data/LibriSpeech/test-clean/260/123440/260-123440-0011.flac')], sampling_rate=16000, num_samples=77920, duration=4.87, transforms=None), custom=None)]}
(Pdb) p labels_ali
[[0, 0, 0, 0, 0, 0, 0, 94, 0, 0, 0, 0, 0, 0, 19, 45, 45, 75, 0, 300, 0, 0, 0, 0, 176, 0, 0, 105, 0, 0, 0, 139, 0, 0, 68, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 165, 0, 0, 0, 0, 19, 0, 45, 45, 17, 0, 0, 161, 0, 0, 41, 41, 131, 131, 0, 0, 0, 0, 0, 19, 0, 0, 45, 58, 58, 58, 0, 0, 0, 0, 277, 0, 0, 0, 16, 16, 0, 0, 294, 0, 0, 0, 0, 0, 0, 22, 0, 0, 26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

You can see that the first token does not start at the very beginning. You can also compare with the timestamps from https://github.com/CorentinJ/librispeech-alignments; I list them below for easier reference.

260-123440-0011 ",NO,I'VE,MADE,UP,MY,MIND,ABOUT,IT,,IF,,I'M,MABEL,,I'LL,STAY,DOWN,HERE," "0.220,0.600,0.760,0.950,1.060,1.190,1.480,1.840,2.000,2.220,2.440,2.470,2.720,3.200,3.230,3.500,3.950,4.220,4.660,4.87"
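In case it is useful, here is a small sketch for parsing such a line (assuming the librispeech-alignments format of an utterance ID followed by quoted comma-separated labels and segment end times, with empty labels marking silence):

# Sketch: parse one line of the librispeech-alignments format shown above.
line = '260-123440-0011 ",NO,I\'VE,MADE,UP,MY,MIND,ABOUT,IT,,IF,,I\'M,MABEL,,I\'LL,STAY,DOWN,HERE," "0.220,0.600,0.760,0.950,1.060,1.190,1.480,1.840,2.000,2.220,2.440,2.470,2.720,3.200,3.230,3.500,3.950,4.220,4.660,4.87"'

utt_id, rest = line.split(" ", 1)
labels_str, times_str = [part.strip('"') for part in rest.split('" "')]
labels = labels_str.split(",")
end_times = [float(t) for t in times_str.split(",")]

# Each segment spans from the previous end time to its own end time.
start = 0.0
for label, end in zip(labels, end_times):
    print(f"{start:.3f} - {end:.3f}\t{label if label else '<silence>'}")
    start = end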


Dan's comment may explain why those two models produce different alignments.

Too-powerful models can give poor alignments as they transform the data too much. Often the best alignments are from GMM systems.

TianyuCao commented 2 years ago

Sorry to bother you again. I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now. I can see that in ali.py only LibriSpeech subsets can be used to compute alignments. If I need to compute alignments for my own dataset, what steps should I follow, e.g., to generate fbank features and manifests for my dataset?

parser.add_argument(
    "--dataset",
    type=str,
    required=True,
    help="""The name of the dataset to compute alignments for.
    Possible values are:
      - test-clean
      - test-other
      - train-clean-100
      - train-clean-360
      - train-other-500
      - dev-clean
      - dev-other
    """,
)

csukuangfj commented 2 years ago

I just wonder whether the pretrained model can be used to extract framewise alignment information for our own datasets now.

You can try that and look at the resulting alignments. You will probably need to train your own model.


If I need to compute alignments for my own dataset, what steps should I follow, e.g., to generate fbank features and manifests for my dataset?

Possible steps are listed below (a rough sketch of steps (1) and (2) is given after step (5)):

(1) Prepare your data. Please see https://lhotse.readthedocs.io/en/latest/corpus.html#adding-new-corpora for more information. You can find recipes for various datasets in https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes

(2) Follow https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh to extract features for your dataset

(3) Adapt https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py to your dataset

(4) Train a model for your dataset. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py

(5) Get alignments. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/ali.py
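For steps (1) and (2), a rough sketch of the lhotse side (file names, paths, and the one-wav-one-transcript layout are assumptions; the linked recipes and prepare.sh are the real reference):

# Sketch: build lhotse manifests for a custom dataset and compute fbank features.
from lhotse import (
    CutSet,
    Fbank,
    FbankConfig,
    LilcomHdf5Writer,
    Recording,
    RecordingSet,
    SupervisionSegment,
    SupervisionSet,
)

# Hypothetical corpus: a list of (wav path, transcript) pairs.
corpus = [("data/my_corpus/utt1.wav", "HELLO WORLD")]

recordings = []
supervisions = []
for wav, text in corpus:
    rec = Recording.from_file(wav)
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=rec.id,
            recording_id=rec.id,
            start=0.0,
            duration=rec.duration,
            channel=0,
            text=text,
        )
    )

cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings(recordings),
    supervisions=SupervisionSet.from_segments(supervisions),
)

# 80-dim fbank, matching the LibriSpeech recipe.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank/my_corpus_feats",
    storage_type=LilcomHdf5Writer,
)
cuts.to_file("data/fbank/cuts_my_corpus.jsonl.gz")

The resulting cuts manifest is what an asr_datamodule.py adapted in step (3) would load.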