k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Error while running compute_fbank_aishell.py #290

Closed by is2022 2 years ago

is2022 commented 2 years ago

Hi, while reproducing the Aishell egs, I get the following error. Any idea what I am doing wrong? Thanks.

2022-04-04 13:40:17 (prepare.sh:58:main) stage 0: Download data
2022-04-04 13:40:17 (prepare.sh:89:main) Stage 1: Prepare aishell manifest
2022-04-04 13:40:19 (prepare.sh:100:main) Stage 2: Prepare musan manifest
2022-04-04 13:40:19 (prepare.sh:104:main) It may take 6 minutes
2022-04-04 13:41:34,837 WARNING [qa.py:116] There are 15 recordings that do not have any corresponding supervisions in the SupervisionSet.
2022-04-04 13:43:15 (prepare.sh:112:main) Stage 3: Compute fbank for aishell
2022-04-04 13:43:17,119 INFO [compute_fbank_aishell.py:67] Processing train
Traceback (most recent call last):
  File "./local/compute_fbank_aishell.py", line 111, in <module>
    compute_fbank_aishell(num_mel_bins=args.num_mel_bins)
  File "./local/compute_fbank_aishell.py", line 86, in compute_fbank_aishell
    storage_type=LilcomHdf5Writer,
  File "/usr/local/lib/python3.6/dist-packages/lhotse/cut.py", line 4483, in compute_and_store_features
    cut_sets = self.split(num_jobs, shuffle=True)
  File "/usr/local/lib/python3.6/dist-packages/lhotse/cut.py", line 3635, in split
    self, num_splits=num_splits, shuffle=shuffle, drop_last=drop_last
  File "/usr/local/lib/python3.6/dist-packages/lhotse/utils.py", line 334, in split_sequence
    f"Cannot split iterable into more chunks ({num_splits}) than its number of items {num_items}"
ValueError: Cannot split iterable into more chunks (15) than its number of items 0

csukuangfj commented 2 years ago

Did you make any changes to local/compute_fbank_aishell.py?

The error log

ValueError: Cannot split iterable into more chunks (15) than its number of items 0

says that your cut_set is empty. Can you post the output of print(len(cut_set))? You can put it just before the following line: https://github.com/k2-fsa/icefall/blob/87cf9231ea73631f1e4453400b3be06d45bcebf5/egs/aishell/ASR/local/compute_fbank_aishell.py#L78
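To illustrate why an empty cut_set leads to this message, here is a minimal sketch of the kind of check that raises it. The function split_into_chunks is a hypothetical, simplified stand-in written for this note; it is not lhotse's actual split_sequence, and only the error text is taken from the log above.

```python
def split_into_chunks(items, num_splits):
    """Split a list into num_splits contiguous, nearly equal chunks.

    Simplified stand-in for illustration only; not lhotse's implementation.
    """
    num_items = len(items)
    if num_splits > num_items:
        # Same wording as the ValueError in the log above.
        raise ValueError(
            f"Cannot split iterable into more chunks ({num_splits}) "
            f"than its number of items {num_items}"
        )
    chunk_size, remainder = divmod(num_items, num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        # The first `remainder` chunks get one extra item each.
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

With 15 jobs and zero cuts, the guard at the top fires before any splitting happens, so the message points at the input being empty rather than at the feature extraction itself.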

csukuangfj commented 2 years ago

2022-04-04 13:40:17 (prepare.sh:89:main) Stage 1: Prepare aishell manifest
2022-04-04 13:40:19 (prepare.sh:100:main) Stage 2: Prepare musan manifest

Are you using the latest master? Also, the log shows that your Stage 1 took only 2 seconds, which is unexpected. Can you show the output of

ls -lh data/manifests/

It should print something as follows:

-rw-r--r-- 1 kuangfangjun root 5.2M Mar  8 20:20 recordings_dev.json
-rw-r--r-- 1 kuangfangjun root 234K Mar  8 20:20 recordings_music.json
-rw-r--r-- 1 kuangfangjun root 341K Mar  8 20:20 recordings_noise.json
-rw-r--r-- 1 kuangfangjun root 155K Mar  8 20:20 recordings_speech.json
-rw-r--r-- 1 kuangfangjun root 2.6M Mar  8 20:20 recordings_test.json
-rw-r--r-- 1 kuangfangjun root  44M Mar  8 20:19 recordings_train.json
-rw-r--r-- 1 kuangfangjun root 4.2M Mar  8 20:20 supervisions_dev.json
-rw-r--r-- 1 kuangfangjun root 170K Mar  8 20:20 supervisions_music.json
-rw-r--r-- 1 kuangfangjun root 2.1M Mar  8 20:20 supervisions_test.json
-rw-r--r-- 1 kuangfangjun root  35M Mar  8 20:19 supervisions_train.json

is2022 commented 2 years ago

Thank you very much for the quick reply. My data directory only has the download folder in it, which contains the downloaded musan and aishell folders, together with the lm.

drwxr-xr-x.  3 root root  22 Apr 4 13:40 ./
drwxr-xr-x. 11 root root 330 Apr 4 13:40 ../
drwxr-xr-x.  3 root root  44 Apr 4 13:40 download/

csukuangfj commented 2 years ago

How did you run the script ./prepare.sh?

The log

2022-04-04 13:40:17 (prepare.sh:89:main) Stage 1: Prepare aishell manifest
2022-04-04 13:40:19 (prepare.sh:100:main) Stage 2: Prepare musan manifest

shows that you have run Stage 1 and Stage 2, which should have produced some manifest files in the folder data/.

is2022 commented 2 years ago

These are the commands in my run.sh:

cd egs/aishell/ASR

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

./prepare.sh
./conformer_ctc/train.py --num-epochs 10
./conformer_ctc/decode.py --method 1best --max-duration 100

csukuangfj commented 2 years ago

My data directory only has the download folder in it

There should be no download directory inside data. download should be in the same folder as ./prepare.sh.

Could you post the output of ls -lh ./download/*? I suspect that you have not downloaded the data yet.

is2022 commented 2 years ago

I put the data in dl_dir: /workspace/icefall/data/download

Output of "ll data/download/":

drwxr-xr-x. 3 root root 44 Apr 4 13:40 ./
drwxr-xr-x. 3 root root 22 Apr 4 13:40 ../
lrwxrwxrwx. 1 root root 74 Apr 4 13:40 aishell
drwxr-xr-x. 3 root root 68 Apr 4 13:40 lm/
lrwxrwxrwx. 1 root root 72 Apr 4 13:40 musan

csukuangfj commented 2 years ago

I put the data in dl_dir: /workspace/icefall/data/download

Could you show the changes you made to prepare.sh?

is2022 commented 2 years ago

line 31: dl_dir=/workspace/icefall/data/download

csukuangfj commented 2 years ago

Can you check that your stage 1 is finished successfully? https://github.com/k2-fsa/icefall/blob/87cf9231ea73631f1e4453400b3be06d45bcebf5/egs/aishell/ASR/prepare.sh#L87-L96

is2022 commented 2 years ago

I have the following files in /workspace/icefall/egs/aishell/ASR/data/manifests. Maybe the cd egs/aishell/ASR at the start of run.sh is messing things up?

drwxr-xr-x. 2 root root   4096 Apr 4 14:59 ./
drwxr-xr-x. 4 root root     36 Apr 4 14:59 ../
-rw-r--r--. 1 root root      0 Apr 4 14:55 .aishell_manifests.done
-rw-r--r--. 1 root root      0 Apr 4 14:59 .musan_manifests.done
-rw-r--r--. 1 root root      2 Apr 4 14:55 recordings_dev.json
-rw-r--r--. 1 root root 209677 Apr 4 14:59 recordings_music.json
-rw-r--r--. 1 root root 306733 Apr 4 14:59 recordings_noise.json
-rw-r--r--. 1 root root 138869 Apr 4 14:59 recordings_speech.json
-rw-r--r--. 1 root root      2 Apr 4 14:55 recordings_test.json
-rw-r--r--. 1 root root      2 Apr 4 14:55 recordings_train.json
-rw-r--r--. 1 root root      2 Apr 4 14:55 supervisions_dev.json
-rw-r--r--. 1 root root 173904 Apr 4 14:59 supervisions_music.json
-rw-r--r--. 1 root root      2 Apr 4 14:55 supervisions_test.json
-rw-r--r--. 1 root root      2 Apr 4 14:55 supervisions_train.json

csukuangfj commented 2 years ago

By default, the script is run using the following commands

cd egs/aishell/ASR 
./prepare.sh

and it generates files in ./data and ./download. You can select a different directory for download by changing dl_dir in prepare.sh, but the folder for data is fixed.

From the output of /workspace/icefall/egs/aishell/ASR/data/manifests, it looks like everything goes as expected. Do you still have the above errors?

is2022 commented 2 years ago

Yes, same errors. By the way, print(len(cut_set)) prints 0.

csukuangfj commented 2 years ago

Oh, wait.

-rw-r--r--. 1 root root 2 Apr 4 14:55 recordings_test.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 recordings_train.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 supervisions_test.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 supervisions_train.json

this does not look right.

Can you check that /workspace/icefall/data/download contains the folders described below? https://github.com/k2-fsa/icefall/blob/87cf9231ea73631f1e4453400b3be06d45bcebf5/egs/aishell/ASR/prepare.sh#L13-L16

is2022 commented 2 years ago

This is inside aishell:

drwxr-xr-x. 4 1682 500  62 Mar 30 17:03 ./
drwxr-xr-x. 4 1682 500 130 Mar 31 18:50 ../
drwxr-xr-x. 4 1682 500  47 Jun 16  2017 data_aishell/
drwxr-xr-x. 2 1682 500  57 Jun 21  2017 resource_aishell/

and this is inside musan:

drwxr-xr-x. 5 1682 500   80 Nov 16  2015 ./
drwxr-xr-x. 4 1682 500  130 Mar 31 18:50 ../
-rwxr-xr-x. 1 1682 500 1765 Oct 30  2015 README*
drwxr-xr-x. 7 1682 500  128 Oct 30  2015 music/
drwxr-xr-x. 4 1682 500   73 Oct 30  2015 noise/
drwxr-xr-x. 4 1682 500   66 Oct 30  2015 speech/

csukuangfj commented 2 years ago

The reason for the error is that your manifest files for aishell are empty. The following files are only 2 bytes each, so they contain no entries:

-rw-r--r--. 1 root root 2 Apr 4 14:55 recordings_test.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 recordings_train.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 supervisions_test.json
-rw-r--r--. 1 root root 2 Apr 4 14:55 supervisions_train.json

You can step into the following code https://github.com/k2-fsa/icefall/blob/87cf9231ea73631f1e4453400b3be06d45bcebf5/egs/aishell/ASR/prepare.sh#L87-L96 to see what went wrong.

You have to delete data/manifests/.aishell_manifests.done before going forward.
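As a rough way to spot this situation without reading ls output, the sketch below counts the entries in each JSON manifest and flags the marker file when empty manifests sit next to it. It assumes each .json manifest holds a plain JSON list (a 2-byte file is most likely just "[]"), and the helper names count_entries, find_empty_manifests and report are hypothetical, not part of icefall or lhotse. It only reports; it does not delete anything.

```python
import json
from pathlib import Path


def count_entries(manifest_path):
    """Return the number of top-level entries in a JSON manifest file."""
    with open(manifest_path, encoding="utf-8") as f:
        return len(json.load(f))


def find_empty_manifests(manifest_dir):
    """Return the sorted names of *.json files in manifest_dir with zero entries."""
    return sorted(
        p.name
        for p in Path(manifest_dir).glob("*.json")
        if count_entries(p) == 0
    )


def report(manifest_dir, marker_name=".aishell_manifests.done"):
    """Build human-readable warnings; nothing is deleted or modified."""
    manifest_dir = Path(manifest_dir)
    messages = [
        f"empty manifest: {name}" for name in find_empty_manifests(manifest_dir)
    ]
    # The marker name comes from the directory listing in this thread.
    if messages and (manifest_dir / marker_name).exists():
        messages.append(f"stale marker, delete before rerunning: {marker_name}")
    return messages


if __name__ == "__main__":
    # Run from egs/aishell/ASR, where prepare.sh writes data/manifests.
    for line in report("data/manifests"):
        print(line)
```

Note that the musan manifests in the listing above are non-empty, so a check like this would flag only the aishell files.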

is2022 commented 2 years ago

I even ran lhotse prepare aishell $dl_dir/aishell data/manifests directly but the above files are still empty.

csukuangfj commented 2 years ago

You can set breakpoints in https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/aishell.py#L72 to debug it.

For instance, change

    corpus_dir = Path(corpus_dir)
    assert corpus_dir.is_dir(), f"No such directory: {corpus_dir}"

to

    import pdb
    pdb.set_trace()
    corpus_dir = Path(corpus_dir)
    assert corpus_dir.is_dir(), f"No such directory: {corpus_dir}"

When you run lhotse prepare aishell $dl_dir/aishell data/manifests, it will enter pdb, where you can step through the code to find what is wrong.

is2022 commented 2 years ago

The issue was that I had forgotten to unzip the tar files inside aishell/data_aishell/wav/. After that, the manifests got generated (with some warnings about some missing transcripts) and now it's computing fbank. Thanks a lot for your help! :)
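For anyone hitting the same problem, a minimal sketch of how the leftover archives could be found and unpacked from Python is shown here. The "*.tar*" file pattern and the default path under download/ are assumptions for illustration; the thread only says that tar files inside aishell/data_aishell/wav/ had not been extracted. The helper names find_archives and extract_all are hypothetical.

```python
import tarfile
from pathlib import Path


def find_archives(wav_dir):
    """Return tar archives sitting directly inside wav_dir, sorted by name."""
    return sorted(
        p
        for p in Path(wav_dir).glob("*.tar*")
        if p.is_file() and tarfile.is_tarfile(str(p))
    )


def extract_all(archives, dest_dir):
    """Extract each archive into dest_dir. Only use on archives you trust."""
    for archive in archives:
        # "r:*" lets tarfile detect the compression (gzip, bz2, xz or none).
        with tarfile.open(str(archive), "r:*") as tar:
            tar.extractall(str(dest_dir))


if __name__ == "__main__":
    # Assumed location; adjust to your own dl_dir.
    wav_dir = Path("download/aishell/data_aishell/wav")
    archives = find_archives(wav_dir)
    print(f"{len(archives)} archive(s) found in {wav_dir}")
    # extract_all(archives, wav_dir)  # uncomment to unpack in place
```

The extraction call is left commented out so that running the script only counts the archives; after unpacking, data/manifests/.aishell_manifests.done still needs to be deleted, as noted earlier, before rerunning the manifest stage.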

pkufool commented 2 years ago

Sorry, my bad. I encountered this issue before and fixed it here (https://github.com/lhotse-speech/lhotse/pull/388), but I forgot to change the if condition in prepare.sh. See https://github.com/k2-fsa/icefall/pull/291.