urgent-challenge / urgent2024_challenge

Official data preparation scripts for the URGENT 2024 Challenge
Apache License 2.0

Error when running `generate_data_param.py` #8

Open Xiaobin-Rong opened 3 months ago

Xiaobin-Rong commented 3 months ago

I came across an error when running generate_data_param.py. The scp files used in simulation_train.yaml are as follows:

speech_scps:
- /data/tmp/dns5_clean_read_speech_resampled_filtered_train.scp
- /data/tmp/libritts_resampled_train.scp
- /data/tmp/vctk_train.scp
- /data/tmp/commonvoice_11.0_en_resampled_filtered_train.scp

All of these are generated by running prepare_espnet_data.sh.

Specifically, the error is related to line 95 in generate_data_param.py:

assert uid not in speech_dic[int(fs)], (uid, fs)

which prohibits duplicate uids across the *_train.scp files. However, I discovered that some audio files do share the same name despite having different paths, resulting in the same uid. For instance: dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/read_speech/00002e/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav and dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/000019/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav.

I merely modified the code assert uid not in speech_dic[int(fs)], (uid, fs) to

if uid in speech_dic[int(fs)]:
    print(f"[Warning] {scp} {uid} {fs}")

enabling it to keep running without the assertion. It appears that a large number of audio files are in the same situation. I am wondering whether this problem is common, and whether my modification is reasonable.
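For context, the collision the assertion guards against can be sketched as follows (my own simplified illustration, not the challenge script; it assumes each scp line holds "uid fs path", which may differ from the actual layout):

```python
from collections import defaultdict

def load_scps(scp_paths):
    """Build a {fs: {uid: audio_path}} mapping, warning on duplicate uids
    instead of asserting (later entries silently overwrite earlier ones)."""
    speech_dic = defaultdict(dict)
    for scp in scp_paths:
        with open(scp) as f:
            for line in f:
                # Assumed line layout: "<uid> <fs> <path>"
                uid, fs, audio_path = line.strip().split(maxsplit=2)
                if uid in speech_dic[int(fs)]:
                    print(f"[Warning] duplicate uid: {scp} {uid} {fs}")
                speech_dic[int(fs)][uid] = audio_path
    return speech_dic
```

The overwrite is the real risk of the warning-only approach: one of the two colliding files silently disappears from the training pool.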

Emrys365 commented 3 months ago

Hi @Xiaobin-Rong, could you check the tmp/dns5_clean_read_speech.json file generated by prepare_espnet_data.sh? It should contain two keys book_03287_chp_0002_reader_03297_7_seg_0_seg1 and book_03287_chp_0002_reader_03297_7_seg_0_seg1(2), which should avoid the issue you are encountering.

The above is handled by https://github.com/urgent-challenge/urgent2024_challenge/blob/main/utils/estimate_audio_bandwidth.py#L122-L131. Could you check whether your codebase contains the same script?
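The idea of that conflict handling can be sketched like this (a simplified illustration with a hypothetical helper name, not the actual repository code): when a key already exists, a parenthesized counter is appended.

```python
def insert_with_suffix(data, uid, path):
    """Insert path under uid; on a name conflict, append (2), (3), ...
    Hypothetical helper illustrating the dedup idea."""
    if uid in data:
        n = 2
        while f"{uid}({n})" in data:
            n += 1
        uid = f"{uid}({n})"
    data[uid] = path
    return uid
```

This keeps every colliding file in the mapping under a unique key instead of dropping one.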

Emrys365 commented 3 months ago

I think I've found the issue. The uid is reformatted in the subsequent script resample_to_estimated_bandwidth.py, so the above handling is overwritten. Let me try to fix this.
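The failure mode can be reproduced with the two conflicting paths quoted earlier in this thread: if a later step re-derives the uid from the file stem instead of carrying over the deduplicated key, the (2) suffix is lost.

```python
from pathlib import Path

# The two real paths from this thread share one filename:
p1 = "dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/read_speech/00002e/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav"
p2 = "dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/000019/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav"

# Even if the json keys were deduplicated to "...seg1" and "...seg1(2)",
# rebuilding uids from the file stem collapses them back into one key:
uids = {Path(p).stem for p in (p1, p2)}
print(len(uids))  # 1 -> the suffix added by the earlier deduplication is gone
```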

Xiaobin-Rong commented 3 months ago

@Emrys365 Thanks for your prompt response. Yes, the tmp/dns5_clean_read_speech.json file indeed contains the two keys book_03287_chp_0002_reader_03297_7_seg_0_seg1 and book_03287_chp_0002_reader_03297_7_seg_0_seg1(2) as you said, and my codebase contains the same lines as https://github.com/urgent-challenge/urgent2024_challenge/blob/main/utils/estimate_audio_bandwidth.py#L122-L131. But the tmp/dns5_clean_read_speech_resampled.scp file still contains the duplicated uid book_03287_chp_0002_reader_03297_7_seg_0_seg1.

So it seems that something went wrong when this code was executed: https://github.com/urgent-challenge/urgent2024_challenge/blob/4933e4f914f0aab652aa2db2867489cd4c8531e7/utils/prepare_DNS5_librivox_speech.sh#L40-L49

Xiaobin-Rong commented 3 months ago

> I think I've found the issue. The uid is reformatted in the subsequent script resample_to_estimated_bandwidth.py, so the above handling is overwritten. Let me try to fix this.

OK, thanks for your effort and dedication!

Emrys365 commented 3 months ago

@Xiaobin-Rong This issue should be addressed now. You will need to rerun the data preparation for the DNS5 speech data and regenerate the training configuration before simulating the training set.

Xiaobin-Rong commented 3 months ago

@Emrys365 Thanks for your quick response. I will try regenerating the data. However, I'm afraid that not only the DNS5 speech data but also the CommonVoice data need to be regenerated.

By the way, I would like to ask about the purpose of these lines (https://github.com/urgent-challenge/urgent2024_challenge/blob/main/prepare_espnet_data.sh#L126-L135), since the generated ${output_dir}/speech_train/wav.scp and the other files (speech_train/utt2spk, speech_train/text, speech_train/spk1.scp, etc.) do not seem to be used when simulating the training set. Could you please clarify?

Emrys365 commented 3 months ago

In my expectation, the CommonVoice data should work as is. You can check the generated tmp/commonvoice_11.0_en.json file: it should contain no parentheses in its keys, indicating that there are no name conflicts among the audio samples.

The speech_train subset is intended for dynamic-mixing-style training in ESPnet, which is more complicated but may improve performance. Instead of using a fixed training set (stored in data/train, simulated by simulation/simulate_data_from_param.py), you can specify --train_set speech_train --enh_config conf/tuning/xxx_dynamic_mixing.yaml in run.sh to train a model on training samples simulated on the fly.

Different from the fixed train subset, the speech_train subset has the same content in wav.scp (degraded speech) and spk1.scp (clean reference speech), so that the degraded speech is generated on the fly by augmenting the clean speech according to the configuration in conf/tuning/xxx_dynamic_mixing.yaml.
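As a toy illustration of what on-the-fly augmentation means here (a pure-Python sketch under simplified assumptions, not the ESPnet implementation): at load time a clean reference is mixed with a noise segment at a sampled SNR, so each epoch can see a different degraded version of the same utterance.

```python
def mix_on_the_fly(clean, noise, snr_db):
    """Scale noise to the requested SNR and add it to the clean signal.
    Toy sketch of dynamic-mixing-style augmentation on plain float lists."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Scale factor so that 10*log10(p_clean / p_scaled_noise) == snr_db
    scale = (p_clean / (p_noise * 10 ** (snr_db / 10))) ** 0.5
    return [c + scale * n for c, n in zip(clean, noise)]
```

In an actual dynamic-mixing setup the noise, SNR, and other degradation parameters would be resampled every time the utterance is loaded, rather than fixed once as in the pre-simulated train set.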

However, this is only designed for the ESPnet toolkit. If you have other preferences, you could also adapt the data structure in data/ for other toolkits that you are more familiar with.

Xiaobin-Rong commented 3 months ago

@Emrys365 OK, all my confusion is now completely resolved. Thank you again for your patient explanation!