Open Xiaobin-Rong opened 3 months ago
Hi @Xiaobin-Rong, could you check the tmp/dns5_clean_read_speech.json file generated by prepare_espnet_data.sh? It should contain two keys, book_03287_chp_0002_reader_03297_7_seg_0_seg1 and book_03287_chp_0002_reader_03297_7_seg_0_seg1(2), which should avoid the issue you are encountering.
The above is handled by https://github.com/urgent-challenge/urgent2024_challenge/blob/main/utils/estimate_audio_bandwidth.py#L122-L131. Could you check whether your codebase contains the same script?
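As a rough illustration of this kind of handling (a sketch only, not the code in estimate_audio_bandwidth.py), duplicate base names can be disambiguated by appending a counter in parentheses. The helper name `unique_uid`, the `seen` dictionary, and the shortened paths are assumptions made for this example.

```python
from pathlib import Path


def unique_uid(path, seen):
    """Return a uid for `path` that is not yet in `seen` (hypothetical helper)."""
    base = Path(path).stem  # file name without the .wav extension
    uid = base
    n = 1
    while uid in seen:
        # Name conflict: try base(2), base(3), ... until the uid is free.
        n += 1
        uid = f"{base}({n})"
    seen[uid] = path
    return uid


seen = {}
# Shortened, illustrative paths: same file name in two different folders.
paths = [
    "read_speech/00002e/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav",
    "read_speech/000019/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav",
]
uids = [unique_uid(p, seen) for p in paths]
```

With these two paths, the second file gets the suffix (2), which matches the pair of keys expected in the JSON file.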
@Emrys365
Thanks for your prompt response. Yes, the tmp/dns5_clean_read_speech.json file does contain the two keys book_03287_chp_0002_reader_03297_7_seg_0_seg1 and book_03287_chp_0002_reader_03297_7_seg_0_seg1(2), as you said, and my codebase contains the same lines as https://github.com/urgent-challenge/urgent2024_challenge/blob/main/utils/estimate_audio_bandwidth.py#L122-L131. But the tmp/dns5_clean_read_speech_resampled.scp file has the same uid for both: book_03287_chp_0002_reader_03297_7_seg_0_seg1.
So it seems that something went wrong when this code was executed: https://github.com/urgent-challenge/urgent2024_challenge/blob/4933e4f914f0aab652aa2db2867489cd4c8531e7/utils/prepare_DNS5_librivox_speech.sh#L40-L49
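A quick way to confirm this kind of collision is to scan an scp file for repeated uids. The sketch below assumes the uid is the first whitespace-separated field on each line. The function name `find_duplicate_uids` and the sample lines (including the remaining fields) are made up for illustration.

```python
from collections import Counter


def find_duplicate_uids(scp_lines):
    """Return the uids that occur more than once (uid assumed to be field 1)."""
    uids = [line.split(maxsplit=1)[0] for line in scp_lines if line.strip()]
    return sorted(uid for uid, count in Counter(uids).items() if count > 1)


# Illustrative lines only; the real file may carry different fields after the uid.
sample = [
    "book_a_seg1 48000 some/dir/book_a_seg1.flac",
    "book_a_seg1 48000 other/dir/book_a_seg1.flac",
    "book_b_seg1 48000 some/dir/book_b_seg1.flac",
]
duplicates = find_duplicate_uids(sample)
```

To check a real file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.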
I think I've found the issue. The uid is reformatted in the subsequent script resample_to_estimated_bandwidth.py, so the above handling is overwritten. Let me try to fix this.
OK, thanks for your effort and dedication!
@Xiaobin-Rong This issue should be addressed now. You will need to rerun the data preparation for the DNS5 speech data and regenerate the training configuration before simulating the training set.
@Emrys365 Thanks for your quick response. I will try regenerating the data. However, I'm afraid that not only the DNS5 speech data but also the CommonVoice data will need to be regenerated.
By the way, I would like to ask what the purpose of these lines is (https://github.com/urgent-challenge/urgent2024_challenge/blob/main/prepare_espnet_data.sh#L126-L135), as it seems that the generated ${output_dir}/speech_train/wav.scp and other files (speech_train/utt2spk, speech_train/text, speech_train/spk1.scp, etc.) are not used when simulating the training set. Could you please clarify?
I would expect the CommonVoice data to work as is. You could check the generated tmp/commonvoice_11.0_en.json file: it should contain no parentheses in its keys, indicating no name conflicts among the audio samples.
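This check can be scripted. The sketch below assumes the JSON file holds a single top-level object whose keys are the uids. The function name `keys_with_parentheses` and the sample values are hypothetical.

```python
import json


def keys_with_parentheses(mapping):
    """Return the keys that contain '(' or ')', i.e. renamed duplicates."""
    return [key for key in mapping if "(" in key or ")" in key]


# Stand-in for the loaded file; the values are placeholders for this example.
example = json.loads('{"clip_0001": 1, "clip_0002": 2, "clip_0002(2)": 3}')
conflicts = keys_with_parentheses(example)
```

For a real file, the mapping would come from `json.load` on the opened JSON file. An empty result would indicate no name conflicts.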
The speech_train subset is intended for dynamic-mixing-style training in ESPnet, which is more complicated but may improve performance. Instead of using a fixed training set (stored in data/train, simulated by simulation/simulate_data_from_param.py), you can specify --train_set speech_train --enh_config conf/tuning/xxx_dynamic_mixing.yaml in run.sh to train a model with training samples simulated on the fly.
Unlike the fixed train subset, the speech_train subset has the same content in wav.scp (degraded speech) and spk1.scp (clean reference speech), so that the degraded speech will be augmented on the fly according to the configuration in conf/tuning/xxx_dynamic_mixing.yaml.
However, this is only designed for the ESPnet toolkit. If you prefer another toolkit, you could adapt the data structure in data/ to one that you are more familiar with.
@Emrys365 OK, all my confusion is now completely resolved. Thank you again for your patient explanation!
I came across an error when running generate_data_param.py. All of the scp files used in simulation_train.yaml are generated by running prepare_espnet_data.sh.
Specifically, the error is related to line 95 in generate_data_param.py, where you prohibit two identical uids in the ***train.scp files. However, I discovered that some audio files do have the same name but different paths, resulting in the same uid. For instance: dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/read_speech/00002e/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav and dns5_fullband/Track1_Headset/mnt/dnsv5/clean/read_speech/000019/book_03287_chp_0002_reader_03297_7_seg_0_seg1.wav.
I simply modified the line assert uid not in speech_dic[int(fs)], (uid, fs) so that the script keeps running without enforcing the assertion. There appear to be many audio files in the same situation. Is this problem a common one, and is my modification reasonable?
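One thing to weigh when relaxing that check: if speech_dic[int(fs)] is a plain dictionary keyed by uid (an assumption here, inferred only from the assertion), then skipping the check lets a later entry silently replace an earlier one. A minimal sketch with made-up entries follows. `build_speech_dic`, the short uid, the paths, and the 48000 Hz rate are all hypothetical.

```python
def build_speech_dic(entries, check=True):
    """Group (uid, fs, path) entries by sampling rate, then by uid."""
    speech_dic = {}
    for uid, fs, path in entries:
        by_uid = speech_dic.setdefault(int(fs), {})
        if check:
            # Same form as the assertion quoted above.
            assert uid not in by_uid, (uid, fs)
        by_uid[uid] = path  # overwrites any earlier path stored under this uid
    return speech_dic


# Two different files that end up with the same uid.
entries = [
    ("seg1", 48000, "00002e/seg1.wav"),
    ("seg1", 48000, "000019/seg1.wav"),
]
relaxed = build_speech_dic(entries, check=False)
# Only one of the two files remains under the shared uid.
```

Under that assumption, the run continues but one of the two recordings is dropped from the dictionary, so giving the files distinct uids keeps more data than disabling the check.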