preparing the dataset uwb atcc: ValueError with n_train, n_test = _validate_shuffle_split

sarranetor commented 1 year ago

I extracted the uwb_atcc database into ZCU_CZ_ATC, putting the wav files in the audio folder and the other files in the transcripts folder. I ran the bash script:

bash data/databases/uwb_atcc/data_prepare_uwb_atcc_corpus.sh

there were 0 empty or blank utterances
printing the text file in: experiments/data/uwb_atcc/prep/text2_raw_spk
printing the text and tags file in: experiments/data/uwb_atcc/prep/utt2speakerid

Traceback (most recent call last):
  File "data/utils/gen_train_test.py", line 61, in <module>
    main()
  File "data/utils/gen_train_test.py", line 42, in main
    x_train, x_test = train_test_split(
  File "/home/liv/Developer/venv3.8-tf2/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2562, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/liv/Developer/venv3.8-tf2/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2236, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=0, test_size=None and train_size=0.8, the resulting train set will be empty. Adjust any of the aforementioned parameters.

What could be causing this? I am not sure.
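Since `n_samples=0` means the split received no utterances at all, one quick sanity check is whether the corpus folders actually contain files where the prepare script looks for them. The following is a minimal sketch, not code from the repository: `count_files` and `check_layout` are hypothetical helpers, and it assumes the files sit directly inside `audio` and `transcripts` under the corpus root (the exact location the script expects for `ZCU_CZ_ATC` is not stated in this thread).

```python
from pathlib import Path


def count_files(folder):
    """Count regular files directly inside `folder`; return 0 if it does not exist."""
    p = Path(folder)
    if not p.is_dir():
        return 0
    return sum(1 for entry in p.iterdir() if entry.is_file())


def check_layout(corpus_root):
    """Return the file count for the two subfolders described in the report above."""
    root = Path(corpus_root)
    # "audio" and "transcripts" are the folder names mentioned in this thread.
    return {name: count_files(root / name) for name in ("audio", "transcripts")}


if __name__ == "__main__":
    # Hypothetical path: adjust to wherever ZCU_CZ_ATC was extracted.
    print(check_layout("ZCU_CZ_ATC"))
```

If either count is 0, the prepare script probably never saw the data, which would be consistent with the empty train set in the traceback.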

JuanPZuluaga commented 1 year ago

Hello!

Sorry for taking so much time to reply. Could you put here the output of experiments/data/uwb_atcc/prep/text2_raw_spk?

It looks like the dataset is not getting correctly prepared at some point. That's probably why you're getting that error about '0' samples.

Pablo
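To check whether the preparation step produced anything, it may help to count the non-blank lines in the prep file before the train/test split runs. This is a rough sketch under the assumption that `text2_raw_spk` holds one utterance per line; `count_utterances` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def count_utterances(path):
    """Return the number of non-blank lines in a prep file, or 0 if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return 0
    with p.open(encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if line.strip())


if __name__ == "__main__":
    # Path taken from the script output quoted in this thread.
    n = count_utterances("experiments/data/uwb_atcc/prep/text2_raw_spk")
    if n == 0:
        print("prep file is empty or missing: the split would fail with n_samples=0")
    else:
        print(f"{n} utterances found")
```

A result of 0 here would match the `n_samples=0` in the `ValueError` and point at the preparation step rather than at the split itself.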

damnfarooq commented 1 year ago

I am getting the same error as sarranetor in VS Code.

On Google Colab, however, this is what I got:

Empty utterance: uwb-atcc_APP-O8f26B_000225_000353 uwb-atcc_APP-O8f26B 2.260 3.530 [air]
Empty utterance: uwb-atcc_TWR-JfoHJv_000234_000308 uwb-atcc_TWR-JfoHJv 2.340 3.080 [air]
Empty utterance: uwb-atcc_TWR-VIKx0e_000309_000408 uwb-atcc_TWR-VIKx0e 3.090 4.080 [air]
Empty utterance: uwb-atcc_TWR-v1sw0a_002796_002867 uwb-atcc_TWR-v1sw0a 27.960 28.670 [air]
Empty utterance: uwb-atcc_TWR-q2wbS3_000336_000419 uwb-atcc_TWR-q2wbS3 3.360 4.190 [air]
Empty utterance: uwb-atcc_TWR-ZfmGUj_002451_002578 uwb-atcc_TWR-ZfmGUj 24.510 25.780 [ground]
Empty utterance: uwb-atcc_ACCU-OvZN8g_001321_001433 uwb-atcc_ACCU-OvZN8g 13.210 14.330 [ground]
there were 492 empty or blank utterances
printing the text file in: experiments/data/uwb_atcc/prep/text2_raw_spk
printing the text and tags file in: experiments/data/uwb_atcc/prep/utt2speakerid
printing the TRAIN SET IDS file in: experiments/data/uwb_atcc/train/ids
printing the TEST SET IDS file in: experiments/data/uwb_atcc/test/ids
creating the train folder:
creating the test folder:
UWB-ATCC corpus was sucessfully created

But the next bash script gave the following error:

!bash /content/drive/MyDrive/w2v2-air-traffic/ablations/uwb_atcc/train_w2v2_base.sh

Vocabulary is empty, you passed config/vocab.json
We will create a new vocabulary from the training+test file
About to start the training
output folder: experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/
python3: can't open file '/content/drive/MyDrive/w2v2-air-traffic/data/src/run_speech_recognition_ctc.py': [Errno 2] No such file or directory
Done training of baseline model for UWB-ATCC database

damnfarooq commented 1 year ago

Please help me fix the issue

JuanPZuluaga commented 1 year ago

Hello,

I think the error might be here:

https://github.com/idiap/w2v2-air-traffic/blob/25b91bfc1975d749bdad76bd6bd3b73cc140b2f7/src/run_asr_fine_tuning.sh#L18

Could you modify that value to:

cmd=""

And let me know what the output is?

damnfarooq commented 1 year ago

Yeah sure, give me 5 minutes

damnfarooq commented 1 year ago

I can't believe it actually worked, damn!!! It would have taken me years to solve this issue. Thanks, it means a lot. This is the output I obtained. Are 492 empty utterances expected?

...
Empty utterance: uwb-atcc_APP-ZWCH4J_024924_025001 uwb-atcc_APP-ZWCH4J 249.240 250.010 [air]
Empty utterance: uwb-atcc_ACCU-bNGNLz_003137_003210 uwb-atcc_ACCU-bNGNLz 31.370 32.100 [air]
Empty utterance: uwb-atcc_ACCU-bNGNLz_003286_003369 uwb-atcc_ACCU-bNGNLz 32.860 33.690 [air]
Empty utterance: uwb-atcc_ACCU-bNGNLz_007525_007604 uwb-atcc_ACCU-bNGNLz 75.250 76.040 [air]
Empty utterance: uwb-atcc_TWR-yGF8s5_001233_001335 uwb-atcc_TWR-yGF8s5 12.330 13.350 [ground]
Empty utterance: uwb-atcc_TWR-m8CuYQ_001384_001432 uwb-atcc_TWR-m8CuYQ 13.840 14.320 [ground]
Empty utterance: uwb-atcc_APP-oDje3J_000356_000642 uwb-atcc_APP-oDje3J 3.560 6.420 [air]
Empty utterance: uwb-atcc_ACCU-y5UHTs_001414_001489 uwb-atcc_ACCU-y5UHTs 14.140 14.890 [air]
Empty utterance: uwb-atcc_ACCU-WGx5qF_001133_001227 uwb-atcc_ACCU-WGx5qF 11.330 12.270 [air]
Empty utterance: uwb-atcc_APP-sgKjJz_000308_000379 uwb-atcc_APP-sgKjJz 3.080 3.790 [air]
Empty utterance: uwb-atcc_ACCU-QfXh8b_001887_002017 uwb-atcc_ACCU-QfXh8b 18.870 20.170 [air]
Empty utterance: uwb-atcc_ACCU-hh8aIj_002189_002308 uwb-atcc_ACCU-hh8aIj 21.890 23.080 [air]
Empty utterance: uwb-atcc_TWR-mDzIBi_005798_005958 uwb-atcc_TWR-mDzIBi 57.980 59.580 [ground]
Empty utterance: uwb-atcc_TWR-mDzIBi_006064_006149 uwb-atcc_TWR-mDzIBi 60.640 61.490 [air]
Empty utterance: uwb-atcc_APP-wBCMEy_000319_000823 uwb-atcc_APP-wBCMEy 3.190 8.230 [ground]
Empty utterance: uwb-atcc_APP-wBCMEy_000973_001205 uwb-atcc_APP-wBCMEy 9.730 12.050 [air]
Empty utterance: uwb-atcc_TWR-952cUz_000760_000853 uwb-atcc_TWR-952cUz 7.600 8.540 [air]
Empty utterance: uwb-atcc_ACCU-OvZN8g_001321_001433 uwb-atcc_ACCU-OvZN8g 13.210 14.330 [ground]
Empty utterance: uwb-atcc_TWR-v1sw0a_002796_002867 uwb-atcc_TWR-v1sw0a 27.960 28.670 [air]
Empty utterance: uwb-atcc_TWR-q2wbS3_000336_000419 uwb-atcc_TWR-q2wbS3 3.360 4.190 [air]
Empty utterance: uwb-atcc_TWR-ZfmGUj_002451_002578 uwb-atcc_TWR-ZfmGUj 24.510 25.780 [ground]
Empty utterance: uwb-atcc_TWR-VIKx0e_000309_000408 uwb-atcc_TWR-VIKx0e 3.090 4.080 [air]
Empty utterance: uwb-atcc_TWR-JfoHJv_000234_000308 uwb-atcc_TWR-JfoHJv 2.340 3.080 [air]
Empty utterance: uwb-atcc_APP-O8f26B_000225_000353 uwb-atcc_APP-O8f26B 2.260 3.530 [air]
Empty utterance: uwb-atcc_ACCU-FVpUok_000535_000633 uwb-atcc_ACCU-FVpUok 5.350 6.330 [air]
Empty utterance: uwb-atcc_TWR-jBaOYB_000549_000906 uwb-atcc_TWR-jBaOYB 5.490 9.060 [air]
Empty utterance: uwb-atcc_TWR-jBaOYB_001009_001329 uwb-atcc_TWR-jBaOYB 10.090 13.290 [ground]
Empty utterance: uwb-atcc_TWR-jBaOYB_001398_001537 uwb-atcc_TWR-jBaOYB 13.980 15.370 [air]
Empty utterance: uwb-atcc_APP-Q1M9yF_005349_005423 uwb-atcc_APP-Q1M9yF 53.490 54.230 [air]
there were 492 empty or blank utterances
printing the text file in: experiments/data/uwb_atcc/prep/text2_raw_spk
printing the text and tags file in: experiments/data/uwb_atcc/prep/utt2speakerid
printing the TRAIN SET IDS file in: experiments/data/uwb_atcc/train/ids
printing the TEST SET IDS file in: experiments/data/uwb_atcc/test/ids
creating the train folder:
creating the test folder:
UWB-ATCC corpus was sucessfully created

JuanPZuluaga commented 1 year ago

Wow, I'm happy it worked :-)

That is expected: not all samples have a transcript. I'll update that part of the code right now!

Thanks for this. I'm closing this, feel free to open it again.
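For anyone who wants to confirm the empty-utterance count on their own run, the `Empty utterance:` lines in the output above can be tallied per speaker tag. This is only an illustrative sketch: the regular expression is inferred from the log lines quoted in this thread, and `tally_empty` is a hypothetical helper, not something the repository provides.

```python
import re
from collections import Counter

# Pattern inferred from the quoted output:
# "Empty utterance: <utt_id> <rec_id> <start> <end> [<tag>]"
EMPTY_RE = re.compile(
    r"Empty utterance:\s+(\S+)\s+(\S+)\s+([\d.]+)\s+([\d.]+)\s+\[(\w+)\]"
)


def tally_empty(log_text):
    """Count 'Empty utterance' entries per speaker tag (e.g. air, ground)."""
    return Counter(m.group(5) for m in EMPTY_RE.finditer(log_text))


if __name__ == "__main__":
    sample = (
        "Empty utterance: uwb-atcc_APP-O8f26B_000225_000353 uwb-atcc_APP-O8f26B 2.260 3.530 [air]\n"
        "Empty utterance: uwb-atcc_TWR-ZfmGUj_002451_002578 uwb-atcc_TWR-ZfmGUj 24.510 25.780 [ground]\n"
    )
    counts = tally_empty(sample)
    print(dict(counts), "total:", sum(counts.values()))
```

Run over the full output of the prepare script, the total should agree with the "there were 492 empty or blank utterances" line if every skipped utterance is logged in this format.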

JiweiTian commented 1 year ago

I have the same issue as sarranetor: I extracted the uwb_atcc database into ZCU_CZ_ATC, putting the wav files in the audio folder and the other files in the transcripts folder. I ran the bash script:

bash data/databases/uwb_atcc/data_prepare_uwb_atcc_corpus.sh

there were 0 empty or blank utterances
printing the text file in: experiments/data/uwb_atcc/prep/text2_raw_spk
printing the text and tags file in: experiments/data/uwb_atcc/prep/utt2speakerid

Traceback (most recent call last):
  File "data/utils/gen_train_test.py", line 61, in <module>
    main()
  File "data/utils/gen_train_test.py", line 42, in main
    x_train, x_test = train_test_split(
  File "/home/liv/Developer/venv3.8-tf2/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2562, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/liv/Developer/venv3.8-tf2/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2236, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=0, test_size=None and train_size=0.8, the resulting train set will be empty. Adjust any of the aforementioned parameters.

idiap / w2v2-air-traffic

preparing the dataset uwb atcc: ValueError with n_train, n_test = _validate_shuffle_split #2