**Open** · damnfarooq opened 11 months ago
This is the output from the previous command; I am now stuck in `run_eval_model.sh`.
```
(Farooq_thesis) phd-research@phd-research:~/research_space/w2v2-air-traffic$ bash /home/phd-research/research_space/w2v2-air-traffic/src/run_train_kenlm.sh
About to start the KenLM
Dataset name: uwb_atcc
Output folder: experiments/data/uwb_atcc/train/lm
uwb_atcc experiments/data/uwb_atcc/train/text
Exporting dataset to text file experiments/data/uwb_atcc/train/lm/4_corpus.txt...
lmplz -o 4 --text experiments/data/uwb_atcc/train/lm/4_corpus.txt --arpa experiments/data/uwb_atcc/train/lm/uwb_atcc_4g_no_fix.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/phd-research/research_space/w2v2-air-traffic/experiments/data/uwb_atcc/train/lm/4_corpus.txt
Unigram tokens 113301 types 1766
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:21192 2:2261260544 3:4239863552 4:6783781888
Statistics:
1 1765 D1=0.595645 D2=0.962202 D3+=1.62725
2 16099 D1=0.732908 D2=1.05218 D3+=1.47953
3 38208 D1=0.799969 D2=1.12127 D3+=1.28138
4 60883 D1=0.823461 D2=1.16559 D3+=1.23074
Memory estimate for binary LM:
type      kB
probing 2387 assuming -p 1.5
probing 2712 assuming -r models -p 1.5
trie     950 without quantization
trie     472 assuming -q 8 -b 8 quantization
trie     895 assuming -a 22 array pointer compression
trie     418 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:21180 2:257584 3:764160 4:1461192
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:21180 2:257584 3:764160 4:1461192
=== 5/5 Writing ARPA model ===
Name:lmplz VmPeak:13164676 kB VmRSS:9084 kB RSSMax:2609216 kB user:0.196349 sys:0.464826 CPU:0.661188 real:0.647472
corrected Ken LM in experiments/data/uwb_atcc/train/lm/uwb_atcc_4g.arpa
build_binary trie experiments/data/uwb_atcc/train/lm/uwb_atcc_4g.arpa experiments/data/uwb_atcc/train/lm/uwb_atcc_4g.binary
Reading experiments/data/uwb_atcc/train/lm/uwb_atcc_4g.arpa
Identifying n-grams omitted by SRI
Writing trie
SUCCESS
done doing training of KenLM check the output folder: experiments/data/uwb_atcc/train/lm
Done training 4-gram in experiments/data/uwb_atcc/train/lm
```
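For reference, the `Unigram tokens 113301 types 1766` line from lmplz stage 1/5 is simply the total whitespace-token count and the number of distinct word types in the corpus file. A minimal stdlib sketch of that count (the two sample transmissions below are invented, not taken from the UWB-ATCC corpus):

```python
# Sketch of what lmplz stage 1/5 reports: total whitespace tokens
# ("Unigram tokens") and distinct word types in the corpus. The sample
# lines are hypothetical stand-ins for the contents of 4_corpus.txt.
from collections import Counter

def corpus_stats(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)

tokens, types = corpus_stats([
    "contact ruzyne tower one one eight zero five",
    "ruzyne tower good day",
])
print(tokens, types)  # 12 tokens, 9 distinct types
```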
I fixed the issue by changing the model path in `run_eval_model.sh` to the line below. I now get output, but there are some warnings about unigrams, so I am not sure whether it worked as expected:
```
path_to_model="experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc"
```
```
(Farooq_thesis) phd-research@phd-research:~/research_space/w2v2-air-traffic$ bash src/run_eval_model.sh
About to evaluate a Wav2Vec 2.0 model
Dataset in: experiments/data/uwb_atcc/test
Output folder: /home/phd-research/research_space/w2v2-air-traffic/experiments/results/baselines/wav2vec2-base/uwb_atcc/output
Integrating a LM by shallow fusion, results should be better
Loading the Wav2Vec 2.0 model, loading...
Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
No known unigrams provided, decoding results might be a lot worse.
Loading the dataset...
Using custom data configuration test-085e5dd7a4b8bb1c
Downloading and preparing dataset atc_data_loader/test to /home/phd-research/research_space/w2v2-air-traffic/.cache/eval/experiments/data/uwb_atcc/test/atc_data_loader/test-085e5dd7a4b8bb1c/0.0.0/f2633cc53c6abe32cddd4152eebde1a4e3c9953e1446e190b8d9a13330cddaa4...
Dataset atc_data_loader downloaded and prepared to /home/phd-research/research_space/w2v2-air-traffic/.cache/eval/experiments/data/uwb_atcc/test/atc_data_loader/test-085e5dd7a4b8bb1c/0.0.0/f2633cc53c6abe32cddd4152eebde1a4e3c9953e1446e190b8d9a13330cddaa4. Subsequent calls will reuse this data.
 67%|████████████████████             | 2/3 [00:47<00:23, 23.59s/ba]
/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py:154: UserWarning: `as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your labels by using the argument `text` of the regular `__call__` method (either in the same call as your audio inputs, or in a separate call.
  warnings.warn(
[... the same UserWarning is printed several more times ...]
Performing inference on dataset... Loading
inference: 100%|█████████████████████| 2824/2824 [16:40<00:00, 2.82ex/s]
Downloading builder script: 100%|████| 5.60k/5.60k [00:00<00:00, 7.62MB/s]
printing the ASR results in /home/phd-research/research_space/w2v2-air-traffic/experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/output/uwb_atcc/hypo
Done!
Done evaluating model in /home/phd-research/research_space/w2v2-air-traffic/experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc with LM
```
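About the "Unigrams not provided" warning: pyctcdecode prints it when the decoder is built from a binary KenLM file, since it can only read the word list automatically from an ARPA file. One possible workaround (a sketch, assuming standard ARPA layout; the `uwb_atcc_4g.arpa` file name comes from the training log above) is to extract the unigrams from the `.arpa` yourself and pass them to the decoder builder:

```python
# Hedged sketch: pull the unigram (word) list out of a KenLM ARPA file,
# skipping the sentence-boundary and unknown-word symbols. The resulting
# list could then be handed to the decoder (in pyctcdecode, the `unigrams`
# argument of build_ctcdecoder) to silence the warning.
def arpa_unigrams(lines):
    words, in_unigrams = [], False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):  # blank line or next section
                break
            word = line.split("\t")[1]
            if word not in ("<s>", "</s>", "<unk>"):
                words.append(word)
    return words

# Tiny synthetic ARPA fragment for illustration (not the real uwb_atcc LM):
sample = """\\data\\
ngram 1=4

\\1-grams:
-1.0\t<s>\t-0.3
-1.2\t</s>
-0.8\ttower\t-0.2
-0.9\tcontact\t-0.1

\\2-grams:
""".splitlines()
print(arpa_unigrams(sample))  # ['tower', 'contact']
```

In practice you would read `experiments/data/uwb_atcc/train/lm/uwb_atcc_4g.arpa` line by line instead of the synthetic fragment.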
I am having the following problem; can you help me fix it?
```
(Farooq_thesis) phd-research@phd-research:~/research_space/w2v2-air-traffic$ bash src/run_eval_model.sh
About to evaluate a Wav2Vec 2.0 model
Dataset in: experiments/data/uwb_atcc/test
Output folder: experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/output
Integrating a LM by shallow fusion, results should be better
Loading the Wav2Vec 2.0 model, loading...
/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py:53: FutureWarning: Loading a tokenizer inside Wav2Vec2Processor from a config that does not include a `tokenizer_class` attribute is deprecated and will be removed in v5. Please add `'tokenizer_class': 'Wav2Vec2CTCTokenizer'` attribute to either your `config.json` or `tokenizer_config.json` file to suppress this warning:
  warnings.warn(
Traceback (most recent call last):
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py", line 51, in from_pretrained
    return super().from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/processing_utils.py", line 182, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/processing_utils.py", line 226, in _get_arguments_from_pretrained
    args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 640, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1761, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/checkpoint-10000'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/checkpoint-10000' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/phd-research/research_space/w2v2-air-traffic/src/eval_model.py", line 250, in <module>
    main()
  File "/home/phd-research/research_space/w2v2-air-traffic/src/eval_model.py", line 152, in main
    processor, processor_ctc_kenlm, model = get_kenlm_processor(path_model, path_lm)
  File "/home/phd-research/research_space/w2v2-air-traffic/src/eval_model.py", line 47, in get_kenlm_processor
    processor = AutoProcessor.from_pretrained(path_tokenizer)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/auto/processing_auto.py", line 254, in from_pretrained
    return PROCESSOR_MAPPING[type(config)].from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py", line 63, in from_pretrained
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/phd-research/anaconda3/envs/Farooq_thesis/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1761, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/checkpoint-10000'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/checkpoint-10000' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.
```
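This OSError usually means the `checkpoint-10000` folder contains only the model weights: the HF `Trainer` typically saves tokenizer files (e.g. `vocab.json`, `tokenizer_config.json`, `special_tokens_map.json`) to the run's top-level output directory, not to every `checkpoint-*` subfolder. A small diagnostic sketch (the list of expected files is an assumption about what `Wav2Vec2CTCTokenizer` needs, and the checkpoint path is the one from the traceback):

```python
# Hedged diagnostic: report which tokenizer files are missing from a model
# directory. If they are absent from checkpoint-10000 but present in the
# run's top-level folder, pointing path_to_model at the top-level folder
# (as in the fix above) or copying the files into the checkpoint should work.
from pathlib import Path

TOKENIZER_FILES = ["vocab.json", "tokenizer_config.json", "special_tokens_map.json"]

def missing_tokenizer_files(model_dir):
    d = Path(model_dir)
    return [f for f in TOKENIZER_FILES if not (d / f).is_file()]

ckpt = "experiments/results/baselines/wav2vec2-base/uwb_atcc/0.0ld_0.0ad_0.0attd_0.0fpd_0.01mtp_12mtl_0.0mfp_12mfl_2acc/checkpoint-10000"
print(missing_tokenizer_files(ckpt))
```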