microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Error in loading WavLLM model #78

Open rishabh004-ai opened 7 months ago

rishabh004-ai commented 7 months ago

I have installed all the libraries, but whenever I run bash examples/wavllm/scripts/inference_sft.sh $model_path $data_name, the code throws _pickle.UnpicklingError: invalid load key, '\xef'. The error originates from the call models, saved_cfg = checkpoint_utils.load_model_ensemble() in SpeechT5/WavLLM/fairseq/examples/wavllm/inference/generate.py (line 122 in the traceback; line 454 is the cli_main() entry point):

  File "/workspace/SpeechT5/WavLLM/fairseq/examples/wavllm/inference/generate.py", line 454, in <module>
    cli_main()
  File "/workspace/SpeechT5/WavLLM/fairseq/examples/wavllm/inference/generate.py", line 450, in cli_main
    main(args)
  File "/workspace/SpeechT5/WavLLM/fairseq/examples/wavllm/inference/generate.py", line 50, in main
    return _main(cfg, h)
  File "/workspace/SpeechT5/WavLLM/fairseq/examples/wavllm/inference/generate.py", line 122, in _main
    models, saved_cfg = checkpoint_utils.load_model_ensemble(
  File "/workspace/SpeechT5/WavLLM/fairseq/fairseq/checkpoint_utils.py", line 363, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/workspace/SpeechT5/WavLLM/fairseq/fairseq/checkpoint_utils.py", line 421, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/workspace/SpeechT5/WavLLM/fairseq/fairseq/checkpoint_utils.py", line 315, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/root/miniconda3/envs/wavllm/lib/python3.10/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/envs/wavllm/lib/python3.10/site-packages/torch/serialization.py", line 1258, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\xef'.

XiaoshanHsj commented 7 months ago

I downloaded the checkpoint and loaded it directly with torch.load(model_path, map_location="cpu"), and it loads correctly. Could you try the same to check whether your downloaded .pt file is intact?
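A quick way to sanity-check the downloaded file before handing it to torch.load is to look at its first bytes. The sketch below is only an illustration, not part of the WavLLM code: the function name sniff_checkpoint and its return labels are made up here. It relies on two facts: torch.save has written a zip archive by default since PyTorch 1.6, and '\xef' (the load key in the error above) is the first byte of a UTF-8 byte-order mark, which may mean the file is a text or XML response rather than a checkpoint.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"   # first byte matches the '\xef' in the error message
ZIP_MAGIC = b"PK\x03\x04"    # zip container written by torch.save since PyTorch 1.6
PICKLE_PROTO = b"\x80"       # opcode that opens a protocol >= 2 pickle stream


def sniff_checkpoint(path):
    """Guess what kind of file `path` is from its first bytes.

    Hypothetical helper: returns "zip", "pickle", "markup" or "unknown".
    """
    with Path(path).open("rb") as f:
        head = f.read(64)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(PICKLE_PROTO):
        return "pickle"
    # Drop a leading byte-order mark, then look for an XML/HTML opening bracket.
    text = head[len(UTF8_BOM):] if head.startswith(UTF8_BOM) else head
    if text.lstrip().startswith(b"<"):
        return "markup"
    return "unknown"
```

If this reports "markup", opening the file in a text editor will usually show an error page from the download server, and the checkpoint needs to be downloaded again.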

rishabh004-ai commented 7 months ago

Hi, thanks for the prompt response. I checked, and the model download link is not working. I also tried the link with wget, but it is not reachable. Could you please provide an alternate link?

XiaoshanHsj commented 7 months ago

Hi, the link has been updated. Are you using the new link or the old one? The new link works for me.

rishabh004-ai commented 7 months ago

Hi, thanks for responding. I tried the new link, but the download still fails for me. The response is: <?xml version="1.0" encoding="utf-8"?><Error><Code>PublicAccessNotPermitted</Code><Message>Public access is not permitted on this storage account. RequestId:45b0ef67-801e-0083-46f9-97df30000000

XiaoshanHsj commented 6 months ago

We have uploaded the checkpoint to Hugging Face. You can download it from https://huggingface.co/v-sjhu/WavLLM. Thanks.

BinWang28 commented 6 months ago

@rishabh004-ai Were you able to run inference successfully?

@XiaoshanHsj Would you consider releasing the inference framework under transformers instead of fairseq?

YepJin commented 6 months ago

> We have uploaded the checkpoint to huggingface. You could download it from https://huggingface.co/v-sjhu/WavLLM Thanks

The Hugging Face link does not seem to work. Could you please upload the checkpoint again? Thanks! Also, may I ask which $model_path and $data_name you are using, @XiaoshanHsj?

BinWang28 commented 6 months ago

@YepJin

I made it work.

bash examples/wavllm/scripts/inference_sft.sh your_path_to/final.pt asr

You also have to change the content of asr.csv. Otherwise, it will lead to a "not found" error.

id    audio    n_frames    prompt    tgt_text    with_speech
0    examples/wavllm/test_data/audio/asr.flac    166960    Based on the attached audio, generate a comprehensive text transcription of the spoken content.    he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce    True
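Since the "not found" error depends on the audio column resolving to a real file, it can help to check the manifest before running inference. The following is a minimal sketch, assuming the manifest is tab-separated with the header shown above (despite the .csv extension) and that relative audio paths are resolved against the directory the script is launched from. The helper name missing_audio and the root parameter are hypothetical, not part of the WavLLM code.

```python
import csv
from pathlib import Path


def missing_audio(manifest_path, root="."):
    """Return (id, audio) pairs whose audio file does not exist under `root`.

    Assumes a tab-separated manifest with at least `id` and `audio` columns.
    """
    missing = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # An absolute path in the manifest overrides `root`.
            audio = Path(root) / row["audio"]
            if not audio.is_file():
                missing.append((row["id"], row["audio"]))
    return missing
```

For example, missing_audio("examples/wavllm/test_data/asr.csv") run from the fairseq directory would list the rows whose audio path needs to be edited; the manifest location used here is itself an assumption.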

XiaoshanHsj commented 6 months ago

@BinWang28 Thanks for your reply. We currently have no plans to move the codebase to "transformers". In the next version, e.g. one based on LLAMA-3, we may try to use "transformers" to build our model.