linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence

Unable to transcribe audio when using a fine-tuned whisper medium model #130

Closed · yilmazay74 closed this 1 year ago

yilmazay74 commented 1 year ago

Discussed in https://github.com/linto-ai/whisper-timestamped/discussions/129

Originally posted by **yilmazay74** on October 30, 2023

Hi all,

We have been using Whisper for a while. Recently we started generating our own fine-tuned models by adding customized audio and transcription data. We can use these fine-tuned models with the standard Whisper inference scripts without problems.

Recently we also wanted word-by-word timestamps in the results, so we turned to whisper-timestamped. With whisper-timestamped we can transcribe audio using the pretrained Whisper models (e.g. the medium model) without problems. However, whenever we try to use our own fine-tuned models, it throws exceptions.

If I use the `load_model()` method (i.e. `model = whisper.load_model(...)`), it raises the following exception:

```
File "/home/tekrom/components/whisper/service/__init__.py", line 29, in create_app
    model = load_model("service/models/medium-v4/pytorch_model.bin", device="cpu")
File "/home/tekrom/components/whisper/service/transcribe.py", line 2191, in load_model
    whisper_model.load_state_dict(hf_state_dict)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Whisper:
    Unexpected key(s) in state_dict: "proj_out.weight".
```

If I use the `WhisperForConditionalGeneration` class's `from_pretrained()` method (i.e. `model = WhisperForConditionalGeneration.from_pretrained("service/models/medium-v4")`), it raises the following exception:

```
File "/home/tekrom/components/whisper/service/__init__.py", line 138, in asr3
    left_converted_result = processChannel(get_audio_tensor(audio_left))
File "/home/tekrom/components/whisper/service/__init__.py", line 176, in processChannel
    wResult = transcribe(model,
File "/home/tekrom/components/whisper/service/transcribe.py", line 226, in transcribe_timestamped
    input_stride = N_FRAMES // model.dims.n_audio_ctx
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'dims'
```

So it looks like the model format that whisper-timestamped expects and the one that fine-tuning generates are different, so it cannot find some attributes and fails. I would appreciate it if someone could guide me on how to resolve this issue.

Thanks in advance.
Y. Ay
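For context on the two errors above: an original OpenAI Whisper checkpoint (`.pt`) is a dictionary holding a `dims` section plus a `model_state_dict` whose keys match whisper's `Whisper` module, whereas a transformers fine-tuning run saves a flat `pytorch_model.bin` state dict for `WhisperForConditionalGeneration`, which keeps its output projection under `proj_out` and exposes no `dims` attribute. A minimal sketch to inspect the difference (the paths come from the report above and are only illustrative):

```python
import torch

# State dict produced by transformers fine-tuning (path from the report above).
hf_state_dict = torch.load("service/models/medium-v4/pytorch_model.bin", map_location="cpu")
print([k for k in hf_state_dict if "proj_out" in k])  # e.g. ['proj_out.weight']

# An original OpenAI Whisper checkpoint bundles metadata with the weights
# (hypothetical local file):
# openai_ckpt = torch.load("medium.pt", map_location="cpu")
# print(openai_ckpt.keys())  # dict_keys(['dims', 'model_state_dict'])
```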
Jeronymous commented 1 year ago

whisper-timestamped should be able to load finetuned models (with Speechbrain or transformers).

You haven't shared the code you are using, which would be needed for a real investigation. But given these lines...

File "/home/tekrom/components/whisper/service/init.py", line 29, in create_app
model = load_model("service/models/medium-v4/pytorch_model.bin", device="cpu")

...I guess you are using the loading function from whisper, not the one from whisper-timestamped.

Try `from whisper_timestamped import load_model` instead.
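For reference, a minimal sketch of what the suggested fix might look like, following the usage shown in the whisper_timestamped README; the model directory is the one from the report above, the audio file name is a placeholder, and it assumes `load_model` accepts a local transformers checkpoint directory:

```python
import whisper_timestamped as whisper

# Load the fine-tuned model through whisper_timestamped's loader
# (directory from the report above).
model = whisper.load_model("service/models/medium-v4", device="cpu")

# Hypothetical audio file.
audio = whisper.load_audio("example.wav")

result = whisper.transcribe(model, audio)

# The result includes word-level timestamps and confidence scores.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["text"], word["start"], word["end"], word["confidence"])
```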

yilmazay74 commented 1 year ago

Thanks Jerome. Looks like that was it :) Thanks for pointing that out.