linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence

Unable to transcribe audio when using a fine-tuned whisper medium model #130

Closed · yilmazay74 closed this 1 year ago

yilmazay74 commented 1 year ago

Discussed in https://github.com/linto-ai/whisper-timestamped/discussions/129

Originally posted by **yilmazay74** on October 30, 2023

Hi all,

We have been using Whisper for a while. Recently we started generating our own fine-tuned models by adding customized audio and transcription data. We can use these fine-tuned models with the standard Whisper inference scripts without problems.

Recently we also wanted word-by-word timestamps in the results, so we turned to whisper-timestamped. With whisper-timestamped we can transcribe audio using the pretrained Whisper models (e.g. the medium model) without problems. However, whenever we try to use our own fine-tuned models, it throws exceptions.

If I use the `load_model()` method (i.e. `model = whisper.load_model(...)`), it raises the following exception:

```
File "/home/tekrom/components/whisper/service/__init__.py", line 29, in create_app
    model = load_model("service/models/medium-v4/pytorch_model.bin", device="cpu")
File "/home/tekrom/components/whisper/service/transcribe.py", line 2191, in load_model
    whisper_model.load_state_dict(hf_state_dict)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Whisper:
    Unexpected key(s) in state_dict: "proj_out.weight".
```

If I use the `WhisperForConditionalGeneration` class's `from_pretrained()` method (i.e. `model = WhisperForConditionalGeneration.from_pretrained("service/models/medium-v4")`), it raises the following exception:

```
File "/home/tekrom/components/whisper/service/__init__.py", line 138, in asr3
    left_converted_result = processChannel(get_audio_tensor(audio_left))
File "/home/tekrom/components/whisper/service/__init__.py", line 176, in processChannel
    wResult = transcribe(model,
File "/home/tekrom/components/whisper/service/transcribe.py", line 226, in transcribe_timestamped
    input_stride = N_FRAMES // model.dims.n_audio_ctx
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'dims'
```

So it looks like the model format that whisper-timestamped expects and the one that fine-tuning generates are different, so it cannot find some attributes and fails. I would appreciate it if someone could guide me on how to resolve this issue.

Thanks in advance.
Y. Ay
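For context on the two errors above: an original OpenAI Whisper checkpoint (`.pt`) is a dictionary holding a `dims` section plus a `model_state_dict` whose keys match whisper's `Whisper` module, whereas a transformers fine-tuning run saves a flat `pytorch_model.bin` state dict for `WhisperForConditionalGeneration`, which keeps its output projection under `proj_out` and exposes no `dims` attribute. A minimal sketch to inspect the difference (the paths come from the report above and are only illustrative):

```python
import torch

# State dict produced by transformers fine-tuning (path from the report above).
hf_state_dict = torch.load("service/models/medium-v4/pytorch_model.bin", map_location="cpu")
print([k for k in hf_state_dict if "proj_out" in k])  # e.g. ['proj_out.weight']

# An original OpenAI Whisper checkpoint bundles metadata with the weights
# (hypothetical local file):
# openai_ckpt = torch.load("medium.pt", map_location="cpu")
# print(openai_ckpt.keys())  # dict_keys(['dims', 'model_state_dict'])
```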
Jeronymous commented 1 year ago

whisper-timestamped should be able to load finetuned models (with Speechbrain or transformers).

You haven't shared the code you are using, which would be needed for a real investigation. But given these lines...

File "/home/tekrom/components/whisper/service/init.py", line 29, in create_app
model = load_model("service/models/medium-v4/pytorch_model.bin", device="cpu")

...I guess you are using the loading function from whisper, not the one from whisper-timestamped.

Try `from whisper_timestamped import load_model` instead.
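For reference, a minimal sketch of what the suggested fix might look like, following the usage shown in the whisper_timestamped README; the model directory is the one from the report above, the audio file name is a placeholder, and it assumes `load_model` accepts a local transformers checkpoint directory:

```python
import whisper_timestamped as whisper

# Load the fine-tuned model through whisper_timestamped's loader
# (directory from the report above).
model = whisper.load_model("service/models/medium-v4", device="cpu")

# Hypothetical audio file.
audio = whisper.load_audio("example.wav")

result = whisper.transcribe(model, audio)

# The result includes word-level timestamps and confidence scores.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["text"], word["start"], word["end"], word["confidence"])
```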

yilmazay74 commented 1 year ago

Thanks Jerome. Looks like that was it :) Thanks for pointing that out.