jumon / whisper-punctuator

Zero-shot multimodal punctuation insertion and truecasing using Whisper
MIT License

Tokenizer object has no attribute 'tokenizer' #7

Closed Gusreis7 closed 4 months ago

Gusreis7 commented 1 year ago

Hi, thanks for your project! I've been trying to use it to punctuate some Portuguese audio, but I ran into problems with the tokenizer.

First, I got this error in punctuate.py, line 84, in __init__:

self.tokenizer = self.whisper_tokenizer.tokenizer
AttributeError: 'Tokenizer' object has no attribute 'tokenizer'

After removing the .tokenizer, I got another error at line 221 of punctuate.py: the tokenizer has no convert_ids_to_tokens attribute (the failing call is tokenizer.convert_ids_to_tokens).

Do you have any idea why this is happening?

jumon commented 1 year ago

Thank you for trying out this project!

The issue you are experiencing is caused by a recent change in Whisper (https://github.com/openai/whisper/pull/1044), which replaced Hugging Face's tokenizer with tiktoken. That is why the wrapper no longer has a .tokenizer attribute or a convert_ids_to_tokens method. I will modify this repository to ensure compatibility with the latest version of Whisper.

In the meantime, as a workaround, you can use the older version of Whisper by running the following command:

pip install openai-whisper==20230308

Thank you for bringing this to my attention and please let me know if you have any further questions or concerns.
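If you need code that runs against both Whisper versions before this repository is updated, here is a minimal sketch of a compatibility helper. It is not the fix that will land in punctuate.py. The function name ids_to_token_strings is hypothetical. It assumes that older Whisper releases expose the Hugging Face tokenizer as .tokenizer, and that newer releases expose a tiktoken encoding as .encoding with a decode_single_token_bytes method.

```python
def ids_to_token_strings(whisper_tokenizer, ids):
    """Map token ids to strings for either tokenizer backend (hypothetical helper)."""
    # Older Whisper releases: the wrapper holds a Hugging Face tokenizer
    # in `.tokenizer`, which provides convert_ids_to_tokens.
    hf_tokenizer = getattr(whisper_tokenizer, "tokenizer", None)
    if hf_tokenizer is not None:
        return hf_tokenizer.convert_ids_to_tokens(list(ids))

    # Newer releases (assumption): the wrapper holds a tiktoken encoding
    # in `.encoding`; decode each id to bytes, then to text.
    encoding = getattr(whisper_tokenizer, "encoding", None)
    if encoding is not None:
        return [
            encoding.decode_single_token_bytes(i).decode("utf-8", errors="replace")
            for i in ids
        ]

    raise AttributeError("unrecognized tokenizer: no .tokenizer or .encoding attribute")
```

Note that the two branches do not return identical strings: the Hugging Face path returns byte-level BPE symbols (for example, a leading space shown as a special character), while the tiktoken path returns decoded text, so any downstream matching on token strings would need to account for that.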