salesforce / CodeGen

CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
Apache License 2.0
4.9k stars 379 forks

Error calling tokenizer.get_vocab() (Codegen2.5) #85

Open ShushanArakelyan opened 11 months ago

ShushanArakelyan commented 11 months ago

I wanted to check whether Codegen2.5 uses the same vocabulary as Codegen2 (a question to the authors: does it?), and noticed that calling .get_vocab() on the tokenizer produces an error.

How to reproduce:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
tokenizer.get_vocab()

The expected output would be a dictionary containing the vocabulary. The output I get instead is:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 tokenizer.get_vocab()

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in CodeGen25Tokenizer.get_vocab(self)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in <dictcomp>(.0)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:169, in CodeGen25Tokenizer._convert_id_to_token(self, index)
    167 def _convert_id_to_token(self, index):
    168     """Converts an index (integer) in a token (str) using the vocab."""
--> 169     return self.encoder.decode_single_token_bytes(index).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
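A minimal standalone demonstration of the failure mode (not taken from the CodeGen codebase): a byte-level BPE token can be a raw byte such as 0xa1, which is a UTF-8 continuation byte and therefore not decodable on its own, exactly as the traceback shows.

```python
# A single byte like 0xa1 is a valid token in a byte-level vocabulary,
# but it is not a valid UTF-8 sequence by itself.
token_bytes = b"\xa1"

try:
    token_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xa1 in position 0 ...

# Decodings that never raise, regardless of the byte value:
print(token_bytes.decode("utf-8", errors="backslashreplace"))  # '\xa1' escape
print(token_bytes.decode("latin-1"))  # maps byte 0xa1 to U+00A1
```

latin-1 is lossless here because it maps every byte 0x00-0xFF to the code point of the same value, so the original bytes can always be recovered.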

nedix commented 10 months ago

I have reported the same issue on HF.

Seeing that this project has moved to the Llama 2 architecture, I have been attempting to convert this model to the LLAMA GGML format.

I am currently at a dead end because of inoperable implementations of the get_vocab and save_vocabulary methods in tokenization_codegen25.py. When invoking get_vocab, the problem is that some tokens in the vocabulary are byte sequences that are not valid UTF-8, while the method decodes everything with the utf-8 codec.

These could be solutions:
a. Change the encoding on line 169 of tokenization_codegen25.py from utf-8 to latin-1
b. With the next version of this model, filter non-UTF-8 tokens from the vocabulary
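Option (a) can be sketched as a patched get_vocab. This is illustrative, not the actual CodeGen2.5 code: the encoder stands in for the tokenizer's tiktoken encoding (whose real decode_single_token_bytes method appears in the traceback above), and the function name is hypothetical.

```python
def get_vocab_latin1(encoder, vocab_size):
    """Build a {token_string: id} dict without ever raising UnicodeDecodeError.

    latin-1 maps every byte 0x00-0xFF to a code point of the same value,
    so decoding cannot fail and the original bytes stay recoverable via
    token.encode("latin-1").
    """
    vocab = {}
    for i in range(vocab_size):
        token_bytes = encoder.decode_single_token_bytes(i)
        vocab[token_bytes.decode("latin-1")] = i  # never raises
    return vocab
```

The trade-off: tokens that are valid UTF-8 multi-byte sequences will render as mojibake in the resulting strings, so this is suitable as a lossless intermediate representation (e.g. for a GGML conversion script) rather than for display.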