ShushanArakelyan opened this issue 11 months ago
I have reported the same issue on HF.
Seeing that this project has moved to the Llama2 architecture, I have been attempting to convert this model to the Llama GGML format.
I am currently at a dead end because of inoperable implementations of the `get_vocab` and `save_vocabulary` methods in `tokenization_codegen25.py`. When attempting to invoke the `get_vocab` method, the issue is that some of the vocabulary uses a different encoding from the declared `utf-8`.
Two possible solutions:

a. Change the encoding in `tokenization_codegen25.py` (line 169) from `utf-8` to `latin-1`.
b. In the next version of this model, filter non-UTF-8 characters from the vocabulary.
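Solution (a) works because `latin-1` maps every byte value `0x00`–`0xFF` to a character, so decoding can never fail, whereas `utf-8` rejects invalid byte sequences. A minimal sketch of the difference (the byte value here is illustrative, not taken from the actual vocabulary):

```python
# A lone 0xFF byte is not valid UTF-8, but latin-1 assigns a
# character to every possible byte, so decoding always succeeds.
raw = b"\xff"

try:
    raw.decode("utf-8")  # raises UnicodeDecodeError
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)

print("latin-1 ok:", repr(raw.decode("latin-1")))
```

The trade-off is that `latin-1` silently reinterprets non-UTF-8 bytes rather than rejecting them, which is why option (b), cleaning the vocabulary itself, may be the better long-term fix.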
I wanted to check whether Codegen2.5 uses the same vocabulary as Codegen2 (a question to the authors: does it?), and noticed that calling `.get_vocab()` on the tokenizer produces an error.
How to reproduce:
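A minimal sketch (the checkpoint name is an assumption; any published CodeGen2.5 checkpoint that ships `tokenization_codegen25.py` should behave the same, and `trust_remote_code=True` is required so that file is loaded from the hub):

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/codegen25-7b-multi", trust_remote_code=True
)

vocab = tokenizer.get_vocab()  # raises UnicodeDecodeError
print(len(vocab))
```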
The expected output would be a dictionary containing the vocabulary. The output I get instead is: