Open Arpit2601 opened 1 day ago
@guangy10 do you know the answer to this?
I tried it a while ago. tokenizer.save_pretrained will save the JSON format; even with legacy=True it doesn't save to a format that llama_runner accepts. I was trying to use the tokenizers library's save for the conversion, as shown below. It is WIP and I haven't had a chance to get back to it.
import os
import argparse

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


def create_tokenizer_model(input_dir, output_file):
    vocab_file = os.path.join(input_dir, "vocab.json")
    merges_file = os.path.join(input_dir, "merges.txt")

    # Create BPE model from files
    bpe = BPE.from_file(vocab_file, merges_file)

    # Create tokenizer
    tokenizer = Tokenizer(bpe)

    # Set pre-tokenizer
    tokenizer.pre_tokenizer = Whitespace()

    # Save the tokenizer model
    tokenizer.save(output_file)
    print(f"Tokenizer model saved to {output_file}")

    # Verify the tokenizer
    loaded_tokenizer = Tokenizer.from_file(output_file)
    test_text = "Hello, world! This is a test."
    encoded = loaded_tokenizer.encode(test_text)
    decoded = loaded_tokenizer.decode(encoded.ids)
    print(f"Test encoding/decoding: '{test_text}' -> '{decoded}'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create a tokenizer")
    parser.add_argument("--input_dir", help="Directory containing vocab.json and merges.txt")
    parser.add_argument("--output", default="tokenizer.model", help="Output file name (default: tokenizer.model)")
    args = parser.parse_args()
    create_tokenizer_model(args.input_dir, args.output)
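One thing worth checking with the script above: `Tokenizer.save` in the `tokenizers` library writes JSON regardless of the output file name, so a file saved as `tokenizer.model` this way should still be in the same format as `tokenizer.json`. A minimal stdlib-only sketch to check what a given file actually contains follows. `is_json_file` is a hypothetical helper name, not part of any library.

```python
import json


def is_json_file(path):
    """Return True if the file at `path` parses as a UTF-8 JSON document."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Binary files (or non-JSON text) end up here.
        return False
```

If `is_json_file("tokenizer.model")` returns True, the file is JSON text rather than a binary tokenizer model, which would be consistent with the runner rejecting it.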
Thanks @guangy10 for sharing your WIP script - I tried iterating on it but it would be great if you can share some pointers to get it working.
Will keep you posted when I get back to this work
Hi, I am trying to convert the SmolLm-135M-Instruct model to .pte format and then run it on an Android device. I have successfully converted the model, but ExecuTorch requires the tokenizer in either .bin format or .model format (which can then be converted into .bin format). However, the Hugging Face repository does not include a tokenizer.model or tokenizer.bin file.
How would I go about converting the tokenizer.json file into the appropriate format?
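A possible first step, sketched under assumptions: for a BPE tokenizer, `tokenizer.json` usually keeps the vocabulary under `model.vocab` and the merge rules under `model.merges` (stored either as `"a b"` strings or as `[a, b]` pairs, depending on the `tokenizers` version). The hypothetical helper below, `split_tokenizer_json`, splits those out into `vocab.json` and `merges.txt`, the two files the WIP script in this thread expects as input. It does not produce the .model or .bin format itself.

```python
import json
import os


def split_tokenizer_json(tokenizer_json_path, output_dir):
    """Write vocab.json and merges.txt extracted from a BPE tokenizer.json."""
    with open(tokenizer_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    model = data["model"]
    # Assumption: a missing "type" field is treated as BPE.
    if model.get("type", "BPE") != "BPE":
        raise ValueError(f"Expected a BPE model, got {model.get('type')!r}")

    os.makedirs(output_dir, exist_ok=True)
    vocab_path = os.path.join(output_dir, "vocab.json")
    merges_path = os.path.join(output_dir, "merges.txt")

    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(model["vocab"], f, ensure_ascii=False)

    with open(merges_path, "w", encoding="utf-8") as f:
        for merge in model["merges"]:
            # Merges may be stored as "a b" strings or as [a, b] pairs.
            line = merge if isinstance(merge, str) else " ".join(merge)
            f.write(line + "\n")

    return vocab_path, merges_path
```

For example, `split_tokenizer_json("tokenizer.json", "out")` would write `out/vocab.json` and `out/merges.txt`, which could then be passed to the script via `--input_dir out`.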