pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

How to convert tokenizer of SmolLM model as accepted by executorch #6813

Open Arpit2601 opened 1 day ago

Arpit2601 commented 1 day ago

Hi, I am trying to convert the SmolLM-135M-Instruct model to .pte format and then run it on an Android device. I have been able to convert the model, but ExecuTorch requires the tokenizer either in .bin format or in .model format (which can then be converted to .bin). However, the Hugging Face repo has no tokenizer.model or tokenizer.bin file, only tokenizer.json.

How would I go about converting the tokenizer.json file into the appropriate format?
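For context, a Hugging Face tokenizer.json bundles the BPE model itself: the token-to-id vocab and the merge rules live under its "model" key, so the vocab.json and merges.txt that a BPE converter needs can be extracted with nothing but the standard library. A minimal sketch, assuming the usual tokenizer.json layout (`extract_bpe_files` is just an illustrative name, not an ExecuTorch or tokenizers API):

```python
import json

def extract_bpe_files(tokenizer_json_path, vocab_out, merges_out):
    """Pull vocab.json and merges.txt out of a Hugging Face tokenizer.json.

    Assumes the standard layout: the BPE model sits under the "model" key,
    with "vocab" (token -> id) and "merges" (list of merge rules).
    """
    with open(tokenizer_json_path, encoding="utf-8") as f:
        model = json.load(f)["model"]

    # Write the token -> id map as vocab.json
    with open(vocab_out, "w", encoding="utf-8") as f:
        json.dump(model["vocab"], f, ensure_ascii=False)

    # Write the merge rules as merges.txt, one rule per line
    with open(merges_out, "w", encoding="utf-8") as f:
        f.write("#version: 0.2\n")
        for merge in model["merges"]:
            # Merges appear either as "a b" strings or [a, b] pairs,
            # depending on the tokenizers version that wrote the file.
            if isinstance(merge, list):
                merge = " ".join(merge)
            f.write(merge + "\n")
```

The two files this writes are the inputs that `BPE.from_file(vocab_file, merges_file)` expects.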

larryliu0820 commented 1 day ago

@guangy10 do you know the answer to this?

guangy10 commented 1 day ago

I tried this a while ago. tokenizer.save_pretrained saves the JSON format; even with legacy=True it doesn't save in a format that the llama_runner accepts. I was trying to use tokenizers' save for the conversion, as shown below, but it's WIP and I haven't had a chance to get back to it:

import os
import argparse

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


def create_tokenizer_model(input_dir, output_file):
    vocab_file = os.path.join(input_dir, "vocab.json")
    merges_file = os.path.join(input_dir, "merges.txt")

    # Create a BPE model from the exported vocab/merges files
    bpe = BPE.from_file(vocab_file, merges_file)

    # Create the tokenizer
    tokenizer = Tokenizer(bpe)

    # Set the pre-tokenizer
    tokenizer.pre_tokenizer = Whitespace()

    # Save the tokenizer model
    tokenizer.save(output_file)
    print(f"Tokenizer model saved to {output_file}")

    # Verify the saved tokenizer round-trips
    loaded_tokenizer = Tokenizer.from_file(output_file)
    test_text = "Hello, world! This is a test."
    encoded = loaded_tokenizer.encode(test_text)
    decoded = loaded_tokenizer.decode(encoded.ids)
    print(f"Test encoding/decoding: '{test_text}' -> '{decoded}'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create a tokenizer")
    parser.add_argument("--input_dir", required=True, help="Directory containing vocab.json and merges.txt")
    parser.add_argument("--output", default="tokenizer.model", help="Output file name (default: tokenizer.model)")
    args = parser.parse_args()
    create_tokenizer_model(args.input_dir, args.output)

Arpit2601 commented 16 hours ago

Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

guangy10 commented 1 hour ago

> Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

Will keep you posted when I get back to this work.