mistralai / mistral-common

Apache License 2.0

Tokenize from .jsonl files #9

Open computabeast opened 4 months ago

computabeast commented 4 months ago

It would be nice to tokenize straight from a .jsonl file, with an API along these lines:

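from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_model("open-mixtral-8x22b")
# from_jsonl is the method requested in this issue
tokenized = tokenizer.from_jsonl("my_file.jsonl")
tokens, text = tokenized.tokens, tokenized.text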
...
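
Until something like this exists, the same effect can be approximated with a small helper. Here is a minimal sketch, not part of mistral-common: `tokenize_jsonl` and its `encode` parameter are hypothetical names. It assumes one JSON object per line and leaves the actual tokenization to whatever callable is passed in.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, TypeVar, Union

T = TypeVar("T")


def tokenize_jsonl(
    path: Union[str, Path],
    encode: Callable[[Dict[str, Any]], T],
) -> Iterator[T]:
    """Lazily yield encode(record) for each non-blank line of a .jsonl file.

    Hypothetical helper: `encode` is any callable that turns one parsed
    JSON object into a tokenized result.
    """
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line was malformed instead of a bare JSON error
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            yield encode(record)
```

With mistral-common, `encode` could be a function that builds a `ChatCompletionRequest` from each record and passes it to `tokenizer.encode_chat_completion`, assuming every line of the file matches that request schema. Because the helper is a generator, large files are processed one line at a time.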