openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

How to make it support bigscience/bloom? #55

Closed hawgjmrd72 closed 1 year ago

hauntsaninja commented 1 year ago

Thanks for your interest in tiktoken! The API for the Encoding class is documented over here: https://github.com/openai/tiktoken/blob/ec7c121e385bf1675312c6c33734de6b392890c4/tiktoken/core.py#L26

hawgjmrd72 commented 1 year ago

> Thanks for your interest in tiktoken! The API for the Encoding class is documented over here:
>
> https://github.com/openai/tiktoken/blob/ec7c121e385bf1675312c6c33734de6b392890c4/tiktoken/core.py#L26

Yes, I already read it, but I still don't know how to construct the parameters such as `pat_str` and `mergeable_ranks`. I was wondering if bigscience/bloom shares the same tokenizer as GPT-2, since they are both based on byte-level Byte-Pair Encoding.

hauntsaninja commented 1 year ago

I'd look at the code they've shared to figure out what arguments to pass. I've only worked directly on OpenAI's models — so it's not like I have any special knowledge about bigscience/bloom ;-)
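For future readers: for a GPT-2-style byte-level BPE tokenizer, one way to build `mergeable_ranks` from a Hugging Face `vocab.json` is to invert GPT-2's byte-to-unicode mapping. The sketch below assumes Bloom uses the same byte mapping as GPT-2, which this thread leaves unconfirmed; the `pat_str` shown in the commented usage is GPT-2's pattern and may not match Bloom's pre-tokenization, and the file path and encoding name are hypothetical.

```python
import json


def bytes_to_unicode():
    """GPT-2's reversible byte -> unicode-char mapping (from the original
    GPT-2 encoder.py): printable bytes map to themselves, the rest are
    shifted into unused code points starting at 256."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def vocab_to_mergeable_ranks(vocab):
    """Convert a Hugging Face vocab.json mapping (token string -> id) into
    tiktoken's mergeable_ranks mapping (token bytes -> rank) by undoing
    the byte-to-unicode trick."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return {bytes(char_to_byte[ch] for ch in token): rank
            for token, rank in vocab.items()}


# Usage sketch (path, name, and special tokens are assumptions):
# with open("vocab.json") as f:
#     ranks = vocab_to_mergeable_ranks(json.load(f))
# enc = tiktoken.Encoding(
#     name="bloom",
#     pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+|"""
#             r""" ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",  # GPT-2's pattern
#     mergeable_ranks=ranks,
#     special_tokens={"<|endoftext|>": len(ranks)},
# )
```

Whether the tokens then merge the same way as Bloom's own tokenizer depends on the ranks reflecting merge priority, so it's worth spot-checking the output against `transformers`' tokenizer before relying on it.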