Closed williamhogman closed 1 year ago
This looks like something easy enough for me to take a shot at. Regarding the tokenization algorithm, did you have something specific in mind?
Been playing with LangChain lately (that's how I found this excellent project), and they have a number of text splitters: some naive (e.g. whitespace-based), some that use tokens (e.g. HuggingFace, tiktoken), and others that are integrations from other packages, e.g. NLTK. Any thoughts on what should be implemented first? Since at this point llm-chain integrates with OpenAI and LLaMA, I would think tiktoken and LLaMA's tokenizer would be good candidates, along with a naive whitespace-based one.
Cheers, Spiros
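For reference, the naive whitespace-based splitter mentioned above could look roughly like this. This is just a sketch, not LangChain's or llm-chain's actual implementation; the function name, parameters, and overlap semantics are all assumptions:

```rust
/// Hypothetical naive splitter: breaks text into chunks of `chunk_size`
/// whitespace-separated words, where consecutive chunks share `overlap` words.
fn split_whitespace_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chunks = Vec::new();
    // Advance by (chunk_size - overlap) so adjacent chunks overlap.
    let step = chunk_size - overlap;
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let chunks = split_whitespace_chunks("one two three four five six seven", 3, 1);
    for c in &chunks {
        println!("{}", c);
    }
}
```

Token-based splitters would follow the same shape, just splitting on tokenizer output instead of whitespace.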
@SpirosMakris
Actually, we added support for splitting and counting tokens yesterday, but I think the API needs some polish, and we need to add support for overlaps and richer functionality.
For the new code: https://github.com/sobelio/llm-chain/blob/main/llm-chain/src/tokens.rs
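For anyone skimming the thread: the general shape of a tokenize/count interface might look something like the sketch below. The trait and type names here are hypothetical and are not taken from tokens.rs; check the linked file for the real API:

```rust
/// Hypothetical tokenizer interface; llm-chain's actual tokens.rs API may differ.
trait Tokenizer {
    /// Split text into tokens (here, plain strings for simplicity).
    fn tokenize(&self, text: &str) -> Vec<String>;

    /// Count tokens, e.g. to measure context-window usage.
    fn count_tokens(&self, text: &str) -> usize {
        self.tokenize(text).len()
    }
}

/// Trivial whitespace tokenizer standing in for tiktoken or a LLaMA tokenizer.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

fn main() {
    let tok = WhitespaceTokenizer;
    println!("{}", tok.count_tokens("counting context window usage"));
}
```

A trait with a default `count_tokens` lets each backend (tiktoken, LLaMA, naive) supply only its own `tokenize`.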
Hi Will!
Submitted a PR for this, but since it's my first one ever, I don't know how to link it to this issue.
Function for measuring context windows