sobelio / llm-chain

`llm-chain` is a powerful Rust crate for building chains in large language models, allowing you to summarise text and complete complex tasks.
https://llm-chain.xyz
MIT License

Add advanced support for splitting strategies for tokens #10

Closed · williamhogman closed this 1 year ago

williamhogman commented 1 year ago

A function for splitting text so that it fits the context window. It should take:

  1. A tokenisation algorithm
  2. The number of tokens to split to

and return a Vec where each string can fit within the window. A rough sketch of one possible shape follows.
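A minimal sketch of what that could look like; the `Tokenizer` trait and the names here are purely illustrative assumptions, not the actual llm-chain API:

```rust
/// Illustrative tokenizer abstraction (hypothetical, not llm-chain's trait).
pub trait Tokenizer {
    /// Encode text into token IDs.
    fn tokenize(&self, text: &str) -> Vec<u32>;
    /// Decode token IDs back into text.
    fn detokenize(&self, tokens: &[u32]) -> String;
}

/// Split `text` into chunks of at most `max_tokens` tokens each,
/// returning a Vec where every string fits the context window.
pub fn split_to_fit<T: Tokenizer>(tokenizer: &T, text: &str, max_tokens: usize) -> Vec<String> {
    assert!(max_tokens > 0, "max_tokens must be positive");
    let tokens = tokenizer.tokenize(text);
    tokens
        .chunks(max_tokens)
        .map(|chunk| tokenizer.detokenize(chunk))
        .collect()
}
```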
SpirosMakris commented 1 year ago

This looks like something easy enough for me to take a shot at. Regarding the tokenization algorithm, did you have something specific in mind?

I've been playing with LangChain lately (that's how I found this excellent project), and they have a number of text splitters: some naive (e.g. whitespace-based), some that use tokens (e.g. HuggingFace, Tiktoken, etc.), and others that are integrations from other packages, e.g. NLTK. Any thoughts on what should be implemented first? Since llm-chain currently integrates with OpenAI and Llama, I would think Tiktoken and Llama's tokenizer would be good candidates, along with a naive whitespace-based one.

Cheers, Spiros
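For reference, the naive whitespace-based splitter mentioned above can be sketched in a few lines. This is only an illustration under the assumption that each whitespace-separated word counts as one "token"; it is not code from llm-chain or LangChain:

```rust
/// Naive whitespace-based splitter: treats each whitespace-separated
/// word as one "token" and groups words into chunks of at most
/// `max_words`. Real tokenizers (Tiktoken, the Llama tokenizer)
/// would count tokens differently.
pub fn split_whitespace_chunks(text: &str, max_words: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be positive");
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words)
        .map(|chunk| chunk.join(" "))
        .collect()
}

// Example usage: split a long document into chunks of at most 200 words.
// let chunks = split_whitespace_chunks(&document, 200);
```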

williamhogman commented 1 year ago

@SpirosMakris

Actually, we added support for splitting and counting tokens yesterday, but I think the API needs some polish, and we need to add support for overlaps and richer functionality.

For the new code: https://github.com/sobelio/llm-chain/blob/main/llm-chain/src/tokens.rs
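To illustrate the overlap idea mentioned above (this is a hypothetical sketch over raw token IDs, not the actual API in tokens.rs): consecutive chunks share a fixed number of tokens so that context isn't lost at chunk boundaries.

```rust
/// Hypothetical overlap-aware splitter over token IDs. Each chunk has at
/// most `chunk_size` tokens, and consecutive chunks share `overlap` tokens.
pub fn split_with_overlap(tokens: &[u32], chunk_size: usize, overlap: usize) -> Vec<Vec<u32>> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_size).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}
```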

SpirosMakris commented 1 year ago

Hi Will!

I've submitted a PR for this, but since it's my first one ever, I don't know how to link it to this issue.