jstjoe opened this issue 2 months ago
Thanks for the feature request. Some notes:

1) This goes beyond just embeddings and should be a generic utility that can be used with LLM calls as well.
2) The tokenizers themselves won't be integrated into the package, since they are often large (and lead to large dependencies), can cause deployment problems, and new providers can be added at any time. Instead, an interface will be offered for integrating tokenizers.
3) You can get far by using a character-count-based estimation, i.e. 1 token = 3-4 characters on average (for the main OpenAI tokenizer, at least).
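As a rough illustration of the character-count heuristic in point 3, here is a minimal sketch (not part of the SDK; the function name, the 4-characters-per-token ratio, and the 8191-token limit shown are assumptions/examples):

```ts
// Rough token estimate based on the ~3-4 characters per token rule of thumb
// for OpenAI-style tokenizers. This is a heuristic, not an exact count.
function estimateTokenCount(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Example: leave headroom below a model's input limit before calling embed().
const MAX_TOKENS = 8191; // e.g. the documented input limit for OpenAI embedding models
const input = 'some long user-provided document...';
if (estimateTokenCount(input) > MAX_TOKENS) {
  // truncate, chunk, or summarize before embedding
}
```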
Feature Description
I'd love at least one, or all, of these features to solve the challenges discussed in Use Case below:

- `countTokens(string, model)`: a utility function to calculate the token count for a string given a model (since models can tokenize the same string differently); see the sketch after this list.
- `embed()` automatically truncating and retrying on failure, or pre-emptively truncating before sending to the API (if tokens can be counted without calling the API).
- `embed()` automatically chunking the input if it is too long, based on the limits and token counts of the chosen model. This would effectively, optionally, let `embed()` return many embeddings. This might actually fit better as an option to pass a single string to `embedMany()` and get back as many embeddings as needed?
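For concreteness, a hypothetical shape for the first feature, in line with the note above that the SDK would expose a tokenizer interface rather than bundle tokenizers. All names here are illustrative, not an actual AI SDK API:

```ts
// Hypothetical tokenizer interface the SDK could accept, so that heavy tokenizer
// dependencies (e.g. a tiktoken implementation) stay in user land.
interface Tokenizer {
  countTokens(text: string): number;
}

// Hypothetical utility: count tokens with a caller-supplied tokenizer,
// falling back to a character-count estimate when none is provided.
function countTokens(text: string, tokenizer?: Tokenizer): number {
  if (tokenizer) {
    return tokenizer.countTokens(text);
  }
  return Math.ceil(text.length / 4); // ~4 characters per token heuristic
}
```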
Use Case
I love using `embed()` to quickly and easily generate embeddings for a string with the model and settings I provide. I use it all over, for different use cases.
But recently I started running into a problem when using `embed()` with user-provided documents. I want a vector representation of the document as a whole for comparing document similarity, but if the document contains too many tokens for the chosen model (which depends not simply on character length but on token length, i.e. on the tokenization behavior of the model in question), then `embed()` errors out. And to truncate, summarize, or chunk the string appropriately, I need to calculate the token length of substrings before calling `embed()`, or it will simply fail again.
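In the meantime, a user-land workaround along these lines is possible: estimate token length by character count, split the document into chunks under an approximate budget, and pass the chunks to `embedMany()`. This is a sketch assuming the `@ai-sdk/openai` provider and the current `embedMany` signature; the chunk budget and the characters-per-token ratio are assumptions:

```ts
import { embedMany } from 'ai';
import { openai } from '@ai-sdk/openai';

// Split a long document into chunks that stay under an approximate token budget,
// using the ~4 characters per token heuristic.
function chunkByEstimatedTokens(
  text: string,
  maxTokens: number,
  charsPerToken = 4,
): string[] {
  const maxChars = maxTokens * charsPerToken;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Embed every chunk of a user-provided document in one call.
const document = 'long user-provided document...';
const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunkByEstimatedTokens(document, 6000), // headroom below the model's token limit
});
```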
All of this makes it difficult to switch between models, but also just to rely entirely on the AI SDK. To call `embed()` with a string of unknown length (where strings are expected to get quite long), I need to write my own logic to first check that string's token length, and keep my own map of the maxToken limits for different models/APIs. These problems seem like they would be common for LLM API users, and they vary across APIs, so they would fit well in Vercel's AI SDK wheelhouse.

Thank you for your consideration, and for the great SDK.
Additional context
Generating embeddings for user-provided content of variable (and generally long) length presents problems that currently seem unsolved.