kettoleon closed this 1 month ago
Yes, this would be very practical to have implemented in this repo. Excited to see it in action.
I was just thinking about this too - didn't think of using the count from the embedding call though - thanks!
Would still be better to have the token count returned. Possibly even have the ability to return the tokenized text and get that returned too. Somebody linked this on reddit the other day and it's quite interesting:
This would be pretty valuable. Useful for other libraries which call token counting methods as a part of their everyday flow. I want to integrate Ollama with some such flows.
I spent some time looking at what can be added: it seems simple enough. However, I noticed this discussion: https://github.com/ollama/ollama/pull/988
I believe that an encoding endpoint is still relevant because it enables a broader range of APIs. I'd love some clarity on whether I should complete and PR my changes.
I pretty much copy-pasted the generate script... https://github.com/ollama/ollama/compare/main...suvalaki:ollama:main (Not very DRY, but I'll await further comment before improving.)
Looks a bit like this at a request level: Input
{
"model": "mistral:latest",
"prompt": "Why is the sky blue?"
}
Output
{
"model": "mistral:latest",
"created_at": "2024-02-05T21:49:44.472893Z",
"total_duration": 8965307875,
"load_duration": 8961889917,
"context": [
733,
16289,
28793,
28705,
4315,
349,
272,
7212,
5045,
28804,
733,
28748,
16289,
28793,
13
],
"prompt_eval_count": 15
}
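Until a dedicated endpoint exists, the prompt's token count can be pulled straight out of a generate-style response like the sample above. A minimal sketch (the helper name `prompt_token_count` is mine; in real /api/generate responses the `context` array can also include generated tokens, so the explicit `prompt_eval_count` field is preferred when present):

```python
def prompt_token_count(response: dict) -> int:
    """Token count of the prompt: prefer the explicit counter and
    fall back to the length of the returned `context` id list."""
    if "prompt_eval_count" in response:
        return response["prompt_eval_count"]
    return len(response.get("context", []))

# The sample response from above, trimmed to the relevant fields.
sample = {
    "model": "mistral:latest",
    "context": [733, 16289, 28793, 28705, 4315, 349, 272, 7212,
                5045, 28804, 733, 28748, 16289, 28793, 13],
    "prompt_eval_count": 15,
}
print(prompt_token_count(sample))  # 15
```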
There is no native support yet, but ollama-webui seems to have it.
I mean, it being available in the webui doesn't really solve the issue, does it?
If I understand you correctly.
I'd just read the modifications I made in my branch and you'll see the delta. It's the difference between a priori and a posteriori knowledge ...
You just want access to the underlying tokenizer without needing to call generate (at the API layer).
Sounds Greek to me. Wish you all the luck.
It seems like a lot of people want this.
https://github.com/ollama/ollama/issues/1716 and https://github.com/ollama/ollama/issues/3582
Llama.cpp server has POST /tokenize and POST /detokenize now, so hopefully Ollama can just expose the api.
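If Ollama ends up mirroring the llama.cpp server endpoints, a client could look roughly like this. A minimal Python sketch, assuming a llama.cpp server on `localhost:8080` and its documented `{"content": ...}` / `{"tokens": ...}` request bodies (the helper names are mine):

```python
import json
import urllib.request

def _post_json(url: str, payload: dict) -> dict:
    # Small JSON-over-HTTP helper using only the standard library.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def tokenize(text: str, base_url: str = "http://localhost:8080") -> list:
    """Token ids for `text` via the server's POST /tokenize."""
    return _post_json(f"{base_url}/tokenize", {"content": text})["tokens"]

def detokenize(tokens: list, base_url: str = "http://localhost:8080") -> str:
    """Reconstructed text via POST /detokenize."""
    return _post_json(f"{base_url}/detokenize", {"tokens": tokens})["content"]
```

Token counting then reduces to `len(tokenize("Why is the sky blue?"))`, with no generation or embedding computation involved.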
Thanks for the issue!
/api/show will now show embedding length.
Regarding /tokenize and /detokenize, closing in favor of this issue: https://github.com/ollama/ollama/issues/3582
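For reference, the embedding length can be read back over HTTP as well. A hedged sketch against POST /api/show, assuming the default `localhost:11434` address and that the response carries a `model_info` map with architecture-prefixed keys (e.g. `llama.embedding_length`); worth verifying against your Ollama version:

```python
import json
import urllib.request

def embedding_length(model: str, base_url: str = "http://localhost:11434") -> int:
    """Read the model's embedding length from Ollama's POST /api/show.

    The response layout (a `model_info` map with keys such as
    "llama.embedding_length") is an assumption; check it against the
    actual /api/show output of your Ollama version.
    """
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)["model_info"]
    # Pick whichever architecture-prefixed embedding_length key is present.
    return next(v for k, v in info.items() if k.endswith("embedding_length"))
```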
Hi, I've been using ollama for a few days, I really like it.
However, I'm using it by making raw requests, meaning I'm handling the context myself.
Under this use case, the system needs to count tokens for many strings to decide what goes into the context and what is too much.
For now, I've been using the embedding API, and taking the length of embeddings vector as token count.
But I understand an "only count tokens without computing embeddings" API would be way faster.
I'm assuming something like that is possible? I was using exllama before Ollama, and it had something like that, but I never went into the details to see how it was done.
It would be awesome if someone could make a PR for that, or point me in the right direction to do the PR myself 😜 (although my python knowledge is scarce) .