ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

[WISH] API for token count? faster than embeddings vector length? #1345

Closed kettoleon closed 1 month ago

kettoleon commented 10 months ago

Hi, I've been using ollama for a few days and I really like it.

However, I'm using it by making raw requests, meaning I'm handling the context myself.

Under this use case, the system needs to count tokens for many strings to decide what fits in the context and what is too much.

For now, I've been using the embedding API and taking the length of the embeddings vector as the token count.

But I understand that an "only count tokens, without computing embeddings" API would be way faster.

I'm assuming something like that is possible? I was using exllama before ollama, and it had something like that, but I never looked into the details of how it was done.

It would be awesome if someone could make a PR for that, or point me in the right direction to do the PR myself 😜 (although my Python knowledge is scarce).
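
For reference, the embedding-length workaround described above looks roughly like the following sketch. It only illustrates the commenter's approach: it assumes Ollama's `/api/embeddings` endpoint and simply takes the length of the returned vector, which may or may not match the true token count depending on the model and server version.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama address

def token_count_via_embeddings(model: str, text: str) -> int:
    """Workaround from this thread: request an embedding and use the vector
    length as a token count. This is not an official token-count API; whether
    the length actually corresponds to the number of input tokens depends on
    the model/backend."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return len(resp.json()["embedding"])

if __name__ == "__main__":
    print(token_count_via_embeddings("mistral:latest", "Why is the sky blue?"))
```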

oliverbob commented 10 months ago

Yes, it would be very practical to see this implemented in this repo. Excited to see it in action.

jukofyork commented 8 months ago

I was just thinking about this too - didn't think of using the count from the embedding call though - thanks!

It would still be better to have the token count returned, and possibly even the ability to get the tokenized text back too. Somebody linked this on Reddit the other day and it's quite interesting:

https://www.danieldemmel.me/tokenizer.html

suvalaki commented 8 months ago

This would be pretty valuable. It would be useful for other libraries that call token-counting methods as part of their everyday flow; I want to integrate Ollama with some such flows.

I spent some time looking at what could be added; it seems simple enough. However, I noticed this discussion: https://github.com/ollama/ollama/pull/988

I believe an encoding endpoint is still relevant because it enables a broader range of APIs. I'd love some clarity on whether I should complete and PR my changes.

I pretty much copy-pasted the generate script... https://github.com/ollama/ollama/compare/main...suvalaki:ollama:main (not very DRY, but I will await further comment before improving it).

It looks a bit like this at the request level:

Input:

```json
{
  "model": "mistral:latest",
  "prompt": "Why is the sky blue?"
}
```

Output:

```json
{
    "model": "mistral:latest",
    "created_at": "2024-02-05T21:49:44.472893Z",
    "total_duration": 8965307875,
    "load_duration": 8961889917,
    "context": [
        733,
        16289,
        28793,
        28705,
        4315,
        349,
        272,
        7212,
        5045,
        28804,
        733,
        28748,
        16289,
        28793,
        13
    ],
    "prompt_eval_count": 15
}
```
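
For illustration, a client for such an endpoint could look roughly like the sketch below. The `/api/tokenize` path is hypothetical (the branch above is not merged into upstream Ollama); the request and response shapes simply mirror the example above.

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def tokenize(model: str, prompt: str) -> tuple[list[int], int]:
    """Call a hypothetical tokenize-style endpoint that returns the prompt's
    token ids ("context") and their count ("prompt_eval_count")."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/tokenize",  # hypothetical path, not in upstream Ollama
        json={"model": model, "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    return body["context"], body["prompt_eval_count"]

if __name__ == "__main__":
    tokens, count = tokenize("mistral:latest", "Why is the sky blue?")
    print(count, tokens)
```
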
oliverbob commented 8 months ago

There is no native support yet, but ollama-webui seems to have it.

suvalaki commented 8 months ago

I mean, it being available in the webui doesn't really solve the issue, does it?

oliverbob commented 8 months ago

[screenshot]

If I understand you correctly.

suvalaki commented 8 months ago

I'd just read the modifications I made in my branch and you'll see the delta. It's the difference between a priori knowledge and a posteriori ...

You just want access to the underlying tokenizer without needing to call generate (at the API layer).

oliverbob commented 8 months ago

Sounds Greek to me. Wish you all the luck.

chigkim commented 5 months ago

It seems like a lot of people want this.

https://github.com/ollama/ollama/issues/1716 and https://github.com/ollama/ollama/issues/3582

The llama.cpp server has POST /tokenize and POST /detokenize now, so hopefully Ollama can just expose the API.
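
For reference, a rough client-side sketch against those llama.cpp server endpoints (assuming a llama-server instance on its default port; field names follow the llama.cpp HTTP server documentation):

```python
import requests

LLAMA_SERVER = "http://localhost:8080"  # default llama.cpp server address

def tokenize(text: str) -> list[int]:
    # POST /tokenize takes {"content": ...} and returns {"tokens": [...]}
    resp = requests.post(f"{LLAMA_SERVER}/tokenize", json={"content": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["tokens"]

def detokenize(tokens: list[int]) -> str:
    # POST /detokenize takes {"tokens": ...} and returns {"content": "..."}
    resp = requests.post(f"{LLAMA_SERVER}/detokenize", json={"tokens": tokens}, timeout=30)
    resp.raise_for_status()
    return resp.json()["content"]

if __name__ == "__main__":
    toks = tokenize("Why is the sky blue?")
    print(len(toks), detokenize(toks))
```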

jmorganca commented 1 month ago

Thanks for the issue!

/api/show will now show the embedding length.
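
A quick sketch of reading that value from the endpoint (assuming the model_info map returned by recent Ollama builds; the exact key is architecture-prefixed, e.g. llama.embedding_length for llama-family models):

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def embedding_length(model: str) -> int | None:
    """Read the embedding length from /api/show's model_info map."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/show",
        json={"model": model},  # older Ollama versions expect "name" instead of "model"
        timeout=30,
    )
    resp.raise_for_status()
    info = resp.json().get("model_info", {})
    # Keys are architecture-prefixed, e.g. "llama.embedding_length", so match the suffix.
    for key, value in info.items():
        if key.endswith(".embedding_length"):
            return value
    return None

if __name__ == "__main__":
    print(embedding_length("mistral:latest"))
```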

Regarding /tokenize and /detokenize, closing in favor of this issue: https://github.com/ollama/ollama/issues/3582