huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Model reuse in TextGeneration examples #1419

Open jondot opened 10 months ago

jondot commented 10 months ago

Hi, I'd like to rig one of the examples into an HTTP service, where the service receives a prompt and runs TextGeneration. As it stands, TextGeneration wants to own the model and tokenizer, which means they need to be created from scratch on every request (time-consuming, and unacceptable for a per-request lifecycle). Any recommendations on how to do this?

ealmloff commented 10 months ago

It looks like https://github.com/huggingface/candle/pull/1370 might solve this issue for the quantized version of llama. You could clear the cache after every request and then keep generating.
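A rough sketch of that request loop, assuming the model ends up exposing something like a `clear_kv_cache()` reset hook (the types and method names below are placeholders, not the actual candle API):

```rust
// Pattern sketch only: `clear_kv_cache` stands in for whatever reset hook
// #1370 actually exposes; the real candle types differ.
struct Model {
    kv_cache: Vec<f32>,
}

impl Model {
    fn generate(&mut self, prompt: &str) -> String {
        // ... the real forward pass / sampling loop would go here ...
        format!("completion for: {prompt}")
    }

    fn clear_kv_cache(&mut self) {
        self.kv_cache.clear();
    }
}

fn handle_request(model: &mut Model, prompt: &str) -> String {
    let out = model.generate(prompt);
    // Reset per-request state so the next prompt starts from scratch,
    // without reloading weights from disk.
    model.clear_kv_cache();
    out
}
```

The point is just that the expensive weight loading happens once; only the mutable cache is reset between requests.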

A different approach is separating the history of the session from the model entirely. Once you are done with a session's history, you can keep reusing the tokenizer and model without resetting anything. That is the approach I take in Kalosm here: it lets you generate text without mutating the model's state. Separating the history also lets you serialize and deserialize it, which can be useful if you want to resume text generation quickly after a disconnect.
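For illustration, here is the rough shape of that separation. This is not Kalosm's actual API, just a sketch of the idea, and it assumes a serde dependency for the serialization point:

```rust
// Rough shape of the "history lives outside the model" idea; this is not
// Kalosm's actual API.
use serde::{Deserialize, Serialize};

#[derive(Default, Serialize, Deserialize)]
struct ChatHistory {
    tokens: Vec<u32>,
}

struct SharedModel; // stands in for the loaded weights + tokenizer

impl SharedModel {
    // The model is only borrowed immutably, so many sessions can share it;
    // everything mutable lives in the caller-owned history.
    fn generate(&self, history: &mut ChatHistory, prompt_tokens: &[u32]) -> Vec<u32> {
        history.tokens.extend_from_slice(prompt_tokens);
        let new_tokens = vec![42u32]; // placeholder for the real sampling loop
        history.tokens.extend_from_slice(&new_tokens);
        new_tokens
    }
}
```

Because the history is plain data, it can be written to disk and restored later to resume a session.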

Edit: I also created a streaming text generation server that uses candle here

danielclough commented 10 months ago

Here is another example to reference.

The model loads when the server starts so that multiple users can connect to the same instance.

I'm just passing a `&model` and then `.clone()`-ing it.
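Roughly, the pattern looks like this (`Model` here is a stand-in for whichever concrete candle model is being served, not the real type):

```rust
// Sketch of the "load once at startup, clone per request" pattern; `Model`
// is a placeholder for the concrete candle model type being served.
use std::sync::Arc;
use std::thread;

#[derive(Clone)]
struct Model {
    weights: Vec<f32>, // stands in for the loaded tensors
}

impl Model {
    fn generate(&mut self, prompt: &str) -> String {
        format!("completion for: {prompt}")
    }
}

fn main() {
    // Load the weights once when the server starts, then share them.
    let shared = Arc::new(Model { weights: vec![0.0; 1024] });

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Each request clones the model so it gets its own mutable
                // generation state while the already-loaded weights are reused.
                let mut model = (*shared).clone();
                model.generate(&format!("request {i}"))
            })
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```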

jondot commented 10 months ago

@danielclough thanks! @ealmloff thanks! Kalosm looks great, I'll try to use it directly. It looks like you use both llm-rs and candle. What's your impression?

ealmloff commented 10 months ago

> @danielclough thanks! @ealmloff thanks! Kalosm looks great, I'll try to use it directly. It looks like you use both llm-rs and candle. What's your impression?

llm-rs is faster, but it supports fewer models and is less controllable. It only exposes basic text generation, so you cannot save a chat history cache or use constrained generation.

jondot commented 10 months ago

@ealmloff just coming back to say: Kalosm is REALLY REALLY great! I just integrated it into a service flawlessly. I didn't hit a Tokio runtime crash on shutdown with reqwest (like I had with other infrastructure). I really think it should serve as a basis for how to make candle itself this accessible. Also, I'd be happy if you could publish a release of Kalosm (I'm using the git dependency for now). Kudos!

ealmloff commented 10 months ago

> @ealmloff just coming back to say: Kalosm is REALLY REALLY great! I just integrated it into a service flawlessly. I didn't hit a Tokio runtime crash on shutdown with reqwest (like I had with other infrastructure). I really think it should serve as a basis for how to make candle itself this accessible.

Thanks! I'm glad it works well for you. Let me know if you run into any issues

> Also, I'd be happy if you could publish a release of Kalosm (I'm using the git dependency for now). Kudos!

I'm working on adding some documentation here. After that is finished, I plan to release 0.1.0

jondot commented 10 months ago

Fantastic stuff! Thanks for the help and sorry for the trouble ❤️

EricLBuehler commented 8 months ago

@jondot , perhaps you could check out candle-vllm?

jondot commented 6 months ago

@EricLBuehler will do, I'm getting back to this topic now and trying to experiment with other models. @danielclough I'm wondering what the cost of cloning would be? Now that I want to try every model in candle (not just the llama family), it seems this would be the best technique (other than reimplementing/patching the models that the candle team created).

jondot commented 6 months ago

Meanwhile I ran a test with mistral: cloning a freshly loaded model takes on the order of 1-1.5 ms, versus ~100 µs otherwise. I believe that's considerable overhead for Rust (i.e. Rust doing real work cloning the weight tree).
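A minimal sketch of how such a measurement can be done (`Model` and `load_model` below are placeholders, not the real candle mistral loading code):

```rust
// Minimal sketch of measuring the per-request clone cost; `Model` and
// `load_model` are placeholders, not the real candle mistral code.
use std::time::Instant;

#[derive(Clone)]
struct Model {
    weights: Vec<f32>,
}

fn load_model() -> Model {
    // Stands in for reading the safetensors/gguf weights from disk.
    Model { weights: vec![0.0; 1 << 20] }
}

fn main() {
    let model = load_model();

    let start = Instant::now();
    let _per_request = model.clone();
    // On a real mistral model the comment above reports ~1-1.5 ms vs ~100 µs.
    println!("clone took {:?}", start.elapsed());
}
```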