Closed spring1915 closed 7 months ago
Thanks for your interest.
Yes, any model based on LLaMA2 is supported, provided you did not change the model structure.
Regarding streaming mode, I'm not entirely sure what you're referring to. Could you provide a bit more context or elaborate on your question? That will help me give a more accurate answer.
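For reference, a hedged sketch of loading a locally stored LLaMA2-based checkpoint with lade. The `augment_all`/`config_lade` calls follow the pattern documented in the lade README; the checkpoint path, dtype, and the parameter values passed to `config_lade` are placeholder assumptions, not something stated in this thread.

```python
def load_with_lade(model_path: str):
    """Load a local fine-tuned Llama 2 checkpoint with lookahead decoding enabled.

    Sketch only: assumes the lade package's documented entry points
    (lade.augment_all, lade.config_lade) and a standard transformers
    checkpoint layout at model_path.
    """
    import lade
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Patch transformers' generation loop in place with lookahead decoding.
    lade.augment_all()
    # Placeholder lookahead configuration; tune for your hardware/model.
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    return model, tokenizer
```

After this, `model.generate(...)` is used as usual; the patching step is what routes generation through lookahead decoding.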
@spring1915 by streaming do you mean an online inference endpoint receiving a stream of requests?
I meant serving the model so that it produces a stream of tokens for each request, like the OpenAI chat API does.
If I understand correctly, it is supported. There are examples in chatbot.py.
A fine-tuned Llama2 model may be stored locally. Can it be integrated with lade?
Can lade be used when the model is served in streaming mode?
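To make the "streaming mode" question concrete, here is a minimal sketch of the interface a streaming endpoint exposes: an iterator of token strings consumed incrementally by the client. The function and token list below are illustrative stand-ins, not from the lade codebase; with lookahead decoding, a single step may accept several tokens at once, but the client-facing stream looks the same.

```python
from typing import Iterator

# Stand-in reply tokens; a real server would run the model here.
_REPLY = ["Hello", ",", " world", "!"]

def stream_tokens(prompt: str) -> Iterator[str]:
    """Yield tokens one by one, as an OpenAI-style streaming endpoint does."""
    for token in _REPLY:
        yield token

if __name__ == "__main__":
    # Client side: print each token as it arrives instead of waiting
    # for the full completion.
    for tok in stream_tokens("Hi"):
        print(tok, end="", flush=True)
    print()
```

The decoding strategy (greedy, speculative, lookahead) changes how tokens are produced per step, not the shape of this streaming interface.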