ml-research / LlavaGuard

How to speed up model inference #1

Open whyiug opened 4 months ago

whyiug commented 4 months ago

Hi guys, thanks for your work. I have a question: the fixed policy templates are very long, which can seriously slow down model inference. Have you considered any optimizations, for example storing the KV cache? For LlamaGuard, prefix KV caching can be used because the policy is a fixed prefix. (This may not be possible with the LLaVA architecture, where the prefix is an image rather than a fixed template, and the image tokens are not fixed. I was just wondering what you think.)
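For reference, here is a rough sketch of what I mean by prefix KV caching for a text-only fixed prefix, in the spirit of what works for LlamaGuard. The model name, prompt strings, and helper name are placeholders, not LlavaGuard's actual setup, and exact cache handling depends on the transformers version:

```python
# Sketch: cache the KV states of a long, constant policy prefix once and
# reuse them for every request. Placeholder names throughout.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/LlamaGuard-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# The long, constant policy template is encoded once up front.
policy_prefix = "<long fixed policy template> ..."
prefix_ids = tokenizer(policy_prefix, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

def classify(user_text: str) -> str:
    # Only the short, per-request suffix still needs a forward pass;
    # the cached prefix key/values are reused (copied so reuse stays safe).
    suffix_ids = tokenizer(
        user_text, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=-1)
    out = model.generate(
        input_ids,
        past_key_values=copy.deepcopy(prefix_cache),
        max_new_tokens=64,
    )
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
```

With LLaVA-style inputs this does not transfer directly, since the image tokens come before the fixed policy text, which is what made me wonder about your plans.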

whyiug commented 4 months ago

Here's an idea: put the policy in the system prompt, as sketched below.
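For illustration, a rough sketch of the prompt layout this would give (the template strings below are assumptions, not LlavaGuard's actual conversation format): with the policy in the system turn, every request shares the same text prefix up to the image tokens, so that prefix becomes a candidate for KV-cache reuse.

```python
# Hypothetical prompt builder: the fixed policy sits in the system turn, so the
# text before the image is identical across requests and can be cached once.
POLICY = "<long fixed safety policy>"  # constant across all requests

def build_prompt(user_question: str) -> str:
    return (
        f"SYSTEM: {POLICY}\n"                # fixed, cacheable prefix
        f"USER: <image>\n{user_question}\n"  # image + per-request text
        f"ASSISTANT:"
    )
```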

lukashelff commented 3 months ago

Thank you for the hint. Initially, we also considered stating the policy within the system prompt. Unfortunately, the conversation templates are implemented fairly statically in LLaVA's training code. So far, we haven't had the chance to implement this, but the idea is very sensible, and we will probably include it in the next iteration of LlavaGuard.