meta-llama / llama

Inference code for Llama models

Will the cache kv become invalid? #1099

Open oslijunw opened 5 months ago

oslijunw commented 5 months ago

In a multi-threaded setting, if GPU server resources are insufficient, can KV-cache preemption occur between requests? For example, suppose two long conversations, a and b, are generating at the same time. If conversation a's request is scheduled partway through conversation b's generation, will b's KV cache be lost — that is, will b's forward pass end up reading conversation a's cached keys and values? Because this involves GPU scheduling and my resources are insufficient, I have not been able to verify this myself.
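To illustrate the failure mode being asked about: in the Llama inference code, each attention layer holds preallocated `cache_k`/`cache_v` buffers indexed by batch slot and sequence position. The sketch below is a toy stand-in (the class and method names are illustrative, not the repo's API) showing that if two conversations share the same batch slot, the second one silently overwrites the first one's cached context, while separate slots keep the caches independent:

```python
# Toy model of a per-layer KV cache: a buffer indexed by
# (batch_slot, seq_pos). Names here are illustrative only.
class KVCache:
    def __init__(self, max_batch_size, max_seq_len):
        self.keys = [[None] * max_seq_len for _ in range(max_batch_size)]

    def write(self, batch_slot, pos, value):
        self.keys[batch_slot][pos] = value

cache = KVCache(max_batch_size=2, max_seq_len=8)

# Conversations a and b both (wrongly) reuse batch slot 0:
for pos in range(4):
    cache.write(0, pos, ("a", pos))
for pos in range(4):
    cache.write(0, pos, ("b", pos))

# a's cached keys are gone -- slot 0 now holds b's context,
# so a's next decoding step would attend over b's history.
print(cache.keys[0][:4])

# Giving each conversation its own batch slot avoids the clobbering:
for pos in range(4):
    cache.write(1, pos, ("a", pos))
print(cache.keys[1][:4])
```

Under this reading, the question reduces to whether the serving code assigns each concurrent conversation its own batch slot (or its own cache) rather than preempting slots under load.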