In a multi-threaded setup, if GPU server resources are insufficient, can KV cache preemption occur? For example, suppose two long conversations are running at the same time. If both are halfway through and conversation A cuts in on conversation B, I would expect conversation B's KV cache to be lost, so that conversation A's KV cache is used instead. Because this involves GPU computation and I do not have enough resources, I have not been able to verify this myself.