Free engine resource for the slot after finished one request decoding

google / JetStream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

Apache License 2.0


Free engine resource for the slot after finished one request decoding #119

Closed FanhaiLu1 closed 1 month ago

FanhaiLu1 commented 1 month ago

This PR adds one feature: freeing engine resources (cache and other resources) for a slot after its request completes.

Advanced kernels such as PagedAttention reserve page blocks for tokens during insert and decode. All of these reserved resources must be freed once a request finishes decoding, so that the freed page blocks can be reused for incoming requests.
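To illustrate the reserve-then-free cycle described above, here is a minimal sketch of a page block pool. The `PageBlockPool` class and its `reserve` and `free_slot` methods are hypothetical names made up for this example; they are not JetStream or PagedAttention APIs, and a real kernel tracks far more state.

```python
from collections import defaultdict


class PageBlockPool:
    """Toy pool of page blocks shared by all decode slots (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        # Maps a slot index to the block ids currently reserved for it.
        self.blocks_by_slot = defaultdict(list)

    def reserve(self, slot: int, count: int) -> list[int]:
        """Reserves `count` blocks for `slot`, e.g. during insert or decode."""
        if count > len(self.free_blocks):
            raise RuntimeError("out of page blocks")
        taken = [self.free_blocks.pop() for _ in range(count)]
        self.blocks_by_slot[slot].extend(taken)
        return taken

    def free_slot(self, slot: int) -> int:
        """Returns every block held by `slot` to the pool; reports how many."""
        released = self.blocks_by_slot.pop(slot, [])
        self.free_blocks.extend(released)
        return len(released)


pool = PageBlockPool(num_blocks=4)
pool.reserve(slot=0, count=3)  # insert reserves blocks for the prompt tokens
pool.reserve(slot=0, count=1)  # decode reserves another block as tokens grow
released = pool.free_slot(0)   # request finished: give all blocks back
reused = pool.reserve(slot=1, count=4)  # the next request can use all four
```

Without the `free_slot` call, the second request would fail with "out of page blocks" even though slot 0 is no longer decoding.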

Once all engines implement this function, it will be enforced as an abstractmethod.
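The migration path could look roughly like the sketch below: the base class ships a no-op default so existing engines keep working, and engines that reserve resources override it. The class names, the `free_resource` signature, and the `run_request` helper are assumptions for illustration and may not match the code in this PR.

```python
import abc


class Engine(abc.ABC):
    """Hypothetical engine interface, not JetStream's actual class."""

    @abc.abstractmethod
    def decode_step(self, slot: int) -> bool:
        """Runs one decode step; returns True when the request is finished."""

    # Not abstract yet: a no-op default keeps engines that have not
    # implemented the hook working. Adding @abc.abstractmethod later
    # would force every engine to provide it.
    def free_resource(self, slot: int) -> None:
        return None


class PagedEngine(Engine):
    """Toy engine that reserves one page per decode step."""

    def __init__(self, steps_per_request: int = 2):
        self.steps_per_request = steps_per_request
        self.pages = {}  # slot -> number of pages reserved so far

    def decode_step(self, slot: int) -> bool:
        self.pages[slot] = self.pages.get(slot, 0) + 1
        return self.pages[slot] >= self.steps_per_request

    def free_resource(self, slot: int) -> None:
        # Drop everything reserved for the slot so it can be reused.
        self.pages.pop(slot, None)


def run_request(engine: Engine, slot: int) -> None:
    """Decodes until the request finishes, then frees the slot."""
    while not engine.decode_step(slot):
        pass
    engine.free_resource(slot)


engine = PagedEngine()
run_request(engine, slot=0)
```

Because the caller invokes `free_resource` unconditionally, making it abstract later changes nothing at the call site; it only affects which engine classes can be instantiated.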