vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Guided Decoding Schema Cache Store #8902

Open berniwal opened 11 hours ago

berniwal commented 11 hours ago

🚀 The feature, motivation and pitch

Problem

I am currently working with structured outputs and have been experimenting a little with vLLM + Outlines. Since our JSON schemas can get quite complex, generating the FSM can take around 2 minutes per schema. It would be great to have a feature where you can provide a schema store that saves the generated schemas to a local file over time and reloads them when you restart your deployment. Ideally this would be implemented as a flag in the vllm serve arguments:

https://docs.vllm.ai/en/latest/models/engine_args.html
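
As a rough illustration of the idea (not existing vLLM code; the cache directory and the build_fn hook are hypothetical), such a store could key compiled guides by a hash of the schema and persist them to disk:

```python
# Hypothetical sketch of a disk-backed schema store; none of these names exist
# in vLLM today. Compiled guides are keyed by a hash of the canonicalized schema
# so a restarted deployment can reload them instead of rebuilding the FSM.
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("/var/cache/vllm-schema-store")  # would come from a serve flag
CACHE_DIR.mkdir(parents=True, exist_ok=True)


def schema_key(schema: dict) -> str:
    # Canonicalize the schema so equivalent dicts map to the same file.
    canonical = json.dumps(schema, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def load_or_build_guide(schema: dict, build_fn):
    """Return a compiled guide, rebuilding only when no cached copy exists."""
    path = CACHE_DIR / f"{schema_key(schema)}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    guide = build_fn(schema)  # the expensive FSM construction
    with path.open("wb") as f:
        pickle.dump(guide, f)
    return guide
```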

Current Implementation

I assume that this is currently not supported and that recomputation of the schema is only avoided via the @cache() decorator shown here: (screenshot, 2024-09-27)
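
For context, a functools-style cache only memoizes within a single process, which is why the compiled FSM is lost on every restart; a minimal illustration of that pattern (not the actual vLLM code):

```python
# Minimal illustration of an in-process cache; the real decorator and function
# in vLLM's Outlines integration may differ. The memoized result lives only in
# this process's memory, so it is recomputed after a restart or in new workers.
from functools import lru_cache


@lru_cache(maxsize=128)
def compile_fsm(schema_json: str):
    # Expensive regex/FSM construction would happen here on a cache miss.
    ...
```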

Alternatives

An alternative solution would probably be to write custom Python code to handle this for my use case and use the vLLM Python functions for generation instead of the vllm serve command. However, I am not sure how you could handle this with the API deployment.
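
A sketch of that offline route, assuming the Outlines/vLLM integration point (the JSONLogitsProcessor import path and constructor arguments vary between Outlines releases, and the model name is just an example):

```python
# Sketch only: build the guided-decoding logits processor once in your own code
# and pass it to vLLM's offline API, so you control when and where it is cached.
from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor  # path/signature vary by Outlines version

schema = {"type": "object", "properties": {"name": {"type": "string"}}}

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model
processor = JSONLogitsProcessor(schema, llm)  # the slow FSM build happens here

params = SamplingParams(max_tokens=256, logits_processors=[processor])
outputs = llm.generate(["Return a JSON object with a name field."], params)
print(outputs[0].outputs[0].text)
```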

Additional context

PS: Happy to contribute to this feature if it would be useful to other people and also makes sense to those who understand the code base better.


simon-mo commented 2 hours ago

Yes, contributions are welcome. However, I believe Outlines already has a schema cache nowadays; it might be a better idea to first investigate why that didn't work, or how to get that schema cache working with a configurable path.
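
For reference, Outlines' disk cache location can, to my knowledge, be redirected with the OUTLINES_CACHE_DIR environment variable; pointing it at a persistent volume might already give the behaviour requested above (a minimal sketch, assuming that variable is honoured by your Outlines version):

```python
# Assumption: Outlines reads OUTLINES_CACHE_DIR at import time to place its
# diskcache store. Setting it to a persistent path before importing Outlines
# (or starting vLLM) should let compiled schemas survive restarts.
import os

os.environ["OUTLINES_CACHE_DIR"] = "/mnt/persistent/outlines-cache"  # example path
```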