Open adrien-lesur opened 4 months ago
Hey @adrien-lesur , at some point, we considered having the support of HuggingFace Inference Endpoints but we learned that it's not used widely.
How would you usually deploy those models? I assume https://github.com/neuralmagic/deepsparse or something.
Hi @asofter, The models would usually be deployed via vLLM like documented here for Mistral.
Is your feature request related to a problem? Please describe. My understanding of the documentation and the code is that
llm-guard
will lazy-load the models required by the chosen scanners from Huggingface. I apologize if this is incorrectThis is not ideal for consumers like Kubernetes workloads because :
llm-guard
is used as a libraryllm-guard-api
dedicated deployment with more resourcesllm-guard-api
deployment to scale too, and you face the same resource optimization issue.A third option is that you already have the models deployed somewhere in a central place so that the only information required by the scanners would be the inference URL and the authentication.
Describe the solution you'd like Users that use a platform to host and run models in a central place should be able to provide inference URLs and authentication to the scanners, instead of lazy-loading the models.
Describe alternatives you've considered The existing possible usages described by the documentation (as a library or as API).