abgulati opened this issue 2 months ago
Hi @abgulati, nice work, thanks for sharing!
Really nice to see tools like this being built open source for the community. I'd suggest leaving this as a stand-alone library which you can maintain and adapt over time according to what you find is most useful as it spans many different HF projects and features.
Feature request
It would be immensely useful to have a server application to serve up HF-Transformers and other Hub models as a service, similar to how `llama.cpp` bundles the `llama-server` application to load GGUFs into memory and expose them via an API, enabling back-and-forth inference request-responses without having to reload the model for each query.

Motivation
Serving up HF-Hub models with ease: we had some excellent LLMs released in the last few weeks, and I was naturally eager to try them the second they were available, but had to wait for `llama.cpp` to bake in proper support first.

While `llama.cpp` and GGUFs are amazing, especially for hybrid inferencing, waiting on support for models based on new tokenizers/attention mechanisms, and performing a lengthy recompilation of `llama.cpp` with every release, can be cumbersome, especially in containerized environments with GPUs.

This got me thinking that there has to be a better way! Since model creators release LLMs primarily with Transformers support in mind, if there were an easy way to serve them up via a local API in their native HF-Transformers format, especially with on-the-fly quantization, we could run them as a local inferencing service the day they're out, with at most a few pip-updates to local Python packages in most cases.
As there wasn't such a server readily available, and the general advice online was to build your own, I did, and I'm now looking to contribute it to this amazing project!
Your contribution
Presenting HF-Waitress, named after the Flask WSGI server (Waitress) it leverages!

Git repo: https://github.com/abgulati/hf-waitress

HF-Waitress enables loading HF-Transformers and AWQ-quantized models directly off the Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto for the former. It negates the need to manually download any model yourself, simply working off the model's name instead. It requires no setup, and provides concurrency and streaming responses, all from within a single, easily-portable, platform-agnostic Python script.
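For context on what "on-the-fly quantization in native HF-Transformers format" involves, here is a minimal sketch of the plain Transformers calls a server like this wraps: load a model by name from the Hub with a BitsAndBytes config, keep it in memory, and reuse it across generations. This is my own illustration (the model ID is a placeholder), not HF-Waitress's actual implementation.

```python
# Rough sketch of the underlying Transformers calls such a server wraps.
# Illustrative only - not HF-Waitress's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder model name

# Quantize on the fly while loading, analogous to --quantize bitsandbytes --quant_level int4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # analogous to --torch_device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The model stays resident; repeated calls reuse it without reloading.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Hello!", max_new_tokens=64)[0]["generated_text"])
```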
Key Features

- Uses `config.json` to store settings, allowing for easy configuration and persistence across runs.

API Endpoints
- `/completions` (POST): Generate completions for given messages.
- `/completions_stream` (POST): Stream completions for given messages.
- `/health` (GET): Check the health and get information about the loaded model.
- `/hf_config_reader_api` (POST): Read values from the configuration.
- `/hf_config_writer_api` (POST): Write values to the configuration.
- `/restart_server` (GET): Restart the LLM server.

Usage
To start the server, run:

`python hf_waitress.py [arguments]`

Launch-arguments are optional, even on the first run! See the README for defaults.
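Once the server is up (default port 9069), you can talk to it over plain HTTP. Below is a minimal client sketch against the endpoints listed above, using Python's `requests`; the JSON request/response field names are assumptions I've inferred from the endpoint descriptions, so treat them as placeholders and check the repo's README for the authoritative schema.

```python
# Minimal client sketch for a locally running HF-Waitress server.
# Field names ("messages", etc.) are assumptions - consult the repo README
# for the real request/response schema.
import requests

BASE_URL = "http://localhost:9069"  # default --port is 9069

# Check that the model is loaded and see what the server reports about it.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print(health.json())

# Request a completion for a chat-style message list.
payload = {"messages": [{"role": "user", "content": "Summarize what a WSGI server does."}]}
resp = requests.post(f"{BASE_URL}/completions", json=payload, timeout=300)
print(resp.json())
```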
Command-line Arguments

- `--model_id`: The model ID in HF-Transformers format - see below for details.
- `--access_gated`: Set to True if accessing gated models you're approved for.
- `--access_token`: Your Hugging Face Access Token.
- `--gguf`: Add this flag if attempting to load a GGUF model - for future use, not presently functional.
- `--gguf_model_id`: GGUF repository ID - for future use, not presently functional.
- `--gguf_filename`: Specific GGUF filename - for future use, not presently functional.
- `--quantize`: Quantization method ('bitsandbytes', 'quanto', 'hqq', or 'n' for none; see important details below).
- `--quant_level`: Quantization level (valid values - BitsAndBytes: int8 & int4; Quanto: int8, int4 and int2; HQQ: int8, int4, int3, int2, int1).
- `--hqq_group_size`: Specify group_size (default: 64) for HQQ quantization. No restrictions as long as weight.numel() is divisible by the group_size.
- `--push_to_hub`: Push the quantized model to the Hugging Face Hub.
- `--torch_device_map`: Specify the inference device (e.g., 'cuda', 'cpu').
- `--torch_dtype`: Specify the model tensor type.
- `--trust_remote_code`: Allow execution of custom code from the model's repository.
- `--use_flash_attention_2`: Attempt to use Flash Attention 2 - only for specific Nvidia GPUs.
- `--pipeline_task`: Specify the pipeline task (default: 'text-generation').
- `--max_new_tokens`: Maximum number of tokens to generate.
- `--return_full_text`: Return the full text including the prompt.
- `--temperature`: Set LLM temperature (0.0 to 2.0) - set do_sample to True for temps above 0.0, and False when setting temperature=0.0!
- `--do_sample`: Perform sampling when selecting response tokens - must be set to True for temps above 0.0!
- `--top_k`, `--top_p`, `--min_p`: Token selection parameters - must set do_sample to True!
- `--port`: Specify the server port (default: 9069).
- `--reset_to_defaults`: Reset all settings to default values.

Detailed documentation in the repo!
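As a companion to the client sketch above, here is one way to consume the `/completions_stream` endpoint. Again, the request body and the wire format of the streamed chunks are assumptions on my part, not the documented HF-Waitress contract; adjust the parsing to whatever the README specifies.

```python
# Sketch of consuming the streaming endpoint with requests.
# Assumes the server streams plain text chunks; adapt the parsing to the
# actual format documented in the HF-Waitress README.
import requests

BASE_URL = "http://localhost:9069"
payload = {"messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}

with requests.post(f"{BASE_URL}/completions_stream", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)
print()
```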
Since this is an independent script file in very active development, I wasn't sure how to go about contributing it as a PR. I'd be very grateful for guidance on whether and how to contribute this application to this repo, so I'm humbly requesting help here!
PS - Reddit post made by me announcing HF-Waitress: https://www.reddit.com/r/LocalLLaMA/comments/1eijqc6/the_softwarepain_of_running_local_llm_finally_got/