abgulati opened this issue 2 months ago
Hi @abgulati, nice work, thanks for sharing!
Really nice to see tools like this being built open source for the community. I'd suggest leaving this as a stand-alone library which you can maintain and adapt over time according to what you find is most useful as it spans many different HF projects and features.
Feature request
It would be immensely useful to have a server application to serve up HF-Transformers and other Hub models as a service, similar to how `llama.cpp` bundles the `llama-server` application to load GGUFs into memory and expose them via an API, enabling back-and-forth inference request-responses without having to reload the model for each query.

Motivation
Serving up HF-Hub models with ease: we had some excellent LLMs released in the last few weeks, and I was naturally eager to try them the second they were available, but had to wait for `llama.cpp` to bake in proper support first.

While `llama.cpp` and GGUFs are amazing, especially for hybrid inferencing, waiting on support for models based on new tokenizers/attention mechanisms, and performing a lengthy recompilation of `llama.cpp` with every release, can be cumbersome, especially in containerized environments with GPUs.

This got me thinking that there has to be a better way! Since model creators release LLMs primarily with Transformers support in mind, if there were an easy way to serve them up via a local API in their native HF-Transformers format, especially with on-the-fly quantization, we could run them as a local inferencing service the day they're out, with at most a few pip-updates to local Python packages in most cases.
As there wasn't such a server readily available, and the general advice online was to build your own, I did, and I'm now looking to contribute it to this amazing project!
Your contribution
Presenting HF-Waitress, named after the Flask WSGI server (Waitress) it leverages!

Git repo: https://github.com/abgulati/hf-waitress

HF-Waitress enables loading HF-Transformers and AWQ-quantized models directly off the Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto for the former. It negates the need to manually download any model yourself, simply working off the model's name instead. It requires no setup, and provides concurrency and streaming responses, all from within a single, easily-portable, platform-agnostic Python script.
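For context on what "on-the-fly quantization in native HF-Transformers format" involves, here is a minimal sketch of the plain Transformers calls a server like this wraps: load a model by name from the Hub with a BitsAndBytes config, keep it in memory, and reuse it across generations. This is my own illustration (the model ID is a placeholder), not HF-Waitress's actual implementation.

```python
# Rough sketch of the underlying Transformers calls such a server wraps.
# Illustrative only - not HF-Waitress's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder model name

# Quantize on the fly while loading, analogous to --quantize bitsandbytes --quant_level int4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # analogous to --torch_device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The model stays resident; repeated calls reuse it without reloading.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Hello!", max_new_tokens=64)[0]["generated_text"])
```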
Key Features

- Uses `config.json` to store settings, allowing for easy configuration and persistence across runs.

API Endpoints
- `/completions` (POST): Generate completions for given messages.
- `/completions_stream` (POST): Stream completions for given messages.
- `/health` (GET): Check the health and get information about the loaded model.
- `/hf_config_reader_api` (POST): Read values from the configuration.
- `/hf_config_writer_api` (POST): Write values to the configuration.
- `/restart_server` (GET): Restart the LLM server.

Usage
To start the server, run:

`python hf_waitress.py [arguments]`

Launch-arguments are optional, even on the first run! See the README for defaults.
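Once the server is up (default port 9069), you can talk to it over plain HTTP. Below is a minimal client sketch against the endpoints listed above, using Python's `requests`; the JSON request/response field names are assumptions I've inferred from the endpoint descriptions, so treat them as placeholders and check the repo's README for the authoritative schema.

```python
# Minimal client sketch for a locally running HF-Waitress server.
# Field names ("messages", etc.) are assumptions - consult the repo README
# for the real request/response schema.
import requests

BASE_URL = "http://localhost:9069"  # default --port is 9069

# Check that the model is loaded and see what the server reports about it.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print(health.json())

# Request a completion for a chat-style message list.
payload = {"messages": [{"role": "user", "content": "Summarize what a WSGI server does."}]}
resp = requests.post(f"{BASE_URL}/completions", json=payload, timeout=300)
print(resp.json())
```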
Command-line Arguments

- `--model_id`: The model ID in HF-Transformers format - see below for details.
- `--access_gated`: Set to True if accessing gated models you're approved for.
- `--access_token`: Your Hugging Face Access Token.
- `--gguf`: Add this flag if attempting to load a GGUF model - for future use, not presently functional.
- `--gguf_model_id`: GGUF repository ID - for future use, not presently functional.
- `--gguf_filename`: Specific GGUF filename - for future use, not presently functional.
- `--quantize`: Quantization method ('bitsandbytes', 'quanto', 'hqq', or 'n' for none; see important details below).
- `--quant_level`: Quantization level (valid values - BitsAndBytes: int8 & int4; Quanto: int8, int4 and int2; HQQ: int8, int4, int3, int2, int1).
- `--hqq_group_size`: Specify group_size (default: 64) for HQQ quantization. No restrictions as long as weight.numel() is divisible by the group_size.
- `--push_to_hub`: Push the quantized model to the Hugging Face Hub.
- `--torch_device_map`: Specify the inference device (e.g., 'cuda', 'cpu').
- `--torch_dtype`: Specify the model tensor type.
- `--trust_remote_code`: Allow execution of custom code from the model's repository.
- `--use_flash_attention_2`: Attempt to use Flash Attention 2 - only for specific Nvidia GPUs.
- `--pipeline_task`: Specify the pipeline task (default: 'text-generation').
- `--max_new_tokens`: Maximum number of tokens to generate.
- `--return_full_text`: Return the full text including the prompt.
- `--temperature`: Set LLM temperature (0.0 to 2.0) - set do_sample to True for temps above 0.0, and False when setting temperature=0.0!
- `--do_sample`: Perform sampling when selecting response tokens - must be set to True for temps above 0.0!
- `--top_k`, `--top_p`, `--min_p`: Token selection parameters - must set do_sample to True!
- `--port`: Specify the server port (default: 9069).
- `--reset_to_defaults`: Reset all settings to default values.

Detailed documentation in the repo!
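As a companion to the client sketch above, here is one way to consume the `/completions_stream` endpoint. Again, the request body and the wire format of the streamed chunks are assumptions on my part, not the documented HF-Waitress contract; adjust the parsing to whatever the README specifies.

```python
# Sketch of consuming the streaming endpoint with requests.
# Assumes the server streams plain text chunks; adapt the parsing to the
# actual format documented in the HF-Waitress README.
import requests

BASE_URL = "http://localhost:9069"
payload = {"messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}

with requests.post(f"{BASE_URL}/completions_stream", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)
print()
```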
Since this is an independent script file in very active development, I wasn't sure how to go about contributing it as a PR. I'd be very grateful for guidance on whether and how to contribute this application to this repo, so I'm humbly requesting help here!
PS - Reddit post made by me announcing HF-Waitress: https://www.reddit.com/r/LocalLLaMA/comments/1eijqc6/the_softwarepain_of_running_local_llm_finally_got/