huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[ I built it! ] Server application with on-the-fly quantization to serve up HF-Hub models for back-and-forth request-response inferencing #32478

Open abgulati opened 2 months ago

abgulati commented 2 months ago

Feature request

It would be immensely useful to have a server application that serves up HF-Transformers and other Hub models as a service, similar to how llama.cpp bundles the llama-server application to load GGUFs into memory and expose them via an API, enabling back-and-forth request-response inferencing without having to reload the model for each query.

Motivation

Serving up HF-Hub models with ease: some excellent LLMs were released in the last few weeks, and I was naturally eager to try them the second they were available, but had to wait for llama.cpp to bake in proper support first.

While llama.cpp and GGUFs are amazing, especially for hybrid inferencing, waiting on support for models built on new tokenizers or attention mechanisms, and performing a lengthy recompilation of llama.cpp with every release, can be cumbersome, particularly in containerized environments with GPUs.

This got me thinking that there has to be a better way! Since model creators release LLMs primarily with Transformers support in mind, if there were an easy way to serve them up via a local API in their native HF-Transformers format, especially with on-the-fly quantization, we could run them as a local inferencing service the day they're out, with at most a few pip updates to local Python packages in most cases.

As there wasn't such a server readily available, and the general advice online was to build your own, I did, and I'm looking to contribute it to this amazing project!

Your contribution

Presenting HF-Waitress, named after Waitress, the WSGI server it uses to serve its Flask app!

Git repo: https://github.com/abgulati/hf-waitress

HF-Waitress enables loading HF-Transformers and AWQ-quantized models directly off the Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto for the former. It negates the need to manually download any model yourself, working simply off the model's name instead. It requires no setup, and provides concurrency and streaming responses, all from within a single, easily portable, platform-agnostic Python script.
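
To illustrate the core pattern HF-Waitress implements (load a Hub model once, quantize it on the fly at load time, and serve it behind an HTTP API), here is a minimal, hypothetical sketch using Flask, waitress and bitsandbytes. This is not HF-Waitress's actual code; the endpoint behaviour, payload shape and port below are assumptions for illustration only:

    import torch
    from flask import Flask, jsonify, request
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_ID = "mistralai/Mistral-Nemo-Instruct-2407"  # model used in the usage example below

    # Load the model a single time at startup; every request then reuses it,
    # which is the whole point of serving rather than reloading per query.
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=quant_config, device_map="auto"
    )

    app = Flask(__name__)

    @app.route("/completions", methods=["POST"])
    def completions():
        messages = request.get_json()["messages"]  # illustrative payload shape, not the real schema
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output_ids = model.generate(input_ids, max_new_tokens=256)
        text = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
        return jsonify({"completion": text})

    if __name__ == "__main__":
        from waitress import serve  # the WSGI server the project is named after
        serve(app, host="0.0.0.0", port=8000)  # illustrative port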

Key Features

API Endpoints

  1. /completions (POST): Generate completions for given messages (an example request is sketched after this list).
  2. /completions_stream (POST): Stream completions for given messages.
  3. /health (GET): Check the health and get information about the loaded model.
  4. /hf_config_reader_api (POST): Read values from the configuration.
  5. /hf_config_writer_api (POST): Write values to the configuration.
  6. /restart_server (GET): Restart the LLM server.

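To illustrate the request/response flow against these endpoints, a client might call /completions roughly as sketched below. The port and payload shape are carried over from the hypothetical sketch above and are assumptions only; the README documents the actual request schema:

    import requests

    # Hypothetical request; field names and port are illustrative only.
    payload = {
        "messages": [
            {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}
        ]
    }

    response = requests.post("http://localhost:8000/completions", json=payload)
    response.raise_for_status()
    print(response.json())
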
Usage

To start the server, run: python hf_waitress.py [arguments]

Example:

python hf_waitress.py --model_id=mistralai/Mistral-Nemo-Instruct-2407 --quantize=quanto --quant_level=int4 --access_token=<token> --trust_remote_code --use_flash_attention_2 --do_sample

Launch arguments are optional, even on the first run! See the README for defaults.
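
For reference, the flags in the example above map onto familiar Transformers loading and generation options. A rough sketch of a plain-Transformers equivalent, assuming optimum-quanto and flash-attn are installed (again, not HF-Waitress's actual code):

    from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

    model_id = "mistralai/Mistral-Nemo-Instruct-2407"          # --model_id
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=QuantoConfig(weights="int4"),      # --quantize=quanto --quant_level=int4
        trust_remote_code=True,                                # --trust_remote_code
        attn_implementation="flash_attention_2",               # --use_flash_attention_2
        token="<token>",                                       # --access_token=<token>
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, token="<token>")

    input_ids = tokenizer("Hello!", return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids, do_sample=True, max_new_tokens=64)  # --do_sample
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))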

Command-line Arguments

Detailed documentation in the repo!

Since this is an independent script file and is in very active development, I wasn't sure how to go about contributing it as a PR. I'd be very grateful for guidance on whether and how to contribute this application to this repo, so I'm humbly requesting help here!

PS - Reddit post made by me announcing HF-Waitress: https://www.reddit.com/r/LocalLLaMA/comments/1eijqc6/the_softwarepain_of_running_local_llm_finally_got/

amyeroberts commented 2 months ago

Hi @abgulati, nice work, thanks for sharing!

Really nice to see tools like this being built open source for the community. Since it spans many different HF projects and features, I'd suggest keeping it as a stand-alone library which you can maintain and adapt over time according to what you find most useful.