philschmid commented 8 months ago

What does this PR do?

This PR adds a dedicated Hugging Face client that lets llmperf users benchmark Hugging Face models served with TGI, whether through the Inference API, Inference Endpoints, or a local deployment (or any other URL).

Below is a simple example.

Run TGI:

docker run --gpus all -ti -p 8080:80 -e MODEL_ID=HuggingFaceH4/zephyr-7b-beta ghcr.io/huggingface/text-generation-inference:latest

Run the benchmark:


export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="HuggingFaceH4/zephyr-7b-beta"

python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'
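Before starting a benchmark run, it can be useful to confirm that the TGI server behind HUGGINGFACE_API_BASE actually answers requests. The following is a minimal sketch, not part of this PR: it assumes the server exposes TGI's `/generate` route, and the helper names `build_generate_request` and `smoke_test` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20):
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url, prompt="Hello"):
    """Send one short generation request and return the generated text.

    Assumes a TGI server is reachable at base_url; raises on HTTP errors.
    """
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["generated_text"]
```

With the container from the example running, `smoke_test("http://localhost:8080")` should return a short completion; a connection error here suggests the problem is with the server rather than with llmperf.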

philschmid commented 8 months ago

cc @waleedkadous

slyt commented 4 months ago

@philschmid The README mentions HUGGINGFACE_API_KEY, but I couldn't get your fork to benchmark Llama3 on a text-generation-inference server instance without specifying HUGGINGFACE_API_TOKEN. Is there a difference between HUGGINGFACE_API_TOKEN and HUGGINGFACE_API_KEY? Should all references use one or the other?

src/llmperf/ray_clients/huggingface_client.py is using HUGGINGFACE_API_TOKEN
litellm is using HUGGINGFACE_API_KEY
The Hugging Face Hub Python library has HF_TOKEN, which supersedes the deprecated HUGGING_FACE_HUB_TOKEN
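One way to sidestep the naming mismatch would be for the client to accept any of the variable names listed above. This is only a sketch of that idea, not what the PR implements: the helper `resolve_hf_token` is hypothetical, and the lookup order is an assumption chosen for illustration.

```python
import os

# Candidate variable names, checked in this order (the order is an assumption).
TOKEN_ENV_VARS = (
    "HUGGINGFACE_API_TOKEN",
    "HUGGINGFACE_API_KEY",
    "HF_TOKEN",
    "HUGGING_FACE_HUB_TOKEN",
)


def resolve_hf_token(env=None):
    """Return the first non-empty token found in env, or None if none is set."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.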

If HUGGINGFACE_API_TOKEN is not set, benchmarking meta-llama/Meta-Llama-3-70B-Instruct fails with the error below. The tokenizer can't be pulled without the token because the Llama3 repo is gated behind an agreement acknowledgment page:

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct.
401 Client Error. (Request ID: Root=1-668c4b2e-082a7cbe6986c4514589204c;528c624d-4cfa-42f0-bd0f-d3f2e1431fbf)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-70B-Instruct is restricted. You must be authenticated to access it.
  0%|                                                               | 0/2 [00:06<?, ?it/s]

ray-project / llmperf

Add Hugging face client #42
