triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.
BSD 3-Clause "New" or "Revised" License

Incomplete installation of all genai-perf dependencies prevents it from being run on air-gapped servers #682

Open mirekphd opened 1 month ago

mirekphd commented 1 month ago

When genai-perf is installed using pip from GitHub (as documented), on first run it tries to download several files from Hugging Face, like this:

$ docker run --rm -it --name test -u 0 gpu-tritonserver-tst:latest bash -c "genai-perf --help"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [00:00<00:00, 5.45MB/s]
tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 78.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.25MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 2.72MB/s]
usage: genai-perf [-h] [--expected-output-tokens EXPECTED_OUTPUT_TOKENS] [--input-type {url,file,synthetic}] [--input-tokens-mean INPUT_TOKENS_MEAN] [--input-tokens-stddev INPUT_TOKENS_STDDEV] -m MODEL
                  [--num-of-output-prompts NUM_OF_OUTPUT_PROMPTS] [--output-format {openai_chat_completions,openai_completions,trtllm,vllm}] [--random-seed RANDOM_SEED] [--concurrency CONCURRENCY]
                  [--input-data INPUT_DATA] [-p MEASUREMENT_INTERVAL] [--profile-export-file PROFILE_EXPORT_FILE] [--request-rate REQUEST_RATE] [--service-kind {triton,openai}] [-s STABILITY_PERCENTAGE]
                  [--streaming] [-v] [--version] [--endpoint ENDPOINT] [-u URL] [--dataset {openorca,cnn_dailymail}]

CLI to profile LLMs and Generative AI models with Perf Analyzer
[..]

This "calling-home" behavior will prevent genai-perf from running correctly in corporate air-gapped environments. All required dependencies need to be collected by Python or Bash scripts at install time (which can occur on a different sever, such as an internet-connected build server(s) where most pull actions are permitted) rather than being pulled upon the first run of the program.

nv-hwoo commented 1 month ago

Hi @mirekphd, I opened ticket TMA-1955 to track this issue.

dyastremsky commented 1 week ago

Thanks for bringing up this concern. Would adding a cached-file option here work? You could then copy the tokenizer files into the container for your air-gapped environment on your end. The alternatives are worse: installing every possible private and public tokenizer into the container is infeasible, and bundling a single default tokenizer would still leave all the others uninstalled.
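For illustration, such a cached-file option could reduce to accepting a local directory wherever a Hugging Face repo ID is accepted today, since transformers already resolves local paths without any network access. A hedged sketch (the cache path is arbitrary, not an existing genai-perf flag or default):

from transformers import AutoTokenizer

# A directory holding tokenizer.json, tokenizer_config.json,
# special_tokens_map.json (and tokenizer.model for SentencePiece models)
# is loaded entirely from disk; no network request is made.
tokenizer = AutoTokenizer.from_pretrained("/opt/tokenizer-cache")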