yongfengdu opened 3 weeks ago
What is the issue you are facing? Can you please post the error log from Docker here?
I'm using `helm install` to test https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/common/tgi, with a command like this: `helm install tgi tgi --set LLM_MODEL_ID=ise-uiuc/Magicoder-S-DS-6.7B`
Error message/pod logs:
{"timestamp":"2024-08-19T05:38:39.361300Z","level":"INFO","fields":{"message":"Args {\n model_id: \"ise-uiuc/Magicoder-S-DS-6.7B\",\n revision: None,\n validation_workers: 2,\n sharded: None,\n num_shard: None,\n quantize: None,\n speculate: None,\n dtype: None,\n trust_remote_code: false,\n max_concurrent_requests: 128,\n max_best_of: 2,\n max_stop_sequences: 4,\n max_top_n_tokens: 5,\n max_input_tokens: None,\n max_input_length: None,\n max_total_tokens: None,\n waiting_served_ratio: 0.3,\n max_batch_prefill_tokens: None,\n max_batch_total_tokens: None,\n max_waiting_tokens: 20,\n max_batch_size: None,\n cuda_graphs: None,\n hostname: \"tgi-874bfcffc-c4wst\",\n port: 2080,\n shard_uds_path: \"/tmp/text-generation-server\",\n master_addr: \"localhost\",\n master_port: 29500,\n huggingface_hub_cache: Some(\n \"/data\",\n ),\n weights_cache_override: None,\n disable_custom_kernels: false,\n cuda_memory_fraction: 1.0,\n rope_scaling: None,\n rope_factor: None,\n json_output: true,\n otlp_endpoint: None,\n otlp_service_name: \"text-generation-inference.router\",\n cors_allow_origin: [],\n api_key: None,\n watermark_gamma: None,\n watermark_delta: None,\n ngrok: false,\n ngrok_authtoken: None,\n ngrok_edge: None,\n tokenizer_config_path: None,\n disable_grammar_support: false,\n env: false,\n max_client_batch_size: 4,\n lora_adapters: None,\n usage_stats: On,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361458Z","level":"INFO","fields":{"message":"Token file not found \"/tmp/.cache/huggingface/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55},"target":"hf_hub"}
{"timestamp":"2024-08-19T05:38:39.361623Z","level":"INFO","fields":{"message":"Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361636Z","level":"INFO","fields":{"message":"Default max_input_tokens to 4095"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361640Z","level":"INFO","fields":{"message":"Default max_total_tokens to 4096"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361643Z","level":"INFO","fields":{"message":"Default max_batch_prefill_tokens to 4145"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361648Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361854Z","level":"INFO","fields":{"message":"Starting check and download process for ise-uiuc/Magicoder-S-DS-6.7B"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-19T05:38:42.469115Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:43.169166Z","level":"INFO","fields":{"message":"Successfully downloaded weights for ise-uiuc/Magicoder-S-DS-6.7B"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-19T05:38:43.169575Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-19T05:38:46.051416Z","level":"WARN","fields":{"message":"FBGEMM fp8 kernels are not installed."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.070139Z","level":"INFO","fields":{"message":"Using Attention = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.070193Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.123324Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.294082Z","level":"INFO","fields":{"message":"affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47}, membind = {0}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.662238Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in"},"target":"text_generation_launcher"}

The rest of the traceback is garbled in the paste and interleaved with a `TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.` warning and a `text_generation_server.utils.import_utils` INFO line. The recoverable frames show a call to `server.serve(model_id, lora_adapters, revision, ...)` with locals `model_id = 'ise-uiuc/Magicoder-S-DS-6.7B'`, `max_input_tokens = 4095`, `dtype = None`, `quantize = None`, `sharded = False`, `trust_remote_code = False`, and `/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:274 in serve` calling `asyncio.run(serve_inner(model_id, lora_adapters, ...))`; the final exception message is cut off.
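When triaging pod logs like the ones above, it helps to filter the JSON lines by level before digging into tracebacks. A minimal sketch in Python, assuming TGI's `--json-output` line format shown here (top-level `level` key, message under `fields.message`); non-JSON spillover lines such as traceback panels are simply skipped:

```python
import json

def error_messages(log_text: str) -> list[str]:
    """Extract messages from ERROR-level JSON log lines (TGI --json-output format)."""
    msgs = []
    for line in log_text.splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. traceback spillover)
        if rec.get("level") == "ERROR":
            msgs.append(rec.get("fields", {}).get("message", ""))
    return msgs

sample = '{"timestamp":"2024-08-19T05:38:46.662238Z","level":"ERROR","fields":{"message":"Error when initializing model"},"target":"text_generation_launcher"}'
print(error_messages(sample))  # ['Error when initializing model']
```

Piping `kubectl logs <pod>` output through a filter like this makes it easier to spot the initialization error among the INFO/WARN noise.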
After updating the TGI image to `ghcr.io/huggingface/text-generation-inference:latest-intel-cpu`, the CodeGen test failed with the following two models: `ise-uiuc/Magicoder-S-DS-6.7B` and `m-a-p/OpenCodeInterpreter-DS-6.7B`.
The latter is mentioned in the README of CodeGen: https://github.com/opea-project/GenAIExamples/tree/main/CodeGen
The default model (`meta-llama/CodeLlama-7b-hf`) specified by docker-compose runs fine.
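For what it's worth, the launcher log above suggests raising the default token limits. In a compose-based setup like the one the default model uses, those flags can be passed to the TGI container command. A hedged sketch (service name, port, and volume are illustrative assumptions, not taken from the GenAIExamples compose file; the flag values come from the startup log above):

```yaml
# Sketch: pass the launcher flags suggested in the startup log to the TGI container.
services:
  tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:latest-intel-cpu
    ports:
      - "2080:80"
    volumes:
      - ./data:/data
    command: >
      --model-id ise-uiuc/Magicoder-S-DS-6.7B
      --max-batch-prefill-tokens 16434
      --max-total-tokens 16384
      --max-input-tokens 16383
```

Whether this changes the initialization failure is unclear, since the error occurs before the server starts serving; it mainly addresses the VRAM-saving defaults the launcher warns about.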