tenstorrent / tt-inference-server


Initial vLLM setup fails due to missing HuggingFace permissions #37

Open milank94 opened 2 days ago

milank94 commented 2 days ago

When following the initial setup steps from https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b#vllm-tt-metalium-llama-31-70b-inference-api, startup fails due to missing HF token permissions when downloading the config for Meta-Llama-3.1-70B.
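For reference, a quick pre-flight check (a minimal sketch, assuming the token is exported as HF_TOKEN on the host; that variable name is only illustrative) can confirm whether the token actually has access to the gated repo before launching the container:

# Sketch: verify HF token access to the gated Llama repo before starting the container.
# HF_TOKEN is a hypothetical environment variable name; adjust to match your .env setup.
import os

from huggingface_hub import hf_hub_download
from huggingface_hub.errors import GatedRepoError

try:
    path = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3.1-70B",
        filename="config.json",
        token=os.environ.get("HF_TOKEN"),
    )
    print(f"Token OK, config cached at {path}")
except GatedRepoError:
    print("Token is present but has no access to the gated repo; "
          "request access on the model page and retry.")

Running with a token that has not been granted access reproduces the same GatedRepoError seen in the container log below.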

docker run \
>   --rm \
>   -it \
>   --env-file .env \
>   --cap-add ALL \
>   --device /dev/tenstorrent:/dev/tenstorrent \
>   --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \
>   --volume ${PERSISTENT_VOLUME?ERROR env var PERSISTENT_VOLUME must be set}:/home/user/cache_root:rw \
>   --shm-size 32G \
>   --publish 7000:7000 \
>   ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-385904186f81-384f1790c3be
2024-11-15 09:07:01.845 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-11-15 09:07:01.847 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
2024-11-15 09:07:01.848 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:01.851 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_torch be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_torch be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.deallocate be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.reallocate be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.load_tensor be migrated to C++?
2024-11-15 09:07:01.853 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.dump_tensor be migrated to C++?
2024-11-15 09:07:01.853 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.as_tensor be migrated to C++?
2024-11-15 09:07:01.864 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:01.881 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:01.881 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:01.882 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
INFO 11-15 09:07:02 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 11-15 09:07:04 api_server.py:528] vLLM API server version 0.1.dev3062+g384f179
INFO 11-15 09:07:04 api_server.py:529] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key='eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0ZWFtX2lkIjoidGVuc3RvcnJlbnQiLCJ0b2tlbl9pZCI6ImRlYnVnLXRlc3QifQ._1fGZrJLARFZgqe-aZNr5dO_gb1gtzFrqm-aWcNvGOo', block_size=64, chat_template=None, code_revision=None, collect_detailed_traces=None, config_format=<ConfigFormat.AUTO: 'auto'>, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_fastapi_docs=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=False, enable_prompt_adapter=False, enforce_eager=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', log_global_stats=False, long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=131072, max_num_batched_tokens=131072, max_num_seqs=32, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], mm_processor_kwargs=None, model='meta-llama/Meta-Llama-3.1-70B', model_loader_extra_config=None, multi_step_stream_outputs=True, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=10, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=7000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, scheduling_policy='fcfs', seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, tool_parser_plugin='', trust_remote_code=False, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 11-15 09:07:04 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/652c9ad3-a61d-4e6d-ad7f-b9aa98c42c0d for IPC Path.
INFO 11-15 09:07:04 api_server.py:179] Started engine process with PID 41
Traceback (most recent call last):
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/tt-metal/python_env/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_vllm_api_server.py", line 50, in <module>
    main()
  File "run_vllm_api_server.py", line 46, in main
    runpy.run_module("vllm.entrypoints.openai.api_server", run_name="__main__")
  File "/usr/lib/python3.8/runpy.py", line 210, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/tt-metal/python_env/lib/python3.8/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/tt-metal/python_env/lib/python3.8/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
    return await self.gen.__anext__()
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
    return await self.gen.__anext__()
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 184, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 907, in create_engine_config
    model_config = self.create_model_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 843, in create_model_config
    return ModelConfig(
  File "/home/user/vllm/vllm/config.py", line 162, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 148, in get_config
    if is_gguf or file_or_path_exists(model,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 86, in file_or_path_exists
    return file_exists(model, config_name, revision=revision, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2833, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-67377fb8-30d086136440fa987c2a0f8c;382a4350-1869-4ac9-92f0-9a75a7b2e168)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-70B is restricted. You must have access to it and be authenticated to access it. Please log in.
2024-11-15 09:07:05.966 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-11-15 09:07:05.966 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_torch be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_torch be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.deallocate be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.reallocate be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.load_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.dump_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.as_tensor be migrated to C++?
2024-11-15 09:07:05.970 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
INFO 11-15 09:07:06 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/tt-metal/python_env/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/vllm/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/user/vllm/vllm/engine/multiprocessing/engine.py", line 135, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 907, in create_engine_config
    model_config = self.create_model_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 843, in create_model_config
    return ModelConfig(
  File "/home/user/vllm/vllm/config.py", line 162, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 148, in get_config
    if is_gguf or file_or_path_exists(model,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 86, in file_or_path_exists
    return file_exists(model, config_name, revision=revision, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2833, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-67377fbb-5fc35dbf6d3882202048c17d;da98cf33-f7e1-4af9-bcf8-5b925ac47dbf)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-70B is restricted. You must have access to it and be authenticated to access it. Please log in.
tstescoTT commented 2 days ago

I can add something like the HF login step (https://huggingface.co/docs/huggingface_hub/en/quick-start#login-command):

from huggingface_hub import login

login()

Previously we didn't require an HF account to run the Flask inference API server, but it's increasingly common to assume users will have one when using vLLM.
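If the interactive prompt is awkward inside the Docker container, a non-interactive variant could read the token from the environment instead (a sketch, assuming the token is passed in through the .env file as HF_TOKEN; the name is illustrative):

# Sketch: non-interactive Hugging Face login inside the container.
# HF_TOKEN is a hypothetical variable name supplied via the .env file / --env-file.
import os

from huggingface_hub import login

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
else:
    # Fall back to the interactive prompt when no token was provided.
    login()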