Neo9061 opened this issue 6 months ago
@Neo9061 did you set the HF_HUB_OFFLINE variable to True as I suggested in the first place?
@dacorvo No, I don't recall setting it. Let me try it now.
@dacorvo with HF_HUB_OFFLINE enabled, we have another error. Please see below.
2024-03-28T14:46:25.480720Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Llama-2-7b-chat-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 2048, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/opt/ml/model"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-28T14:46:25.480802Z  INFO download: text_generation_launcher: Starting download process.
2024-03-28T14:46:25.572628Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
Error: DownloadError
2024-03-28T14:46:27.183593Z ERROR download: text_generation_launcher: Download encountered an error:
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1667, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_http.py", line 78, in send
    raise OfflineModeIsEnabled(
huggingface_hub.utils._http.OfflineModeIsEnabled: Cannot reach https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1406, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 101, in download_weights
    fetch_model(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/model.py", line 87, in fetch_model
    config = AutoConfig.from_pretrained(model_id, revision=revision)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 644, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 699, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 429, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like meta-llama/Llama-2-7b-chat-hf is not the path to a directory containing a file named config.json.
A side question @dacorvo: imagine that reading from multiple caches is working. When I prepare the neuron caches, I only want the caches for that particular model, and potentially only for the configurations that I selected. Is there an easy way to prepare such a selected set of neuron caches, rather than downloading everything from aws-neuron/optimum-neuron-cache? That repository has over 60GB of caches, and I cannot include all of them in a model's model input data.
@Neo9061 I have difficulty following what you are actually trying to do: if you want to have access to neuron models offline, then just export them and pass the path to the exported model directory instead of the model_id. I don't understand why you would want to use the cache, which is a convoluted way to obtain the exact same thing. What are you trying to achieve?
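For example, something like the following (a sketch only, not from this thread; the shapes and dtype are illustrative assumptions):

```python
# Sketch only: export the model ahead of time with optimum-neuron,
# then serve the exported directory instead of the Hub model_id.
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
export_dir = "llama-2-7b-chat-neuron"  # hypothetical output path

# Compile once on a machine with network access and the Neuron SDK installed.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained(export_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(export_dir)

# export_dir can then be packaged (e.g. uploaded to S3) and passed to the DLC
# as a local path, so no Hub access is required at serving time.
```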
Let me clarify a bit more: based on this blogpost, where you can essentially configure the same model with different ENVs for sequence length, batch size, and tensor parallel degree, we want to make this work under network isolation.
Thus, when a user specifies a config via ENVs, the DLC should find the corresponding pre-compiled neuron caches and start an endpoint. To achieve that, we need to store multiple neuron caches in a local directory within the DLC, along with the model weights. And my follow-up question in the thread above is how to filter or trim the neuron cache files down to a set of pre-selected configs -- including all configs for the neuron caches is too big (60GB in aws-neuron/optimum-neuron-cache).
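One direction I could imagine (just a sketch, assuming the cache repo keeps one directory per compiled graph and that the concrete entry paths come from `optimum-cli neuron cache lookup`) would be to download only the selected entries:

```python
# Sketch only: fetch a subset of aws-neuron/optimum-neuron-cache instead of the full ~60GB.
# The MODULE_* path below is a placeholder — the real entries would come from
# `optimum-cli neuron cache lookup <model_id>` for the selected configurations.
from huggingface_hub import snapshot_download

selected_entries = [
    "neuronxcc-2.12.68.0+4480452af/MODULE_xxxxxxxxxxxxxxxx/*",  # placeholder entry
]

snapshot_download(
    repo_id="aws-neuron/optimum-neuron-cache",
    allow_patterns=selected_entries,
    local_dir="MY_LOCAL_DIR/neuron-cache",  # hypothetical local path
)
```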
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
Feature request
Under network isolation, a SageMaker endpoint will not have access to aws-neuron/optimum-neuron-cache to fetch the cache. Instead, we need to pre-download the caches and model weights and provide them as input for the SageMaker endpoint deployment. During endpoint deployment, we expect the DLC to read the neuron caches from a local directory within the DLC.
Following previous communication with @dacorvo, please find my minimal reproducible code below.
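(The original snippet is not reproduced here; the sketch below only illustrates that kind of download, assuming huggingface_hub's `snapshot_download` and a hypothetical `MY_LOCAL_DIR` path.)

```python
# Sketch only (not the original snippet): mirror the model weights and the public
# neuron cache into a local directory that will later be packaged for SageMaker.
from huggingface_hub import snapshot_download

MY_LOCAL_DIR = "/home/ec2-user/my_local_dir"  # hypothetical path

# Model weights (gated repo, so a valid HF token is needed at download time).
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir=f"{MY_LOCAL_DIR}/model",
)

# Pre-compiled neuron caches.
snapshot_download(
    repo_id="aws-neuron/optimum-neuron-cache",
    local_dir=f"{MY_LOCAL_DIR}/neuron-cache",
)
```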
This will download all the neuron caches and Llama-2 7B chat into MY_LOCAL_DIR, with the directory structure shown below. Then, by using optimum-cli neuron cache lookup meta-llama/Llama-2-7b-hf, I identified that there is a neuron cache compiled for a sequence length of 4096, batch size of 1, and 2 neuron cores, with neuronx-cc 2.12.68.0+4480452af.
Further, I uploaded MY_LOCAL_DIR into an S3 bucket, say MY_S3, and used the following SDK code to deploy an endpoint. An error occurs during deployment and is shown below. The error seems to come from trying to access the gated Llama model, with access denied. But with the model weights included in the local directory of the DLC, it shouldn't need to go to the public internet to fetch them.
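(The original SDK snippet is likewise not reproduced here; the sketch below only illustrates a typical SageMaker Python SDK deployment of the TGI NeuronX DLC — the image URI, S3 path, environment values, and instance type are assumptions.)

```python
# Sketch only: deploying the packaged model and caches with the SageMaker Python SDK.
# The image URI, S3 path, env values, and instance type are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    role=role,
    image_uri="<HF TGI NeuronX DLC image URI>",   # placeholder
    model_data="s3://MY_S3/model.tar.gz",          # packaged MY_LOCAL_DIR
    env={
        "HF_MODEL_ID": "/opt/ml/model",            # assumption: serve from the local path
        "HF_NUM_CORES": "2",
        "HF_BATCH_SIZE": "1",
        "HF_SEQUENCE_LENGTH": "4096",
        "HF_AUTO_CAST_TYPE": "fp16",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "HF_HUB_OFFLINE": "1",
    },
    enable_network_isolation=True,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)
```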
Motivation
To better support using the HF NeuronX DLC for commercial usage.
Your contribution
NA.