Neo9061 opened this issue 6 months ago
@Neo9061 did you set the HF_HUB_OFFLINE variable to True as I suggested in the first place?
@dacorvo No, I don't recall setting it. Let me try it now.
@dacorvo with HF_HUB_OFFLINE enabled, we have another error. Please see below.
2024-03-28T14:46:25.480720Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Llama-2-7b-chat-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 2048, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/opt/ml/model"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-28T14:46:25.480802Z  INFO download: text_generation_launcher: Starting download process.
2024-03-28T14:46:25.572628Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
Error: DownloadError
2024-03-28T14:46:27.183593Z ERROR download: text_generation_launcher: Download encountered an error:
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1667, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_http.py", line 78, in send
    raise OfflineModeIsEnabled(
huggingface_hub.utils._http.OfflineModeIsEnabled: Cannot reach https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1406, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 101, in download_weights
    fetch_model(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/model.py", line 87, in fetch_model
    config = AutoConfig.from_pretrained(model_id, revision=revision)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 644, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 699, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 429, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like meta-llama/Llama-2-7b-chat-hf is not the path to a directory containing a file named config.json.
A side question @dacorvo: imagine that reading from multiple caches is working. When I prepare the neuron caches, I only want the caches for that particular model, and potentially only for the configurations that I selected. Is there an easy way to prepare such a selected set of neuron caches, rather than downloading everything from aws-neuron/optimum-neuron-cache? That repository has over 60GB of caches, and I cannot include all of them in a model's model input data.
@Neo9061 I have difficulty following what you are actually trying to do: if you want to have access to neuron models offline, then just export them and pass the path to the exported model directory instead of the model_id. I don't understand why you would want to use the cache, which is a convoluted way to obtain the exact same thing. What are you trying to achieve?
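For example, something like the following (a sketch only, not from this thread; the shapes and dtype are illustrative assumptions):

```python
# Sketch only: export the model ahead of time with optimum-neuron,
# then serve the exported directory instead of the Hub model_id.
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
export_dir = "llama-2-7b-chat-neuron"  # hypothetical output path

# Compile once on a machine with network access and the Neuron SDK installed.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained(export_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(export_dir)

# export_dir can then be packaged (e.g. uploaded to S3) and passed to the DLC
# as a local path, so no Hub access is required at serving time.
```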
Let me clarify a bit more: based on this blogpost, where you can essentially configure the same model with different ENVs for sequence length, batch size, and tensor parallel degree, we want to make this work under network isolation.
Thus, when a user specifies a config via ENVs, the DLC should find the corresponding pre-compiled neuron caches and start an endpoint. To achieve that, we need to store multiple neuron caches in a local directory within the DLC, along with the model weights. And my follow-up question in the thread above is how to filter or trim the neuron cache files down to a set of pre-selected configs -- including all configs for the neuron caches is too big (60GB in aws-neuron/optimum-neuron-cache).
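One direction I could imagine (just a sketch, assuming the cache repo keeps one directory per compiled graph and that the concrete entry paths come from `optimum-cli neuron cache lookup`) would be to download only the selected entries:

```python
# Sketch only: fetch a subset of aws-neuron/optimum-neuron-cache instead of the full ~60GB.
# The MODULE_* path below is a placeholder — the real entries would come from
# `optimum-cli neuron cache lookup <model_id>` for the selected configurations.
from huggingface_hub import snapshot_download

selected_entries = [
    "neuronxcc-2.12.68.0+4480452af/MODULE_xxxxxxxxxxxxxxxx/*",  # placeholder entry
]

snapshot_download(
    repo_id="aws-neuron/optimum-neuron-cache",
    allow_patterns=selected_entries,
    local_dir="MY_LOCAL_DIR/neuron-cache",  # hypothetical local path
)
```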
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
Feature request
Under network isolation, a SageMaker endpoint will not have access to aws-neuron/optimum-neuron-cache to fetch the cache. Instead, we need to pre-download the caches and model weights and provide them as input for the SageMaker endpoint deployment. During endpoint deployment, we expect the DLC to read the neuron caches from a local directory within the DLC.
Following previous communication with @dacorvo, please find my minimal reproducible code below.
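(The original snippet is not reproduced here; the sketch below only illustrates that kind of download, assuming huggingface_hub's `snapshot_download` and a hypothetical `MY_LOCAL_DIR` path.)

```python
# Sketch only (not the original snippet): mirror the model weights and the public
# neuron cache into a local directory that will later be packaged for SageMaker.
from huggingface_hub import snapshot_download

MY_LOCAL_DIR = "/home/ec2-user/my_local_dir"  # hypothetical path

# Model weights (gated repo, so a valid HF token is needed at download time).
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir=f"{MY_LOCAL_DIR}/model",
)

# Pre-compiled neuron caches.
snapshot_download(
    repo_id="aws-neuron/optimum-neuron-cache",
    local_dir=f"{MY_LOCAL_DIR}/neuron-cache",
)
```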
This will download all the neuron caches and Llama-2 7B chat into MY_LOCAL_DIR, with the directory structure shown below. Then, by using optimum-cli neuron cache lookup meta-llama/Llama-2-7b-hf, I identified that there is a neuron cache compiled for a sequence length of 4096, batch size of 1, and 2 neuron cores, with neuronx-cc 2.12.68.0+4480452af.
Further, I uploaded MY_LOCAL_DIR into an S3 bucket, say MY_S3, and used the following SDK code to deploy an endpoint. An error occurs during deployment and is shown below. The error seems to come from trying to access the gated Llama model, with access denied. But with the model weights included in the local directory of the DLC, it shouldn't need to go to the public internet to fetch them.
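(The original SDK snippet is likewise not reproduced here; the sketch below only illustrates a typical SageMaker Python SDK deployment of the TGI NeuronX DLC — the image URI, S3 path, environment values, and instance type are assumptions.)

```python
# Sketch only: deploying the packaged model and caches with the SageMaker Python SDK.
# The image URI, S3 path, env values, and instance type are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    role=role,
    image_uri="<HF TGI NeuronX DLC image URI>",   # placeholder
    model_data="s3://MY_S3/model.tar.gz",          # packaged MY_LOCAL_DIR
    env={
        "HF_MODEL_ID": "/opt/ml/model",            # assumption: serve from the local path
        "HF_NUM_CORES": "2",
        "HF_BATCH_SIZE": "1",
        "HF_SEQUENCE_LENGTH": "4096",
        "HF_AUTO_CAST_TYPE": "fp16",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "HF_HUB_OFFLINE": "1",
    },
    enable_network_isolation=True,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)
```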
Motivation
To better support using the HF NeuronX DLC for commercial usage.
Your contribution
NA.