huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Unable to deploy Llama2 70B-chat as AWS Sagemaker endpoint - HF_API_TOKEN parameter does not authenticate #751

Closed AlexHandy1 closed 1 year ago

AlexHandy1 commented 1 year ago

System Info

AWS SageMaker Python SDK v2.163.0

Target SageMaker endpoint compute configuration settings are in the code below (instance_type ml.g5.12xlarge with 4 GPUs).

Full AWS SageMaker notebook setup code is included below under "Reproduction"; run on an ml.t2.medium notebook with the conda_pytorch_p310 kernel.


Reproduction

Code run on an AWS SageMaker notebook attempting to deploy meta-llama/Llama-2-70b-chat-hf to an AWS SageMaker endpoint using a Hugging Face LLM Inference Container. Based on the Falcon 40B deployment code outlined by @philschmid in this blog post (Note: Falcon 40B deployed successfully using this code, where no authentication is required). Access approval from Meta and Hugging Face has already been acquired and is linked to the same email/account.

# install supported sagemaker SDK
!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# set with a token that has read access [the full token was included in the version that was run]
hf_api_token = "hf_xxx"

# TGI config
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-70b-chat-hf", # model_id from hf.co/models 
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HF_API_TOKEN': json.dumps(hf_api_token), 
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

Error message in CloudWatch logs 

Error: DownloadError
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/api/models/meta-llama/Llama-2-70b-chat-hf.
Repo model meta-llama/Llama-2-70b-chat-hf is gated. You must be authenticated to access it.

Expected behavior

Running the above code on an AWS SageMaker notebook creates an AWS SageMaker endpoint which hosts the Llama 2 70B-chat model. The expectation is that the 'HF_API_TOKEN' parameter will handle the requirement to authenticate with Hugging Face to prove Llama 2 access (reference here).

Narsil commented 1 year ago

Can you try HUGGING_FACE_HUB_TOKEN as suggested here: https://github.com/huggingface/text-generation-inference#using-a-private-or-gated-model ?
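
For reference, a minimal sketch of that suggestion applied to the TGI config above (the token value is a placeholder, and note the later comments in this thread about passing it as a plain string):

# TGI config with the token passed under the variable name the container reads
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-70b-chat-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HUGGING_FACE_HUB_TOKEN': hf_api_token, # raw "hf_..." string, no json.dumps
}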

philschmid commented 1 year ago

70B is not yet supported with 0.8.2 as well. The new version should be available soon.
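
Once the new container ships, picking it up should only require a version bump in the image lookup (a sketch; the exact first version with Llama 2 70B support is an assumption here, so check the release notes):

# bump the LLM container version once a Llama-2-capable release is out
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"  # assumed placeholder version, not confirmed in this thread
)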

AlexHandy1 commented 1 year ago

Thanks @Narsil @philschmid.

So I tried the HUGGING_FACE_HUB_TOKEN parameter and still get the same error for both the 70B and 13B models. My assumption from your comment @philschmid is that this won't be resolved until a container version newer than 0.8.2 adds Llama 2 support? Or is there something else I can try in the meantime? I'm surprised I'd still get the authentication-specific error.

philschmid commented 1 year ago

13B should work. Can you please share the code you are using?

AlexHandy1 commented 1 year ago

Here you go @philschmid.

Same as above, but with the HUGGING_FACE_HUB_TOKEN parameter change @Narsil suggested and meta-llama/Llama-2-13b-hf as the HF model.

# install supported sagemaker SDK
!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# set with a token that has read access [the full token was included in the version that was run]
hf_api_token = "hf_xxx"
hf_model = "meta-llama/Llama-2-13b-hf"

# TGI config
config = {
  'HF_MODEL_ID': hf_model, # model_id from hf.co/models 
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token), 
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

philschmid commented 1 year ago

And the token you provided has access to the Llama models, and you have accepted the terms on the model card?

AlexHandy1 commented 1 year ago

Yes (see attached screenshot; same for 13B too). I've tried both "read" and "write" tokens as well, same result.

bcarsley commented 1 year ago

Yeah, just want to second this issue -- I have a license for Llama use and have fine-tuned Llama multiple times via Google Colab and other training frameworks, but I tried the SageMaker deployment tutorial posted by @philschmid (https://www.philschmid.de/sagemaker-llama-llm) and ran into the same authentication issue, even after providing my standard HF write token... What's weirder is that mine won't throw an error in the notebook; the endpoints deploy and then just end up as failed deployments after about 40 minutes of waiting around on the CloudWatch and SageMaker dashboards.

bcarsley commented 1 year ago

I have tried with 7B, 13B, and 70B (all chat-hf)... I can provide a code snippet, but my code is identical to the methods outlined in the article above besides the credentials used.

philschmid commented 1 year ago

Can you share the logs of your deployments?

bcarsley commented 1 year ago

Of course! Thanks for the fast reply :) (screenshot of the deployment logs attached)

bcarsley commented 1 year ago

...may have figured the issue out...testing a new deployment right now -- will report back if I figure it out 🖖

bcarsley commented 1 year ago

Got it! Remove the json.dumps() wrapper from the line 'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token) ...

So stupid, but that wraps the token in literal quotes (the '"string"' effect) and makes it invalid.
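
Concretely, the fix is a one-line change to the config (a sketch of the before/after):

# broken: json.dumps() serializes the string WITH quotes, so the env var
# literally contains "hf_xxx" (quotes included) and authentication fails
'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token),

# fixed: pass the raw token string through unchanged
'HUGGING_FACE_HUB_TOKEN': hf_api_token,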

bcarsley commented 1 year ago

Let me know if that solves it for you @AlexHandy1 ... @philschmid your article was totally solid; there was no reason to add the json.dumps() call (which I also caught myself doing for consistency's sake, but it actually breaks things). I would print your token or do an assert check like this: assert config['HUGGING_FACE_HUB_TOKEN'] != "", "Please set your Hugging Face Hub token"
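
A slightly stronger pre-deploy check would also catch a token that was accidentally JSON-quoted (a hypothetical sketch, relying only on Hugging Face tokens starting with "hf_"):

token = config['HUGGING_FACE_HUB_TOKEN']
# rejects an empty token and one wrapped in literal quotes by json.dumps()
assert token.startswith("hf_"), "Please set a raw Hugging Face Hub token (no json.dumps)"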

philschmid commented 1 year ago

@bcarsley the json.dumps is only needed for numbers, to "stringify" them, since those values are passed as CLI args and raw numbers were not working/caused issues.
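
The difference is easy to see in a REPL (sketch):

import json
json.dumps(4)         # -> '4'        : a number becomes a clean, CLI-safe string
json.dumps("hf_xxx")  # -> '"hf_xxx"' : a string gains literal quotes, corrupting the token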

bcarsley commented 1 year ago

@philschmid yes, I think the json.dumps() around the HF token (i.e. calling json.dumps() on a string) in @AlexHandy1's original code snippet was causing the token to be read incorrectly by the SageMaker deployment… I noticed a similar issue in my code, and changing the line to just a plain string without json.dumps() made the deployment work!

AlexHandy1 commented 1 year ago

This works for me too! Thanks @bcarsley @philschmid