huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Unable to deploy Llama2 70B-chat as AWS Sagemaker endpoint - HF_API_TOKEN parameter does not authenticate #751

Closed AlexHandy1 closed 1 year ago

AlexHandy1 commented 1 year ago

System Info

AWS SageMaker Python SDK v2.163.0

Target SageMaker endpoint compute configuration settings are in the code below (instance_type ml.g5.12xlarge with 4 GPUs).

Full AWS SageMaker notebook setup code is included below under "Reproduction"; run on an ml.t2.medium notebook with the conda_pytorch_p310 kernel.


Reproduction

Code run on an AWS SageMaker notebook attempting to deploy meta-llama/Llama-2-70b-chat-hf to an AWS SageMaker endpoint using a Hugging Face LLM Inference Container. Based on the Falcon 40B deployment code outlined by @philschmid in this blog post (Note: Falcon 40B deployed successfully using this code, where no authentication is required). Access approval from Meta and Hugging Face has already been acquired and is linked to the same email/account.

# install supported sagemaker SDK
!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# set with a token that has read access [the full token was included in the version that was run]
hf_api_token = "hf_xxx"

# TGI config
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-70b-chat-hf", # model_id from hf.co/models 
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HF_API_TOKEN': json.dumps(hf_api_token), 
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

Error message in CloudWatch logs 

Error: DownloadError
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/api/models/meta-llama/Llama-2-70b-chat-hf.
Repo model meta-llama/Llama-2-70b-chat-hf is gated. You must be authenticated to access it.

Expected behavior

Running the above code on an AWS SageMaker notebook creates an AWS SageMaker endpoint which hosts the Llama 2 70B-chat model. The expectation is that the 'HF_API_TOKEN' parameter will handle the requirement to authenticate with Hugging Face to prove Llama 2 access (reference here).

Narsil commented 1 year ago

Can you try HUGGING_FACE_HUB_TOKEN as suggested here: https://github.com/huggingface/text-generation-inference#using-a-private-or-gated-model ?
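
For reference, a minimal sketch of that suggestion applied to the TGI config above (the token value is a placeholder, and note the later comments in this thread about passing it as a plain string):

# TGI config with the token passed under the variable name the container reads
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-70b-chat-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HUGGING_FACE_HUB_TOKEN': hf_api_token, # raw "hf_..." string, no json.dumps
}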

philschmid commented 1 year ago

70B is not yet supported with 0.8.2 as well. The new version should be available soon.
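
Once the new container ships, picking it up should only require a version bump in the image lookup (a sketch; the exact first version with Llama 2 70B support is an assumption here, so check the release notes):

# bump the LLM container version once a Llama-2-capable release is out
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"  # assumed placeholder version, not confirmed in this thread
)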

AlexHandy1 commented 1 year ago

Thanks @Narsil @philschmid.

So I tried the HUGGING_FACE_HUB_TOKEN parameter and still get the same error for both the 70B and 13B models. My assumption from your comment @philschmid is that this won't be resolved until a container version newer than 0.8.2 adds Llama 2 support? Or is there something else I can try in the meantime? I'm surprised I'd still get the authentication-specific error.

philschmid commented 1 year ago

13B should work. Can you please share the code you are using?

AlexHandy1 commented 1 year ago

Here you go @philschmid.

Same as above, but with the HUGGING_FACE_HUB_TOKEN parameter change @Narsil suggested and meta-llama/Llama-2-13b-hf as the HF model.

# install supported sagemaker SDK
!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# set with a token that has read access [the full token was included in the version that was run]
hf_api_token = "hf_xxx"
hf_model = "meta-llama/Llama-2-13b-hf"

# TGI config
config = {
  'HF_MODEL_ID': hf_model, # model_id from hf.co/models 
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token), 
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

philschmid commented 1 year ago

And the token you provided has access to the Llama models, and you have accepted the terms on the model card?

AlexHandy1 commented 1 year ago

Yes (see attached screenshot; same for 13B too). I've tried both "read" and "write" tokens as well, same result.

bcarsley commented 1 year ago

Yeah, just want to second this issue -- I have a license for Llama use and have fine-tuned Llama multiple times via Google Colab and other training frameworks, but I tried the SageMaker deployment tutorial posted by @philschmid (https://www.philschmid.de/sagemaker-llama-llm) and ran into the same authentication issue, even after providing my standard HF write token... What's weirder is that mine won't throw an error in the notebook; the endpoints deploy and then just end up as failed deployments after about 40 minutes of waiting around on the CloudWatch and SageMaker dashboards.

bcarsley commented 1 year ago

I have tried with 7B, 13B, and 70B (all chat-hf)... I can provide a code snippet, but my code is identical to the methods outlined in the article above besides the credentials used.

philschmid commented 1 year ago

Can you share the logs of your deployments?

bcarsley commented 1 year ago

Of course! Thanks for the fast reply :) (screenshot of the deployment logs attached)

bcarsley commented 1 year ago

...may have figured the issue out...testing a new deployment right now -- will report back if I figure it out 🖖

bcarsley commented 1 year ago

Got it! Remove the json.dumps() wrapper from the line 'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token) ...

So stupid, but that wraps the token in literal quotes (the '"string"' effect) and makes it invalid.
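
Concretely, the fix is a one-line change to the config (a sketch of the before/after):

# broken: json.dumps() serializes the string WITH quotes, so the env var
# literally contains "hf_xxx" (quotes included) and authentication fails
'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token),

# fixed: pass the raw token string through unchanged
'HUGGING_FACE_HUB_TOKEN': hf_api_token,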

bcarsley commented 1 year ago

Let me know if that solves it for you @AlexHandy1 ... @philschmid your article was totally solid; there was no reason to add the json.dumps() call (which I also caught myself doing for consistency's sake, but it actually breaks things). I would print your token or do an assert check like this: assert config['HUGGING_FACE_HUB_TOKEN'] != "", "Please set your Hugging Face Hub token"
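
A slightly stronger pre-deploy check would also catch a token that was accidentally JSON-quoted (a hypothetical sketch, relying only on Hugging Face tokens starting with "hf_"):

token = config['HUGGING_FACE_HUB_TOKEN']
# rejects an empty token and one wrapped in literal quotes by json.dumps()
assert token.startswith("hf_"), "Please set a raw Hugging Face Hub token (no json.dumps)"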

philschmid commented 1 year ago

@bcarsley the json.dumps is only needed for numbers, to "stringify" them, since those values are passed as CLI args and raw numbers were not working/caused issues.
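
The difference is easy to see in a REPL (sketch):

import json
json.dumps(4)         # -> '4'        : a number becomes a clean, CLI-safe string
json.dumps("hf_xxx")  # -> '"hf_xxx"' : a string gains literal quotes, corrupting the token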

bcarsley commented 1 year ago

@philschmid yes, I think the json.dumps() around the HF token (i.e. calling json.dumps() on a string) in @AlexHandy1's original code snippet was causing the token to be read incorrectly by the SageMaker deployment… I noticed a similar issue in my code, and changing the line to just a plain string without json.dumps() made the deployment work!

AlexHandy1 commented 1 year ago

This works for me too! Thanks @bcarsley @philschmid