Closed: AlexHandy1 closed this issue 1 year ago
Can you try HUGGING_FACE_HUB_TOKEN as suggested here: https://github.com/huggingface/text-generation-inference#using-a-private-or-gated-model ?
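For reference, a minimal sketch of what that suggestion amounts to: the token goes into the container environment as an env var alongside the model id (the model id and token values below are placeholders):

```python
# Minimal container environment for a gated model (values are placeholders)
config = {
    "HF_MODEL_ID": "meta-llama/Llama-2-13b-hf",  # gated model id from hf.co/models
    "HUGGING_FACE_HUB_TOKEN": "hf_xxx",          # token with at least read access
}
```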
70B is also not yet supported with 0.8.2. The new version should be available soon.
Thanks @Narsil @philschmid.
So I tried the HUGGING_FACE_HUB_TOKEN parameter and still get the same error for both the 70B and 13B models. My assumption from your comment @philschmid is that this won't be resolved until a release after 0.8.2 adds Llama 2 support? Or is there something else I can try in the meantime? I'm surprised I'd still get the "authentication"-specific error.
13B should work. Can you please share the code you are using?
Here you go @philschmid.
Same as above, but with the HUGGING_FACE_HUB_TOKEN parameter change @Narsil suggested and meta-llama/Llama-2-13b-hf as the HF model.
# install supported sagemaker SDK
!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# token set with read access [include the full token in the version that runs]
hf_api_token = "hf_xxx"
hf_model = "meta-llama/Llama-2-13b-hf"

# TGI config
config = {
    'HF_MODEL_ID': hf_model,                    # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),   # number of GPUs used per replica
    'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token),
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
And the token you provide has access to the Llama models, and you accepted the terms on the model card?
yes (see screenshot; same for 13B too). I've tried both "read" and "write" versions of the token, same result.
Yeah, just want to second this issue -- I have a license for Llama use and have fine-tuned Llama multiple times via Google Colab and other training frameworks, but I tried the SageMaker deployment tutorial posted by @philschmid (https://www.philschmid.de/sagemaker-llama-llm) and ran into the same authentication issue, even after providing my standard HF write token. What's weirder is mine won't throw an error in the notebook; the deployments just fail after about 40 minutes of waiting, per CloudWatch and the SageMaker dashboard.
I have tried with 7b, 13b, and 70b (all chat-hf). I can provide a code snippet, but my setup is identical to the methods outlined in the article above besides the credentials used.
Can you share the logs of your deployments?
of course! thanks for the fast reply :)
...may have figured the issue out...testing a new deployment right now -- will report back if I figure it out 🖖
got it! Remove the json.dumps() wrapper from the line 'HUGGING_FACE_HUB_TOKEN': json.dumps(hf_api_token) ...
so stupid, but that creates the '"string"' effect and makes the token invalid
let me know if that solves it for you @AlexHandy1 ... @philschmid your article was totally solid, there was no reason to add the json.dumps() call (which I also caught myself doing for consistency's sake, but that actually messes it all up) ... I would print your token or do an assert check like this:
assert config['HUGGING_FACE_HUB_TOKEN'] == "hf_xxx"  # your actual raw token value
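To make the fix concrete, here is a sketch of the corrected TGI config (the token value is a placeholder): the token stays a plain string, while the GPU count still goes through json.dumps since it is a number.

```python
import json

number_of_gpu = 4
hf_api_token = "hf_xxx"  # placeholder -- use your real read token

config = {
    'HF_MODEL_ID': 'meta-llama/Llama-2-13b-hf',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),   # numbers still need stringifying
    'HUGGING_FACE_HUB_TOKEN': hf_api_token,     # plain string, no json.dumps()
}

# the broken version wrapped the token in literal double quotes:
assert json.dumps(hf_api_token) == '"hf_xxx"'
assert config['HUGGING_FACE_HUB_TOKEN'] == 'hf_xxx'
```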
@bcarsley the json.dumps is only needed for numbers, to "stringify" them, since those args are passed as CLI args and raw numbers caused issues.
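The difference is easy to see in isolation: json.dumps on a number yields the bare digits, while on a string it embeds literal double quotes, which is exactly what corrupted the token.

```python
import json

# A number serializes to a plain digit string -- safe to pass as a CLI arg.
assert json.dumps(4) == "4"

# A string gains embedded double quotes, so the container sees '"hf_xxx"'
# instead of the raw token 'hf_xxx'.
assert json.dumps("hf_xxx") == '"hf_xxx"'
assert json.dumps("hf_xxx") != "hf_xxx"
```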
@philschmid yes, I think the json.dumps() around the HF token (i.e. doing json.dumps(str)) in @AlexHandy1's original code snippet is what causes the token to be read incorrectly by the SageMaker deployment. I noticed a similar issue in my code, and changing the line to a plain string without json.dumps() made the deployment work!
This works for me too! Thanks @bcarsley @philschmid
System Info
AWS SageMaker SDK v2.163.0
Target Sagemaker endpoint compute configuration settings
Full AWS SageMaker notebook setup code is included below under "Reproduction"; run on an ml.t2.medium notebook with the conda_pytorch_p310 kernel.
Reproduction
Code run on an AWS SageMaker notebook, attempting to deploy meta-llama/Llama-2-70b-chat-hf to an AWS SageMaker endpoint using the Hugging Face LLM Inference Container. Based on the Falcon 40B deployment code outlined by @philschmid in this blog post (note: Falcon 40B deployed successfully with this code, where no authentication was required). Access approval from Meta and Hugging Face has already been acquired and linked to the same email/account.
Error message in CloudWatch logs
Expected behavior
Running the above code on an AWS SageMaker notebook creates an AWS SageMaker endpoint that hosts the Llama 2 70b-chat model. The expectation is that the 'HF_API_TOKEN' parameter will handle the requirement to authenticate with Hugging Face to prove Llama 2 access (reference here).
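If the deployment does succeed, a request would go through the returned predictor. A sketch of the payload shape the TGI container expects (the prompt and generation parameters below are illustrative, and the predict call itself needs a live endpoint):

```python
import json

# Payload shape for the TGI container: "inputs" plus optional "parameters".
payload = {
    "inputs": "What is Amazon SageMaker?",   # hypothetical prompt
    "parameters": {"max_new_tokens": 128},
}

# With a deployed predictor this would be:
# response = llm.predict(payload)

# Serializing the payload shows what the endpoint actually receives.
body = json.dumps(payload)
```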