philschmid / llm-sagemaker-sample

Apache License 2.0

Can not redeploy the model #2

Closed abhimasand closed 11 months ago

abhimasand commented 11 months ago

Hi @philschmid,

Thanks for making this repo, it was a huge help! I successfully trained the model and deployed it to a SageMaker endpoint. However, after I deleted the endpoint once I was done with it, I could not recreate it.

For context, I manually retrieved the S3 URL of my model and used it as the model S3 path.

I am unable to figure out why I am not able to deploy the model even though the s3 path is pointing to the correct location and my role has all the required permissions.

I get the following error:

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not access model data at /huggingface-qlora-mistralai-Mistral-7B--2023-10-06-11-27-09-016/output/model/. Please ensure that the role "" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in eu-west-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission.

Truly would appreciate your help!

philschmid commented 11 months ago

Seems like you are missing permissions.

abhimasand commented 11 months ago

I am using an admin role, and I have checked that it has all the required S3 and SageMaker permissions. However, I will double-check that.

The part I am confused about is how it deployed the first time, right after training. If it were a permission issue, it shouldn't have deployed then either.

philschmid commented 11 months ago

Your error says role "" exists, with an empty string. Maybe it's not being passed correctly.

abhimasand commented 11 months ago

I apologize for the confusion. I had omitted some details from the model uri before posting.

I have solved the problem now. I realized that there was a confusing cell in the notebook.

There was a cell that had this code:

huggingface_estimator.model_data["S3DataSource"]["S3Uri"].replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

But this code does not change the original value of huggingface_estimator.model_data["S3DataSource"]["S3Uri"]. It only returns a new string. I was wrong to think that this code would modify the value.
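To illustrate the point, here is a minimal sketch that uses a plain dict as a stand-in for huggingface_estimator.model_data. The bucket name and prefix are made-up placeholders, not the values from my training job.

```python
# Stand-in for huggingface_estimator.model_data; bucket and prefix are hypothetical.
model_data = {"S3DataSource": {"S3Uri": "s3://example-bucket/example-job/output/model/"}}

s3_uri = model_data["S3DataSource"]["S3Uri"]

# Python strings are immutable: str.replace returns a new string
# and leaves the original value untouched.
console_url = s3_uri.replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

print(console_url)                          # browser link, only useful for viewing the files
print(model_data["S3DataSource"]["S3Uri"])  # still the original s3:// URI
```

So the notebook cell only produces a console link for viewing the artifacts in a browser; the value stored on the estimator still starts with "s3://".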

The fix was to use the S3 URI, with its "s3://" prefix, as the model_s3_path instead of the console URL.
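In case it helps anyone else, below is a rough sketch of a guard that could be run before deploying. The helper name to_s3_uri is hypothetical (it is not part of the notebook or the SageMaker SDK), and it assumes the console URL was built exactly the way the notebook cell builds it, without any query string.

```python
# Prefix the notebook cell swaps in for "s3://".
CONSOLE_PREFIX = "https://s3.console.aws.amazon.com/s3/buckets/"


def to_s3_uri(path: str) -> str:
    """Return an s3:// URI, undoing the notebook's console-URL replacement if needed.

    Hypothetical helper: it only reverses the exact replacement shown in the
    notebook cell and rejects anything else.
    """
    if path.startswith("s3://"):
        return path
    if path.startswith(CONSOLE_PREFIX):
        return "s3://" + path[len(CONSOLE_PREFIX):]
    raise ValueError(f"Expected an s3:// URI, got: {path!r}")


# Placeholder bucket and prefix, for illustration only.
model_s3_path = to_s3_uri(CONSOLE_PREFIX + "example-bucket/example-job/output/model/")
print(model_s3_path)
```

Failing early with a ValueError like this is easier to diagnose than the ValidationException returned by CreateModel.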