philschmid / amazon-sagemaker-flan-t5-xxl

Example of how to deploy FLAN-T5-XXL on Amazon SageMaker
MIT License

Disk issues when deploying to SageMaker #1

Open rsilveira79 opened 1 year ago

rsilveira79 commented 1 year ago

Hi Philipp,

First, I'd like to thank you for the great posts you've been doing about FLAN-T5, HF, and DeepSpeed; they are highly appreciated 🤗❤️.

I've downloaded the sharded weights from your repo (philschmid/flan-t5-xxl-sharded-fp16), and I'm getting an "OSError: [Errno 28] No space left on device" error when deploying to an ml.g5.8xlarge instance. I changed the code a little bit to try to attach some more SSD storage (but it looks like g5 instances don't accept that in AWS, or I'm wrong).

Here is what my code looks like:

Definitions

INSTANCE_GPU = 'ml.g5.8xlarge'  
MODEL_PATH_XXL = 's3://[[my-s3-bucket]]/flan-t5-xxl/model.tar.gz'

Model Definition

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data=MODEL_PATH_XXL,
    role=sage_role,
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",       # pytorch version used
    py_version="py36",           # python version of the DLC
    vpc_config=VPC_CONFIGS,
    name=model_name,
)
container = huggingface_model.prepare_container_def(instance_type=INSTANCE_GPU)
container = huggingface_model.prepare_container_def(instance_type=INSTANCE_GPU)

Endpoint Configuration Definition

endpoint_config = sagemaker_session.create_endpoint_config(
    name=endpoint_config_name,
    model_name=model_name,
    initial_instance_count=1,
    instance_type=INSTANCE_GPU,
    # volume_size=root_volume_size
)
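The variant I tried with an explicit volume size looked roughly like this (assuming a recent sagemaker SDK that accepts volume_size in this call; the value is just an example):

endpoint_config = sagemaker_session.create_endpoint_config(
    name=endpoint_config_name,
    model_name=model_name,
    initial_instance_count=1,
    instance_type=INSTANCE_GPU,
    # Extra EBS storage in GB; g5 instances appear to reject this because they
    # ship with local NVMe instance storage rather than a resizable EBS volume.
    volume_size=root_volume_size,  # e.g. 256
)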

Endpoint Definition

endpoint = sagemaker_session.create_endpoint(
    endpoint_name=ENDPOINT_NAME,
    config_name=endpoint_config_name
)

Any suggestions will be greatly appreciated!

PS: I was able to deploy FLAN-T5-Large with the same code, but with the XXL model I get these errors.

philschmid commented 1 year ago

@rsilveira79 have you tried following the example https://github.com/philschmid/amazon-sagemaker-flan-t5-xxl/blob/main/sagemaker-notebook.ipynb step by step? Does the example work for you?
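The notebook uses the SDK's high-level deploy() call instead of creating the endpoint config by hand; a rough sketch of that path (the DLC versions and the prediction input are illustrative, not necessarily the notebook's exact values) looks like:

from sagemaker.huggingface import HuggingFaceModel

# Point the model at the packaged weights in S3 and pick a recent HF DLC.
huggingface_model = HuggingFaceModel(
    model_data="s3://[[my-s3-bucket]]/flan-t5-xxl/model.tar.gz",
    role=sage_role,
    transformers_version="4.26",  # illustrative DLC versions
    pytorch_version="1.13",
    py_version="py39",
)

# deploy() creates the model, endpoint config, and endpoint in one call.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
)

result = predictor.predict({"inputs": "Translate to German: My name is Roberto."})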

rsilveira79 commented 1 year ago

Hi @philschmid, I took the weights from the HF Hub and deployed them on SageMaker. I was assuming the weights in the Hub were quantized; maybe I missed a step?
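For what it's worth, the repo name suggests the checkpoint is sharded fp16 rather than quantized; I assume it was produced with something like this (not your exact script, and the shard size is illustrative):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the checkpoint in half precision and re-save it in small shards so the
# archive can be extracted piece by piece on the endpoint.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")

model.save_pretrained("model", max_shard_size="2GB")
tokenizer.save_pretrained("model")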

rsilveira79 commented 1 year ago

Hi @philschmid, I tried the same code as shown here, and I'm still getting:

ERROR - Failed to save the model-archive to model-path "/.sagemaker/mms/models". Check the file permissions and retry.

and also

OSError: [Errno 28] No space left on device: '/opt/ml/model/pytorch_model-00003-of-00012.bin' -> '/.sagemaker/mms/models/model/pytorch_model-00003-of-00012.bin'

Are you doing something special to add NVMe storage to your g5 GPU instances?

Thanks, Roberto

philschmid commented 1 year ago

Did you just execute the notebook, or did you make any changes?

rsilveira79 commented 1 year ago

I made a very small change to copy the model to my own S3 bucket because I was getting a credentials error message.

Basically, I copy the model like this:

!aws s3 cp model.tar.gz s3://[my-bucket-folder]/flan-t5-xxl/