Open rsilveira79 opened 1 year ago
@rsilveira79 have you tried following the example https://github.com/philschmid/amazon-sagemaker-flan-t5-xxl/blob/main/sagemaker-notebook.ipynb step by step? Does the example work for you?
Hi @philschmid, I took the weights from the HF Hub and deployed them on SageMaker. I was assuming the weights on the Hub were quantized; maybe I missed this step?
Hi @philschmid, I tried the same code as here, and I'm still getting:
ERROR - Failed to save the model-archive to model-path "/.sagemaker/mms/models". Check the file permissions and retry.
and also
OSError: [Errno 28] No space left on device: '/opt/ml/model/pytorch_model-00003-of-00012.bin' -> '/.sagemaker/mms/models/model/pytorch_model-00003-of-00012.bin'
Are you doing something special to add NVMe storage to your g5 GPU instances?
Thanks, Roberto
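The traceback above shows the model server copying shards from /opt/ml/model to /.sagemaker/mms/models, so the destination needs enough free space for a second copy of the model. Below is a minimal sketch of how one might check that before the copy happens, for example from a shell on the container. The helper names `dir_size_bytes` and `check_room_for_copy` are hypothetical and only for illustration; the two paths are taken from the error message.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def check_room_for_copy(src_dir, dst_dir):
    """Return (needed_bytes, free_bytes, fits) for copying src_dir onto dst_dir's filesystem.

    `dst_dir` must already exist, because shutil.disk_usage inspects the
    filesystem that contains it.
    """
    needed = dir_size_bytes(src_dir)
    free = shutil.disk_usage(dst_dir).free
    return needed, free, free >= needed


if __name__ == "__main__":
    # Paths taken from the error message in this thread; they only exist
    # inside the inference container, so the check is skipped elsewhere.
    src, dst = "/opt/ml/model", "/.sagemaker/mms/models"
    if os.path.isdir(src) and os.path.isdir(dst):
        needed, free, fits = check_room_for_copy(src, dst)
        print(f"needed={needed / 1e9:.1f} GB free={free / 1e9:.1f} GB fits={fits}")
```

If `fits` is false, the "No space left on device" error is expected regardless of file permissions.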
Did you just execute the notebook, or did you make any changes?
I made a very small change to copy the model to S3 because I was getting a credentials error message.
Basically, I copy the model like this:
!aws s3 cp model.tar.gz s3://[my-bucket-folder]/flan-t5-xxl/
Hi Philip
First, I would like to thank you a lot for the great posts you are writing about Flan T5, HF and DeepSpeed; they are highly appreciated 🤗❤️.
I've downloaded the sharded weights from your repo (philschmid/flan-t5-xxl-sharded-fp16), and I'm getting an "OSError: [Errno 28] No space left on device" error when deploying to an ml.g5.8xlarge instance. I changed the code a little bit to be able to attach some more SSD storage (but it looks like g5 instances don't accept that in AWS, or I'm wrong).
Here is what my code looks like:
Definitions
Model Definition
Endpoint Configuration Definition
Endpoint Definition
Any suggestions will be greatly appreciated!
PS: I was able to deploy Flan T5 - Large with the same code, but with the XXL model I got these errors.
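For readers following along, here is a minimal sketch of what an endpoint configuration requesting extra storage could look like. It is not the poster's code. The field names (`VolumeSizeInGB`, `ModelDataDownloadTimeoutInSeconds`) are real `ProductionVariant` fields of the SageMaker `CreateEndpointConfig` API, while the helper `build_endpoint_config` and the resource names are hypothetical. As the post reports, g5 instances may not accept a volume size, so that field is optional here.

```python
def build_endpoint_config(config_name, model_name, instance_type,
                          volume_size_gb=None, download_timeout_s=None):
    """Build keyword arguments for the SageMaker create_endpoint_config call.

    Optional fields are only included when set, so the same helper works for
    instance types that reject VolumeSizeInGB.
    """
    variant = {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
    }
    if volume_size_gb is not None:
        # Size of the storage volume attached to each instance, in GB.
        variant["VolumeSizeInGB"] = volume_size_gb
    if download_timeout_s is not None:
        # Extra time for downloading and extracting a large model archive.
        variant["ModelDataDownloadTimeoutInSeconds"] = download_timeout_s
    return {"EndpointConfigName": config_name, "ProductionVariants": [variant]}


# Hypothetical names, for illustration only.
kwargs = build_endpoint_config(
    "flan-t5-xxl-config", "flan-t5-xxl-model", "ml.g5.8xlarge",
    download_timeout_s=1200,
)
# With credentials configured, this would be passed on as:
#   boto3.client("sagemaker").create_endpoint_config(**kwargs)
```

Building the arguments separately from the API call makes it easy to try the same configuration with and without the volume setting.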