unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Deploying llama3.1 8b instruct to sagemaker model endpoints #865

Open mleiter696 opened 3 months ago

mleiter696 commented 3 months ago
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g4dn.xlarge"
number_of_gpu = 1
health_check_timeout = 300

config = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPU used per replica
    "MAX_INPUT_LENGTH": "4096",  # Max length of input text
    "MAX_TOTAL_TOKENS": "8192",  # Max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # Limits the number of tokens that can be processed in parallel during the generation
}

# create HuggingFaceModel with the image uri
# (role, llm_image, and sess are defined earlier in the notebook)
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=sess,
    transformers_version="4.43.3",
    pytorch_version="2.3.1",  # the SDK accepts only one framework version, so tensorflow_version is omitted
)
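The deploy call itself isn't shown above; presumably it uses the instance_type and health_check_timeout defined in the config, along the lines of the standard SageMaker pattern:

# Sketch of the deploy step implied by the config above (not in the original report)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,  # "ml.g4dn.xlarge" (NVIDIA T4)
    container_startup_health_check_timeout=health_check_timeout,
)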

I am trying to deploy unsloth models to SageMaker model endpoints on a T4 GPU (very important, as this is the best GPU I can use).

I matched the transformers and tensorflow/pytorch versions to those in the Colab notebook.

When I try to deploy this model, I get the following error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4146x4096 and 1x12582912)

Do I need to change the quantize option? What could be wrong here?

danielhanchen commented 3 months ago

You're using the bitsandbytes 4-bit version, which I'm not sure SageMaker supports
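To illustrate the difference: the -bnb-4bit repo ships pre-quantized weights, so whatever loads it must have bitsandbytes available in the container. A minimal local load, assuming transformers, accelerate, and bitsandbytes are installed:

from transformers import AutoModelForCausalLM

# The checkpoint's config.json carries a quantization_config, so
# from_pretrained requires the bitsandbytes package and a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    device_map="auto",
)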

mleiter696 commented 3 months ago

I am pretty new to this, so I had no idea that second version existed...

I tried unsloth/Meta-Llama-3.1-8B-Instruct instead and now get

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU

which I guess is why I needed the bnb version in the first place. I am running out of ideas; the NVIDIA T4 is the best GPU available to me, so I guess I have to try running it on EC2 without SageMaker.
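Back-of-the-envelope arithmetic suggests this was bound to happen: the fp16 weights alone roughly fill a T4, while the 4-bit checkpoint would leave headroom. A rough sketch, counting weights only:

# Weight memory only; KV cache and activations come on top of this.
params = 8e9                 # Llama 3.1 8B
fp16_gb = params * 2 / 1e9   # ~16 GB: the entire T4 (16 GB) before anything else
nf4_gb = params * 0.5 / 1e9  # ~4 GB once quantized to 4-bit
print(f"fp16: {fp16_gb:.0f} GB, nf4: {nf4_gb:.0f} GB")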

mleiter696 commented 3 months ago

I also tried

"HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct", "HF_MODEL_QUANTIZE" : "bitsandbytes" # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]

but I got RuntimeError: FlashAttention only supports Ampere GPUs or newer, which is why I moved away from the Meta-Llama version in the first place.

I see now that the Google Colab notebook only had the BNB version available, which I suppose is the intended way to run it.

My goal now is to try to make the BNB version work in SageMaker, but I'm not sure what is missing or whether it's possible here.

PAIMNayanDas commented 1 month ago

Hey @mleiter696 , did you find a way to make it work in SageMaker?

danielhanchen commented 1 month ago

For endpoints - maybe Llama 3.2 1B/3B might fit in full 16-bit precision - I'm unsure if Amazon even supports bitsandbytes

nimishbongale commented 1 week ago

> but I got RuntimeError: FlashAttention only supports Ampere GPUs or newer, which is why I moved away from the Meta-Llama version in the first place.

@mleiter696 the ml.g4dn.xlarge instances have the older Turing-generation GPUs; you may want to switch over to the g5 or g6 family of instances (note: they do cost more)
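You can confirm this from inside the container if needed: FlashAttention requires compute capability 8.0 (Ampere) or newer, and a T4 reports 7.5. A one-liner, assuming a CUDA-enabled torch build:

import torch

# A T4 (Turing) prints (7, 5); FlashAttention needs (8, 0) or higher.
print(torch.cuda.get_device_capability())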

> For endpoints - maybe Llama 3.2 1B/3B might fit in full 16-bit precision - I'm unsure if Amazon even supports bitsandbytes

@danielhanchen With bnb, AWS SageMaker supports it out of the box via the TGI images: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher

You can specify the quantization parameters as environment variables. I got a 4-bit bnb-quantized Llama 3.1 8B to run on SageMaker, but I noticed a significant drop in throughput compared to the base model, the only upside being that it takes far less memory to run.
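For reference, the shape of the config this implies (a sketch; HF_MODEL_QUANTIZE and its values come from the TGI launcher docs linked above, the rest mirrors the earlier config):

config = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",  # base weights, not pre-quantized
    "HF_MODEL_QUANTIZE": "bitsandbytes-nf4",              # TGI quantizes at load time
    "SM_NUM_GPUS": "1",
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
}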

With unsloth, I believe the only way to run it on Sagemaker is to use a custom image – feel free to correct me if my understanding is incorrect!