unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.6k stars · 1.05k forks

Deploying llama3.1 8b instruct to sagemaker model endpoints #865

Open mleiter696 opened 1 month ago

mleiter696 commented 1 month ago
```python
import json

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g4dn.xlarge"
number_of_gpu = 1
health_check_timeout = 300

config = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # number of GPUs used per replica
    "MAX_INPUT_LENGTH": "4096",  # max length of input text
    "MAX_TOTAL_TOKENS": "8192",  # max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # limits the number of tokens processed in parallel during generation
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=sess,
    transformers_version="4.43.3",
    tensorflow_version="2.17.0",
    pytorch_version="2.3.1",
)
```

I am trying to deploy unsloth models to SageMaker model endpoints on a T4 GPU (this is important, as a T4 is the best GPU I can use).

I matched the transformers, TensorFlow, and PyTorch versions to those used in the Colab notebook.

When I try to deploy this model, I get the following error:

```
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4146x4096 and 1x12582912)
```

Do I need to change the quantize option? What could be wrong here?

danielhanchen commented 1 month ago

You're using the bitsandbytes 4-bit version, and I'm not sure whether SageMaker supports it.
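For anyone hitting the same shape mismatch: the `-bnb-4bit` repos store the linear weights pre-packed as 4-bit blobs, so a serving stack that loads them as ordinary fp16 matrices produces matmul shape errors like the one above. A minimal sketch for checking whether a repo is pre-quantized before pointing a serving container at it, assuming its `config.json` carries a `quantization_config` block the way Transformers-quantized checkpoints do (the sample dicts below are hypothetical):

```python
def is_prequantized(model_config: dict) -> bool:
    """Return True if a model's config.json declares bitsandbytes quantization."""
    quant = model_config.get("quantization_config", {})
    return quant.get("quant_method") == "bitsandbytes"

# Hypothetical config.json contents: a -bnb-4bit repo vs. the plain repo.
bnb_config = {"quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": True}}
plain_config = {"model_type": "llama"}

print(is_prequantized(bnb_config))    # True
print(is_prequantized(plain_config))  # False
```

If this returns True, the serving stack must unpack the 4-bit weights itself rather than treating them as fp16 tensors.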

mleiter696 commented 1 month ago

I am pretty new to this, so I had no idea that a second version existed...

I tried `unsloth/Meta-Llama-3.1-8B-Instruct` instead and now get

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
```

which I guess is why I needed the bnb version in the first place. I am running out of ideas; the Nvidia T4 is the best GPU available to me, so I may have to try running it on an EC2 instance without SageMaker.
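A rough sanity check on why the full-precision repo OOMs on a T4 (16 GB): 8B parameters at 2 bytes each nearly fill the card before activations and the KV cache are counted. Back-of-the-envelope arithmetic, treating 8e9 as an approximate parameter count:

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (weights only, no KV cache or activations)."""
    return n_params * bytes_per_param / 2**30

params = 8e9  # approximate parameter count for Llama 3.1 8B

print(round(weight_gib(params, 2.0), 1))  # fp16: ~14.9 GiB -> no headroom on a 16 GB T4
print(round(weight_gib(params, 0.5), 1))  # 4-bit: ~3.7 GiB -> fits comfortably
```

So on a 16 GB card the 4-bit path is essentially mandatory for an 8B model.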

mleiter696 commented 1 month ago

I also tried

```python
"HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
"HF_MODEL_QUANTIZE": "bitsandbytes",  # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]
```

but got `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`, which is why I moved away from the Meta-Llama version in the first place.
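That FlashAttention error is a hardware gate: FlashAttention requires Ampere (compute capability 8.0) or newer, and the T4 is a Turing card at 7.5. A minimal check, written as a pure helper so it runs without a GPU; on the actual instance you would feed it `torch.cuda.get_device_capability()`:

```python
def supports_flash_attention(capability: tuple) -> bool:
    """FlashAttention needs Ampere (compute capability 8.0, SM80) or newer."""
    major, _minor = capability
    return major >= 8

print(supports_flash_attention((7, 5)))  # T4 (Turing): False
print(supports_flash_attention((8, 0)))  # A100 (Ampere): True

# On the instance itself (requires torch with CUDA):
# import torch
# print(supports_flash_attention(torch.cuda.get_device_capability()))
```

No configuration change can make FlashAttention itself run on a T4; the serving stack has to fall back to a non-flash attention path.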

I see now that the Google Colab notebook only had the BNB version available, which I suppose is the intended way to run that one.

I suppose my goal now is to make the BNB version work in SageMaker, but I am not sure what is missing or whether that is possible.
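One avenue worth trying (a sketch, not verified on a T4): point the container at the full-precision repo and let it quantize at load time via `HF_MODEL_QUANTIZE`, using one of the values listed in the comment above. `bitsandbytes-nf4` is hypothetical here as the right choice; whether the container's bitsandbytes path then still insists on FlashAttention on Turing is exactly the open question in this thread:

```python
import json

number_of_gpu = 1

# Hypothetical config: full-precision repo, quantized by the container at load time.
config = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
    "HF_MODEL_QUANTIZE": "bitsandbytes-nf4",  # NF4 matches what the -bnb-4bit repo uses
    "SM_NUM_GPUS": json.dumps(number_of_gpu),
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
}

print(config["HF_MODEL_QUANTIZE"])  # bitsandbytes-nf4
```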