mleiter696 opened 3 months ago
You're using the bitsandbytes 4-bit version, which I'm unsure SageMaker supports
I am pretty new to this, so I had no idea that second version existed... I tried unsloth/Meta-Llama-3.1-8B-Instruct instead and now get
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
which I guess is why I needed the bnb version in the first place. I am running out of ideas; an NVIDIA T4 is the best GPU available to me, so I guess I have to try running it on an EC2 instance without SageMaker.
I also tried
"HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
"HF_MODEL_QUANTIZE" : "bitsandbytes" # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]
but I got RuntimeError: FlashAttention only supports Ampere GPUs or newer.
which is why I moved from the Meta-Llama version in the first place.
I see now that the Google Colab notebook only had the BNB version available, which I suppose is the intended way to run it.
My goal now is to try to make the BNB version work in SageMaker, but I'm not sure what is missing or whether it's possible.
Hey @mleiter696, did you find a way to make it work in SageMaker?
For endpoints, maybe Llama 3.2 1B/3B might fit in full 16-bit precision; I'm unsure if Amazon even supports bitsandbytes.
but I got RuntimeError: FlashAttention only supports Ampere GPUs or newer.
which is why I moved from the Meta-Llama version in the first place.
@mleiter696 the ml.g4dn.xlarge instances have older GPUs; you may want to switch over to the g5 or g6 family of instances (note: they do cost more).
For endpoints, maybe Llama 3.2 1B/3B might fit in full 16-bit precision; I'm unsure if Amazon even supports bitsandbytes.
@danielhanchen For bnb, AWS SageMaker supports it out of the box with the TGI images: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
You can specify the quantization parameters as environment variables. I got a 4-bit bnb-quantized Llama 3.1 8B to run on SageMaker, but I noticed a significant drop in throughput compared to the base model; the only upside is that it takes far less memory to run.
With Unsloth, I believe the only way to run it on SageMaker is to use a custom image; feel free to correct me if my understanding is incorrect!
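In case it helps anyone else landing here, this is roughly the shape of such an env-variable-driven deployment with the SageMaker Python SDK and the Hugging Face TGI container. It's a minimal sketch rather than the exact setup from this thread; the instance type, token limits, unpinned image version, and role handling are illustrative assumptions.

```python
# Minimal sketch: deploying a bitsandbytes-quantized model to a SageMaker endpoint
# via the Hugging Face TGI container. Values below are illustrative, not the exact
# configuration used in this thread.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker notebook/Studio

# TGI reads these env vars as launcher options; HF_MODEL_QUANTIZE maps to --quantize.
env = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
    "HF_MODEL_QUANTIZE": "bitsandbytes-nf4",  # 4-bit bnb; plain "bitsandbytes" is 8-bit
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
}

model = HuggingFaceModel(
    role=role,
    env=env,
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # TGI DLC; pin a version in practice
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # A10G (Ampere); T4s (g4dn family) are pre-Ampere
)

print(predictor.predict({"inputs": "Hello"}))
```

On a g4dn (T4) instance the same configuration reportedly runs into the pre-Ampere FlashAttention error quoted above, which is where the g5/g6 suggestion comes in.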
I am trying to deploy Unsloth models to SageMaker model endpoints on a T4 GPU (very important, as this is the best GPU I can use).
I matched the transformers and PyTorch versions to those in the Colab notebook.
When I try to deploy this model, I get the following error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4146x4096 and 1x12582912)
Do I need to change the quantize option? What could be wrong here?
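For context, the setup this issue describes is roughly the following. This is an illustrative sketch, not the exact code; the model ID is assumed to be the bnb-4bit checkpoint referenced earlier in the thread, and the quantize value shown is just the option being asked about.

```python
# Illustrative sketch of the configuration described in this issue (not the exact code).
env = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # assumed: the pre-quantized 4-bit checkpoint
    "HF_MODEL_QUANTIZE": "bitsandbytes-nf4",                       # the quantize option in question
}
# env is passed to a HuggingFaceModel and deployed to an ml.g4dn.xlarge (T4) endpoint,
# which is where the shape-mismatch error above occurs.
```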