indi4u / LLM


Deploy any model to Sagemaker with Quantization #1

Open · arvi18 opened this issue 1 year ago

arvi18 commented 1 year ago

I am facing problems with deployment on SageMaker. Instance: ml.g5.2xlarge. With the default config this happens:

Sagemaker deployment failed due to memory error
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction) : 22.20 GiB

This could be solved by loading the model in bfloat16/fp16 instead of float32 (2 bytes per parameter instead of 4), so the weights would fit into the VRAM properly.

I couldn't figure out a way to do so. This is the script I am using to deploy it:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    # use the role attached to the current SageMaker session if available
    role = sagemaker.get_execution_role()
except ValueError:
    # otherwise fall back to an explicitly named execution role
    iam = boto3.client('iam')
    role = iam.get_role(
        RoleName='AmazonSageMaker-ExecutionRole-20230723T133694')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-2B',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300
)
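
For reference, the kind of override I was hoping for would look roughly like the hub config below. I have not verified that the 0.9.3 container actually reads a DTYPE environment variable (newer text-generation-inference launchers map env vars onto CLI flags like --dtype, but I am not sure this image does), so this is a guess rather than a known-good setting:

# guess only: ask the TGI launcher for bf16 weights via an env var (unverified for the 0.9.3 image)
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-2B',
    'SM_NUM_GPUS': json.dumps(1),
    'DTYPE': 'bfloat16',  # assumption: the launcher maps this onto --dtype bfloat16
}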
arvi18 commented 1 year ago

I tried a few things, but no luck:

"hub = { HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B', SM_NUM_GPUS': json.dumps(1),"    

SM_FRAMEWORK_PARAMS': "{precision_mode: 'bf16'}", `NO`
SM_FRAMEWORK_PARAMS': '{"precision": "bfloat16"}', `NO`
SM_FRAMEWORK_PARAMS': '{"precision": "bf16"}', `NO`    
'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'torch.bfloat16'}" `NO`       
'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'bfloat16'}" `NO`

'HF_MODEL_QUANTIZE': 'bitsandbytes' `NO`
Error: ValueError: A device map needs to be passed to run convert models into mixed-int8 format. Please run `.from_pretrained` with `device_map='auto'` (rank=0)

'HF_MODEL_QUANTIZE': 'gptq' `NO`
Error: ValueError: gptq quantization is not supported for AutoModel, you can try to quantize it with `text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID` (rank=0)
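
For what it's worth, the device_map hint in the bitsandbytes error above seems to describe how transformers itself loads a model in 8-bit, not something I can control through the SageMaker env vars. A minimal local sketch of what that error is asking for (using the nsql-2B model ID from my first comment, purely as an illustration, not a fix for the endpoint) would be:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'NumbersStation/nsql-2B'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map='auto' is what the ValueError asks for; load_in_8bit needs bitsandbytes and accelerate installed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    load_in_8bit=True,
)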