unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

AWQ support #464

Open anslin-raj opened 1 month ago

anslin-raj commented 1 month ago

I am facing an error with the vLLM framework when I try to run inference on an Unsloth fine-tuned Llama 3 8B model...

Error:

(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half
INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1
INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
    model_config = ModelConfig(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in __init__
    self._verify_quantization()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].

Code:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks = [RichProgressCallback],
    args = TrainingArguments(
        num_train_epochs = 1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 2048,
        max_steps = 5,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # logging_dir = f"/home/ubuntu/ans/llama3_pipeline/fine_tuning/logs",
    ),
)

trainer_stats = trainer.train()

if True:
    model.save_pretrained_merged(
        "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit",
        tokenizer,
        save_method = "merged_4bit_forced",
    )

VLLM cli:

python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit

Package Versions:

unsloth 2024.4
vllm 0.4.1
NVIDIA-SMI 550.67, Driver Version 550.67, CUDA Version 12.4
Python 3.10.12
torch 2.2.1

Hardware used:

Tesla T4, GPU memory 32 GB, 8-core CPU

Karry11 commented 1 month ago

I think you can refer to this answer: https://github.com/unslothai/unsloth/issues/253. It seems that vLLM currently only supports AWQ 4-bit or 8-bit.

danielhanchen commented 1 month ago

You need to change merged_4bit_forced to merged_16bit
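
For reference, a minimal sketch of that change against the save call from the original post (the vllm_merged_16bit output directory name is just illustrative):

model.save_pretrained_merged(
    "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit",  # illustrative output dir
    tokenizer,
    save_method = "merged_16bit",  # merge the LoRA into 16-bit weights instead of bitsandbytes 4-bit
)

vLLM does not recognize the bitsandbytes 4-bit format (hence "Unknown quantization method: bitsandbytes"), but it can load the merged 16-bit folder directly.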

anslin-raj commented 1 month ago

Thanks for the response @Karry11 @danielhanchen,

I tried merged_16bit, but it requires more VRAM than the 16 GB I have. Is there any other way to run the model in vLLM with a 4-bit quantization method?

sparsh35 commented 1 month ago

Convert it to AWQ if you want to use vLLM; otherwise, use Unsloth inference for 4-bit models.
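
For anyone following along, here is a rough sketch of both routes. The AWQ route uses the third-party AutoAWQ package on a merged 16-bit export (it is not an Unsloth API), the quant_config values are just common defaults, and the folder names are illustrative:

# Route 1: quantize a merged 16-bit export to AWQ, then serve it with vLLM (--quantization awq).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_dir = "vllm_merged_16bit"  # folder produced by save_method = "merged_16bit" (illustrative)
awq_dir    = "vllm_merged_awq"    # illustrative output folder

model = AutoAWQForCausalLM.from_pretrained(merged_dir)
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model.quantize(tokenizer, quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"})
model.save_quantized(awq_dir)
tokenizer.save_pretrained(awq_dir)

# Route 2: skip vLLM and run the 4-bit merge with Unsloth's own inference path.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit",
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enables Unsloth's faster inference mode

Note that Route 1 runs a calibration pass, so the full 16-bit model has to be materialized once during quantization.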

danielhanchen commented 1 month ago

Ye AWQ is nice :) We might be adding an AWQ option for exporting!

subhamiitk commented 1 month ago

What's the current best option if I have to use this 4-bit fine-tuned model with vLLM inference? Is it to convert it to 16-bit and then perform the inference?

danielhanchen commented 1 month ago

@subhamiitk Use model.save_pretrained_merged("location", tokenizer, save_method = "merged_16bit",), then load the result with vLLM.
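
As a usage note, the T4 from the original report does not support bfloat16, so --dtype half is still needed when serving the merged folder. A sketch reusing the paths from this thread (the vllm_merged_16bit directory name is illustrative):

python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 --port=8000 \
    --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit \
    --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit \
    --dtype=half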

anslin-raj commented 1 month ago

Thanks for the consideration @danielhanchen

wrisigo commented 1 week ago

vLLM's MultiLoRA deployment option, plus PEFT's recent feature release for training adapters on top of already-AWQ-quantized models, opens up some really useful possibilities for inference. Mainly, budget GPUs could easily serve multiple adapters under one AWQ model, minimizing the memory footprint and thus pushing faster throughput.

Exporting an AWQ model is great, but I also see value in training adapters on already-AWQ-quantized models. Is there any desire to support this? It would be killer to have Unsloth's performance boosts for this type of fine-tuning, as in the serving sketch below.
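
For context, this is roughly what that deployment already looks like with vLLM's existing flags (the model and adapter paths are placeholders, and it assumes the adapters were trained against the same base model the AWQ checkpoint was quantized from):

python -m vllm.entrypoints.openai.api_server \
    --model <awq-quantized-base-model> \
    --quantization awq \
    --dtype half \
    --enable-lora \
    --lora-modules adapter-a=/path/to/adapter_a adapter-b=/path/to/adapter_b \
    --max-loras 2
# Each request then selects an adapter by passing its name ("adapter-a" or "adapter-b")
# as the "model" field of the OpenAI-compatible API call.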

danielhanchen commented 4 days ago

So sorry for the delay; I just relocated to SF. Exporting to AWQ is on the roadmap for now. Directly fine-tuning AWQ could work as well, but it will require changing fast_dequantize.

anslin-raj commented 3 days ago

@danielhanchen no issues, thanks for the update... ✨

vladrad commented 1 day ago

Fine-tuning an AWQ model would be amazing. I see PEFT support for it landed in transformers: https://github.com/huggingface/transformers/pull/28987. This would be amazing to have; it would mean everyone can just work with AWQ models. @danielhanchen

danielhanchen commented 1 day ago

I'll see what I can do!