Open anslin-raj opened 1 month ago
https://github.com/unslothai/unsloth/issues/253 ,I think you can refer to this answer; it seems that vLLM currently only supports AWQ-4b or 8b
You need to change merged_4bit_forced
to merged_16bit
Thanks for the response @Karry11 @danielhanchen,
I tried merged_16bit, and it required more VRAM, but I only have 16 GB VRAM, is there any other way to run the model in VLLM with 4-bit quantization method?
Convert it to AWQ if want to use VLLM , other wise Unsloth inference for 4bit models
Ye AWQ is nice :) We might be adding a AWQ option for exporting!
What's the current best option if I have to use this 4bit finetuned model using vLLM inference- Is it to convert it to 16bit and then perform the inference?
@subhamiitk Use model.save_pretrained_merged("location", tokenizer, save_method = "merged_16bit",)
then use vLLM
Thanks for the consideration @danielhanchen
vLLM's MultiLoRA deployment option + PEFT's recent feature release - training adapters on top of already AWQ quantized models opens up some really useful possibilities for inference. Mainly, budget GPU's could easily serve multiple adapters under one awq model - aka minimizing memory footprint thus pushing faster throughput.
Exporting an AWQ model is great, but I also see value in training adapters on already AWQ quantized models. Is there any desire to support this? Would be killer to have unsloth's performance boosts for this type of fine tuning.
So sorry on the delay - just relocated to SF - exporting to AWQ is for now on the roadmap - directly finetuning AWQ could work as well, but will require changing fast_dequantize
@danielhanchen no issues, thanks for the update... ✨
Finetuning a AWQ image would be amazing. I see it has support for PEFT in transformers https://github.com/huggingface/transformers/pull/28987 . this would be amazing to have, it would mean everyone can just work with awq models. @danielhanchen
I'll see what I can do!
I have faced an error with the VLLM framework when I tried to inferencing an Unsloth fine-tuned LLAMA3-8b model...
Error:
(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1 INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in
engine = AsyncLLMEngine.from_engine_args(
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
model_config = ModelConfig(
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in init
self._verify_quantization()
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].
Code:
model, tokenizer = FastLanguageModel.from_pretrained( model_name = "meta-llama/Meta-Llama-3-8B", max_seq_length = max_seq_length, dtype = dtype, load_in_4bit = load_in_4bit, )
model = FastLanguageModel.get_peft_model( model, r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128 target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",], lora_alpha = 16, lora_dropout = 0, # Supports any, but = 0 is optimized bias = "none", # Supports any, but = "none" is optimized
[NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
)
trainer = SFTTrainer( model = model, tokenizer = tokenizer, train_dataset = dataset, dataset_text_field = "text", max_seq_length = max_seq_length, dataset_num_proc = 2, packing = False, # Can make training 5x faster for short sequences. callbacks=[RichProgressCallback], args = TrainingArguments(
num_train_epochs=1,
)
trainer_stats = trainer.train() if True: model.save_pretrained_merged("/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit", tokenizer, save_method="merged_4bit_forced",)
VLLM cli:
python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit
Package Versions:
unsloth 2024.4 vllm 0.4.1 NVIDIA-SMI 550.67 Driver Version 550.67 CUDA Version 12.4 Python 3.10.12 torch 2.2.1
Hardware used:
Tesla T4 GPU Memory 32 GB 8 core CPU