Closed QuanhuiGuan closed 3 months ago
@QuanhuiGuan Can you share the config.json with the quantization section here? Maybe there is a group size issue that is causing uneven sharding.
Here is the config.json file content. I used the official AWQ code to quantize my model. Could you give me some suggestions? @mgoin
When you say you used the official AWQ code, do you mean https://github.com/vllm-project/vllm/blob/v0.5.0/examples/fp8/quantizer/quantize.py? I hit an obvious error when running AWQ quantization: only one file was generated and config.json was not generated correctly. Could you walk me through your AWQ quantization process?
This is the vLLM docs page for making AWQ models: https://docs.vllm.ai/en/latest/quantization/auto_awq.html. What you linked to seems to be code within the fp8 quantizer, which vLLM doesn't use for AWQ models.
# quant code
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig
model_path = "/path to/Qwen2-7B-Instruct"
quant_path = "/path to/Qwen2-7B-Instruct-v1-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
# quantization_config = AwqConfig(
# bits=quant_config["w_bit"],
# group_size=quant_config["q_group_size"],
# zero_point=quant_config["zero_point"],
# version=quant_config["version"].lower(),
# ).to_dict()
#
# model.model.config.quantization_config = quantization_config
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
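Regarding the earlier comment about config.json not being generated correctly: if save_quantized() ever leaves config.json without a quantization_config section, the commented-out block in the script above can be un-commented to attach one before saving. A minimal sketch, reusing quant_config and the AwqConfig import from the script:

from transformers import AwqConfig

# Build the quantization_config entry from the same settings used for quantization
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# Attach it so save_quantized() writes a quantization_config section into config.json
model.model.config.quantization_config = quantization_config
model.save_quantized(quant_path)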
{
"_name_or_path": "/workspace/llm_models/Qwen2-72B-Instruct/",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 29568,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"quantization_config": {
"bits": 4,
"group_size": 128,
"modules_to_not_convert": null,
"quant_method": "awq",
"version": "gemm",
"zero_point": true
},
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.42.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
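As a quick sanity check on the numbers in this config, here is a sketch of the alignment rule implied by the error further down (assuming each tensor-parallel shard of the down_proj input dimension, i.e. intermediate_size, must contain a whole number of AWQ groups; this is not the exact vLLM loader code):

import json

# Path taken from the serve command below; adjust to your model directory
with open("/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/config.json") as f:
    cfg = json.load(f)

group_size = cfg["quantization_config"]["group_size"]  # 128 in the config above
intermediate_size = cfg["intermediate_size"]           # 29568 in the config above

for tp in (1, 2, 4, 8):
    shard = intermediate_size // tp
    aligned = shard % group_size == 0
    print(f"tp={tp}: shard={shard}, "
          f"{'aligned' if aligned else 'NOT aligned'} to group_size={group_size}")

# With intermediate_size=29568 and group_size=128, only tp=1 is aligned
# (29568 = 128 * 231); for tp=8 the shard is 3696 and 3696 % 128 = 112,
# which matches the ValueError raised during model load below.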
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 21112 \
--model /workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/ \
--served-model-name qwen-72b \
--trust-remote-code \
--max-model-len 8192 \
--chat-template /workspace/template/qwen.tpl \
--disable-custom-all-reduce \
--enforce-eager \
--tensor-parallel-size 8 \
--quantization awq
INFO 07-08 12:32:17 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 12:32:17 api_server.py:207] args: Namespace(host='0.0.0.0', port=21112, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='/workspace/template/qwen.tpl', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['sl-qwen-72b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 07-08 12:32:17 config.py:244] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-08 12:32:17 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 12:32:17 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', speculative_config=None, tokenizer='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=sl-qwen-72b, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size., Traceback (most recent call last):
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/worker.py", line 133, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 243, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.model = get_model(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.model = Qwen2Model(config, cache_config, quant_config)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.layers = nn.ModuleList([
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] Qwen2DecoderLayer(config, cache_config, quant_config)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.mlp = Qwen2MLP(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.down_proj = RowParallelLinear(intermediate_size,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 705, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] self.quant_method.create_weights(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 92, in create_weights
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] raise ValueError(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 243, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 153, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 128, in __init__
[rank0]: super().__init__(model_config, cache_config, parallel_config,
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 42, in __init__
[rank0]: self._init_executor()
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/worker.py", line 133, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 243, in load_model
[rank0]: self.model = get_model(
[rank0]: ^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
[rank0]: return model_class(config=model_config.hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
[rank0]: self.model = Qwen2Model(config, cache_config, quant_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
[rank0]: self.layers = nn.ModuleList([
[rank0]: ^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
[rank0]: Qwen2DecoderLayer(config, cache_config, quant_config)
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
[rank0]: self.mlp = Qwen2MLP(
[rank0]: ^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
[rank0]: self.down_proj = RowParallelLinear(intermediate_size,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 705, in __init__
[rank0]: self.quant_method.create_weights(
[rank0]: File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 92, in create_weights
[rank0]: raise ValueError(
[rank0]: ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
ERROR 07-08 12:32:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 278169 died, exit code: -15
INFO 07-08 12:32:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/workspace/miniconda3/envs/lf/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
vllm 0.5.1
vllm-flash-attn 2.5.9
Same problem here.
@chenchun0629 I am facing the same issue. Could you please let me know if you were able to solve it?
You can try setting q_group_size to 64 (2 GPUs) or 32 (4 GPUs).
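For reference, with the Qwen2-72B config above (intermediate_size = 29568), 29568/2 = 14784 is a multiple of 64 and 29568/4 = 7392 is a multiple of 32, which is presumably why those group sizes load; with tensor-parallel-size 8 the shard is 3696, which is only a multiple of 16, so an even smaller group size would be needed. A sketch of re-quantizing with a smaller group size, reusing the AutoAWQ script posted earlier (paths are placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path to/Qwen2-72B-Instruct"          # placeholder paths, as in the script above
quant_path = "/path to/Qwen2-72B-Instruct-awq-g64"

# Group size chosen so that intermediate_size / tensor_parallel_size stays a whole
# number of groups: 64 for 2 GPUs, 32 for 4 GPUs, 16 for 8 GPUs
quant_config = {"zero_point": True, "q_group_size": 64, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)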
@youkaichao
Your current environment
My environment: vllm 0.4.2+cu117
🐛 Describe the bug
I quantized the model (Qwen2-72B) with AWQ myself. When I try to start the API service with two GPUs it doesn't work, although a single GPU works fine, and in some situations I have to use two GPUs for this service.
Could anyone give me some suggestions?