I also tried setting `tensor_parallel_size=2` with two L20 GPUs, but it still does not work; the error is the same.

```python
llm = LLM(
    model=model_name,
    tensor_parallel_size=2,
    max_model_len=4096,
    trust_remote_code=True,
    enforce_eager=True,
)
```
You can reduce `max_num_seqs` (e.g. to 2) to avoid the OOM.
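For example, a minimal sketch of what that looks like (model path taken from the log below; `max_num_seqs` is forwarded by `LLM` to the engine arguments):

```python
from vllm import LLM

# max_num_seqs caps how many sequences are batched together, including the
# dummy multi-modal sequences built during the start-up profile run, which
# appears to be where this run fails (the traceback starts right after the
# "Starting profile run" log line).
llm = LLM(
    model="/home/dataset/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,
    max_num_seqs=2,        # suggested value; raise it later if memory allows
    trust_remote_code=True,
    enforce_eager=True,
)
```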
Yeah, not a bug; it's been asked about before.
Your current environment
The output of `python collect_env.py`
```
WARNING 11-22 07:19:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 11-22 07:19:19 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-22 07:19:19 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/home/dataset/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/home/dataset/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/dataset/Llama-3.2-11B-Vision-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-22 07:19:19 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 11-22 07:19:19 selector.py:115] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-22 07:19:20 model_runner.py:1056] Starting to load model /home/dataset/Llama-3.2-11B-Vision-Instruct...
INFO 11-22 07:19:20 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:01, 2.97it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02, 1.24it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:02, 1.01s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:01, 1.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.03s/it]
INFO 11-22 07:19:26 model_runner.py:1067] Loading model weights took 19.9073 GB
INFO 11-22 07:19:26 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
pixel_values is torch.Size([256, 1, 1, 4, 3, 560, 560])
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/dataset/test_vllm_models/test_llama.py", line 21, in
```

Model Input Dumps
No response
🐛 Describe the bug
```python
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from vllm.assets.image import ImageAsset
import argparse

parser = argparse.ArgumentParser()

from decord import VideoReader, cpu

parser.add_argument('--batch_size', type=int, default=1)
parser.add_argument('--input_len', type=int, default=128)
parser.add_argument('--output_len', type=int, default=1024)
args = parser.parse_args()

model_name = "/home/dataset/Llama-3.2-11B-Vision-Instruct"

llm = LLM(
    model=model_name,
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    enforce_eager=True,
)
sampling_params = SamplingParams(temperature=1, max_tokens=args.output_len, ignore_eos=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Please describe the image in detail.'
TEMPLATE = "<|im_start|>User\n{prompt}<|im_end|>\n<|im_start|>Assistant\n"

prompt = f"<|image|><|begin_of_text|>{query}\n"
prompt = TEMPLATE.format(prompt=prompt)

image = Image.open("3.png")

inputs = [{
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
} for _ in range(args.batch_size)]

print("**** model generate begin **")
import time
start = time.time()

outputs = llm.generate(inputs, sampling_params=sampling_params)
cost = time.time() - start
print("model total cost is ", cost)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
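For context, the `256` in the logged `pixel_values` shape `torch.Size([256, 1, 1, 4, 3, 560, 560])` matches vLLM's default `max_num_seqs`, which is why the suggestion above targets that knob. Below is a hedged sketch of this script's engine setup with both ideas from the thread combined; the default value and the profile-run behaviour are my reading of the log, not something confirmed in the thread.

```python
from vllm import LLM

model_name = "/home/dataset/Llama-3.2-11B-Vision-Instruct"

# Hypothetical adjustment of the engine setup above: two-way tensor
# parallelism across the L20s plus a much smaller max_num_seqs, so the
# multi-modal profile run does not try to build 256 dummy image sequences
# at once.
llm = LLM(
    model=model_name,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=2,
    gpu_memory_utilization=0.9,  # vLLM default; lower it if the GPUs are shared
    trust_remote_code=True,
    enforce_eager=True,
)
```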
Before submitting a new issue...