Open KohakuBlueleaf opened 2 days ago
cc @mgoin @dsikka
Hi @KohakuBlueleaf it seems Pixtral has a transformers implementation as of a few days ago with https://github.com/huggingface/transformers/pull/33449, have you given it a try?
Is there any official Hugging Face repo for that? Also, I'm wondering if vLLM's Pixtral implementation can read the Transformers version of Pixtral. It seems the Pixtral implementation in vLLM is designed for the official repo, which has a different file structure than the Transformers one.
I will try this: https://huggingface.co/leafspark/Pixtral-12B-2409-hf
I am going to try this one: https://huggingface.co/mistral-community/pixtral-12b
Thanks all for the ideas! I am also wondering if we can load local images instead of URLs for Pixtral? I would like to do a batch local evaluation. TIA!
You can just convert the local image to a data URI with base64.
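For example, a minimal sketch (the helper name here is my own):

import base64
import mimetypes

def image_to_data_uri(path: str) -> str:
    # Guess the MIME type from the file extension, defaulting to PNG.
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Then pass it anywhere a URL is expected, e.g.
# {"type": "image_url", "image_url": {"url": image_to_data_uri("dubu.png")}}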
So the issue at the moment is the mismatch between how we implemented the model in the "proper Mistral" format for vLLM and how Transformers implemented it, essentially within Llava. I think the simplest way forward would be to "duplicate" our vLLM implementation by adding support for the Transformers configuration. With that done, we would be able to produce quantized models through LLM Compressor and load them into vLLM.
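For reference, a rough sketch of what that flow could look like with llm-compressor once a Transformers-compatible Pixtral implementation is wired up. This follows llm-compressor's generic FP8 dynamic example; whether AutoModelForCausalLM and this ignore list are right for Pixtral is an assumption:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistral-community/pixtral-12b"  # assumes a Transformers-loadable checkpoint

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of the Linear layers; keep the LM head
# (and, presumably, the vision tower) in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

The resulting checkpoint could then be served with vLLM directly, which is the point of the exercise.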
from vllm import LLM
from vllm.sampling_params import SamplingParams
import torch

MODEL_NAME = "mistral-community/pixtral-12b"

sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
context_length = 2000
num_device = 2

llm = LLM(
    model=MODEL_NAME,
    speculative_max_model_len=context_length,
    max_seq_len_to_capture=context_length,
    max_model_len=context_length,
    tensor_parallel_size=num_device,
    trust_remote_code=True,
    worker_use_ray=True,  # worker_use_ray is a boolean flag
    dtype=torch.float16,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.99,
    enforce_eager=True,
    max_num_batched_tokens=context_length,
)
prompt = "Describe this image in one sentence."
image_path = "/kaggle/working/dubu.png" # Update the path to dubu.png
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Using the local path for dubu.png; note that vLLM expects a
                # URL here, so a local file should be passed as a data URI
                # (see the base64 suggestion above).
                "image_url": {"url": image_path},
            },
        ],
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
WARNING 09-18 22:39:17 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
2024-09-18 22:39:23,218 INFO util.py:124 -- Outdated packages:
ipywidgets==7.7.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO 09-18 22:39:24 config.py:1653] Downcasting torch.float32 to torch.float16.
INFO 09-18 22:39:24 config.py:1013] Chunked prefill is enabled with max_num_batched_tokens=2000.
WARNING 09-18 22:39:24 config.py:383] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
/opt/conda/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = _posixsubprocess.fork_exec(
/opt/conda/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = _posixsubprocess.fork_exec(
2024-09-18 22:39:27,085 INFO worker.py:1786 -- Started a local Ray instance.
INFO 09-18 22:39:28 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='mistral-community/pixtral-12b', speculative_config=None, tokenizer='mistral-community/pixtral-12b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistral-community/pixtral-12b, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
INFO 09-18 22:39:29 ray_gpu_executor.py:134] use_ray_spmd_worker: False
(pid=1855) WARNING 09-18 22:39:33 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
(pid=1900) WARNING 09-18 22:39:43 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 09-18 22:39:50 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-18 22:39:50 selector.py:116] Using XFormers backend.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:50 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:50 selector.py:116] Using XFormers backend.
(RayWorkerWrapper pid=1900) /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(RayWorkerWrapper pid=1900) @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
(RayWorkerWrapper pid=1900) /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(RayWorkerWrapper pid=1900) @torch.library.impl_abstract("xformers_flash::flash_bwd")
/opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-18 22:39:53 utils.py:981] Found nccl from library libnccl.so.2
INFO 09-18 22:39:53 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 utils.py:981] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-18 22:39:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 09-18 22:39:53 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-18 22:39:53 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7d0b3c61a080>, local_subscribe_port=42163, remote_subscribe_port=None)
INFO 09-18 22:39:53 model_runner.py:997] Starting to load model mistral-community/pixtral-12b...
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=1900) WARNING 09-18 22:39:53 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 model_runner.py:997] Starting to load model mistral-community/pixtral-12b...
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] Traceback (most recent call last):
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] return executor(*args, **kwargs)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] self.model_runner.load_model()
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] self.model = get_model(model_config=self.model_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] return loader.load_model(model_config=model_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] model = _initialize_model(model_config, self.load_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] return build_model(
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] return model_class(config=hf_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] self.vision_tower = _init_vision_tower(config)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] raise NotImplementedError(msg)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
ERROR 09-18 22:39:54 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
ERROR 09-18 22:39:54 worker_base.py:464] Traceback (most recent call last):
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-18 22:39:54 worker_base.py:464] return executor(*args, **kwargs)
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 09-18 22:39:54 worker_base.py:464] self.model_runner.load_model()
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
ERROR 09-18 22:39:54 worker_base.py:464] self.model = get_model(model_config=self.model_config,
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
ERROR 09-18 22:39:54 worker_base.py:464] return loader.load_model(model_config=model_config,
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
ERROR 09-18 22:39:54 worker_base.py:464] model = _initialize_model(model_config, self.load_config,
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
ERROR 09-18 22:39:54 worker_base.py:464] return build_model(
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
ERROR 09-18 22:39:54 worker_base.py:464] return model_class(config=hf_config,
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
ERROR 09-18 22:39:54 worker_base.py:464] self.vision_tower = _init_vision_tower(config)
ERROR 09-18 22:39:54 worker_base.py:464] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
ERROR 09-18 22:39:54 worker_base.py:464] raise NotImplementedError(msg)
ERROR 09-18 22:39:54 worker_base.py:464] NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[1], line 10
8 context_length = 2000
9 num_device = 2
---> 10 llm = LLM(model=MODEL_NAME, speculative_max_model_len =context_length ,max_seq_len_to_capture=context_length,max_model_len=context_length
11 , tensor_parallel_size=num_device,trust_remote_code=True ,worker_use_ray=num_device,dtype=torch.float16
12 , enable_chunked_prefill=True
13 ,gpu_memory_utilization = 0.99
14 , enforce_eager=True
15 ,max_num_batched_tokens=context_length
16 )
19 prompt = "Describe this image in one sentence."
20 image_path = "/kaggle/working/dubu.png" # Update the path to dubu.png
File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:178, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
154 raise TypeError(
155 "There is no need to pass vision-related arguments anymore.")
156 engine_args = EngineArgs(
157 model=model,
158 tokenizer=tokenizer,
(...)
176 **kwargs,
177 )
--> 178 self.llm_engine = LLMEngine.from_engine_args(
179 engine_args, usage_context=UsageContext.LLM_CLASS)
180 self.request_counter = Counter()
File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:550, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
548 executor_class = cls._get_executor_cls(engine_config)
549 # Create the LLM engine.
--> 550 engine = cls(
551 **engine_config.to_dict(),
552 executor_class=executor_class,
553 log_stats=not engine_args.disable_log_stats,
554 usage_context=usage_context,
555 stat_loggers=stat_loggers,
556 )
558 return engine
File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:317, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry)
313 self.input_registry = input_registry
314 self.input_processor = input_registry.create_input_processor(
315 model_config)
--> 317 self.model_executor = executor_class(
318 model_config=model_config,
319 cache_config=cache_config,
320 parallel_config=parallel_config,
321 scheduler_config=scheduler_config,
322 device_config=device_config,
323 lora_config=lora_config,
324 speculative_config=speculative_config,
325 load_config=load_config,
326 prompt_adapter_config=prompt_adapter_config,
327 observability_config=self.observability_config,
328 )
330 if not self.model_config.embedding_mode:
331 self._initialize_kv_caches()
File /opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py:26, in DistributedGPUExecutor.__init__(self, *args, **kwargs)
22 # Updated by implementations that require additional args to be passed
23 # to the _run_workers execute_model call
24 self.extra_execute_model_run_workers_kwargs: Dict[str, Any] = {}
---> 26 super().__init__(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
45 self.prompt_adapter_config = prompt_adapter_config
46 self.observability_config = observability_config
---> 47 self._init_executor()
File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:65, in RayGPUExecutor._init_executor(self)
62 os.environ["RAY_USAGE_STATS_ENABLED"] = "0"
64 # Create the parallel GPU workers.
---> 65 self._init_workers_ray(placement_group)
67 self.input_encoder = msgspec.msgpack.Encoder(enc_hook=encode_hook)
68 self.output_decoder = msgspec.msgpack.Decoder(
69 Optional[List[SamplerOutput]])
File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:280, in RayGPUExecutor._init_workers_ray(self, placement_group, **ray_remote_kwargs)
277 self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
279 self._run_workers("init_device")
--> 280 self._run_workers("load_model",
281 max_concurrent_workers=self.parallel_config.
282 max_parallel_loading_workers)
284 if self.use_ray_spmd_worker:
285 for pp_rank in range(self.parallel_config.pipeline_parallel_size):
File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:411, in RayGPUExecutor._run_workers(self, method, async_run_tensor_parallel_workers_only, all_args, all_kwargs, use_dummy_driver, max_concurrent_workers, *args, **kwargs)
408 # Start the driver worker after all the ray workers.
409 if not use_dummy_driver:
410 driver_worker_output = [
--> 411 self.driver_worker.execute_method(method, *driver_args,
412 **driver_kwargs)
413 ]
414 else:
415 assert self.driver_dummy_worker is not None
File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py:465, in WorkerWrapperBase.execute_method(self, method, *args, **kwargs)
462 msg = (f"Error executing method {method}. "
463 "This might cause deadlock in distributed execution.")
464 logger.exception(msg)
--> 465 raise e
File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py:456, in WorkerWrapperBase.execute_method(self, method, *args, **kwargs)
454 target = self if self.worker is None else self.worker
455 executor = getattr(target, method)
--> 456 return executor(*args, **kwargs)
457 except Exception as e:
458 # if the driver worker also execute methods,
459 # exceptions in the rest worker may cause deadlock in rpc like ray
460 # see https://github.com/vllm-project/vllm/issues/3455
461 # print the error and inform the user to solve the error
462 msg = (f"Error executing method {method}. "
463 "This might cause deadlock in distributed execution.")
File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py:183, in Worker.load_model(self)
182 def load_model(self):
--> 183 self.model_runner.load_model()
File /opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py:999, in GPUModelRunnerBase.load_model(self)
997 logger.info("Starting to load model %s...", self.model_config.model)
998 with CudaMemoryProfiler() as m:
--> 999 self.model = get_model(model_config=self.model_config,
1000 device_config=self.device_config,
1001 load_config=self.load_config,
1002 lora_config=self.lora_config,
1003 parallel_config=self.parallel_config,
1004 scheduler_config=self.scheduler_config,
1005 cache_config=self.cache_config)
1007 self.model_memory_usage = m.consumed_memory
1008 logger.info("Loading model weights took %.4f GB",
1009 self.model_memory_usage / float(2**30))
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
13 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
14 device_config: DeviceConfig, parallel_config: ParallelConfig,
15 scheduler_config: SchedulerConfig,
16 lora_config: Optional[LoRAConfig],
17 cache_config: CacheConfig) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 parallel_config=parallel_config,
23 scheduler_config=scheduler_config,
24 cache_config=cache_config)
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:358, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
356 with set_default_torch_dtype(model_config.dtype):
357 with target_device:
--> 358 model = _initialize_model(model_config, self.load_config,
359 lora_config, cache_config,
360 scheduler_config)
361 model.load_weights(
362 self._get_weights_iterator(model_config.model,
363 model_config.revision,
(...)
366 "fall_back_to_pt_during_load",
367 True)), )
369 for _, module in model.named_modules():
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:172, in _initialize_model(model_config, load_config, lora_config, cache_config, scheduler_config)
169 """Initialize a model with the given configurations."""
170 model_class, _ = get_model_architecture(model_config)
--> 172 return build_model(
173 model_class,
174 model_config.hf_config,
175 cache_config=cache_config,
176 quant_config=_get_quantization_config(model_config, load_config),
177 lora_config=lora_config,
178 multimodal_config=model_config.multimodal_config,
179 scheduler_config=scheduler_config,
180 )
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:157, in build_model(model_class, hf_config, cache_config, quant_config, lora_config, multimodal_config, scheduler_config)
147 def build_model(model_class: Type[nn.Module], hf_config: PretrainedConfig,
148 cache_config: Optional[CacheConfig],
149 quant_config: Optional[QuantizationConfig], *,
150 lora_config: Optional[LoRAConfig],
151 multimodal_config: Optional[MultiModalConfig],
152 scheduler_config: Optional[SchedulerConfig]) -> nn.Module:
153 extra_kwargs = _get_model_initialization_kwargs(model_class, lora_config,
154 multimodal_config,
155 scheduler_config)
--> 157 return model_class(config=hf_config,
158 cache_config=cache_config,
159 quant_config=quant_config,
160 **extra_kwargs)
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py:215, in LlavaForConditionalGeneration.__init__(self, config, multimodal_config, cache_config, quant_config)
212 self.multimodal_config = multimodal_config
214 # TODO: Optionally initializes this for supporting embeddings.
--> 215 self.vision_tower = _init_vision_tower(config)
216 self.multi_modal_projector = LlavaMultiModalProjector(
217 vision_hidden_size=config.vision_config.hidden_size,
218 text_hidden_size=config.text_config.hidden_size,
219 projector_hidden_act=config.projector_hidden_act)
221 self.language_model = init_vllm_registered_model(
222 config.text_config, cache_config, quant_config)
File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py:195, in _init_vision_tower(hf_config)
189 return SiglipVisionModel(
190 vision_config,
191 num_hidden_layers_override=num_hidden_layers,
192 )
194 msg = f"Unsupported vision config: {type(vision_config)}"
--> 195 raise NotImplementedError(msg)
NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
2024-09-18 22:39:59,579 ERROR worker.py:409 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1900, ip=172.19.2.2, actor_id=7baaf6bf819cd8dd6cc28cc101000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f0217cd22c0>)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
raise e
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
return executor(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
self.model = get_model(model_config=self.model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
model = _initialize_model(model_config, self.load_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
return build_model(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
return model_class(config=hf_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
self.vision_tower = _init_vision_tower(config)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
raise NotImplementedError(msg)
NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
I have also tried with https://huggingface.co/leafspark/Pixtral-12B-2409-hf and am still getting an error, though a different one.
Mine is a little different:
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in __init__
self.engine = self._engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 257, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 317, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 999, in load_model
self.model = get_model(model_config=self.model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
model = _initialize_model(model_config, self.load_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
return build_model(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
return model_class(config=hf_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/pixtral.py", line 148, in __init__
for key, value in self.config.vision_config.to_dict().items()
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 264, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'MistralConfig' object has no attribute 'vision_config'
ERROR 09-19 19:09:28 api_server.py:188] RPCServer process died before responding to readiness probe
🚀 The feature, motivation and pitch
On Linux, the NVIDIA driver doesn't provide "shared memory" (spilling VRAM over into system RAM) the way Windows does, which makes it impossible to load Pixtral 12B onto a 3090 or 4090.
And since it looks like we don't have any Transformers implementation of Pixtral, we can only use the vLLM codebase to load the model.
Is it possible for vLLM to provide an option/API to create an offline FP8 quantization through the vLLM model loader?
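For context, vLLM already exposes an in-flight variant via the quantization argument, which downcasts weights to FP8 at load time rather than producing an offline checkpoint. A sketch, assuming this scheme applies cleanly to Pixtral (and noting that the full-precision weights may still pass through memory during loading, which may be exactly the problem here):

from vllm import LLM

# Weights are quantized to FP8 on the fly at load time; no offline checkpoint.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    quantization="fp8",
)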
Alternatives
Although I'm suggesting a new feature ("make offline quantization possible through the vLLM library"), it would also work for me if the vLLM/Mistral team could provide an offline FP8 checkpoint directly.
Additional context
No response