vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Offline quantization for Pixtral-12B #8566

Open KohakuBlueleaf opened 2 days ago

KohakuBlueleaf commented 2 days ago

🚀 The feature, motivation and pitch

On Linux, the NVIDIA driver doesn't provide "shared memory" (system-RAM fallback) the way Windows does, which makes it impossible to load Pixtral-12B onto a 3090 or 4090.

And since it looks like we don't have any Transformers implementation of Pixtral yet, we can only use the vLLM codebase to load the model.

Would it be possible for vLLM to provide an option/API to create an offline FP8 quantization through the vLLM model loader?
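
For reference, here is a rough sketch of what I mean, assuming the existing runtime quantization argument is the closest starting point. The first call uses vLLM's current quantization="fp8" option (on-the-fly weight quantization at load time); the commented-out save call is purely hypothetical and only illustrates the offline variant being requested. Whether the fp8 path even applies to the Mistral-format Pixtral loader is part of the question.

from vllm import LLM

# Existing behaviour (assuming quantization="fp8" is accepted for this model
# as it is for others): weights are quantized to FP8 on the fly while loading.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    quantization="fp8",
)

# Requested feature (hypothetical; this method does not exist in vLLM):
# write the quantized weights back out as an offline FP8 checkpoint.
# llm.save_quantized_model("pixtral-12b-fp8")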

Alternatives

Although I'm suggesting a new feature ("offline quantization through the vLLM library"), it would also work for me if the vLLM/Mistral team could provide an offline FP8 checkpoint directly.

Additional context

No response


robertgshaw2-neuralmagic commented 2 days ago

cc @mgoin @dsikka

mgoin commented 2 days ago

Hi @KohakuBlueleaf, it seems Pixtral has a Transformers implementation as of a few days ago (https://github.com/huggingface/transformers/pull/33449). Have you given it a try?

KohakuBlueleaf commented 2 days ago

Hi @KohakuBlueleaf, it seems Pixtral has a Transformers implementation as of a few days ago (huggingface/transformers#33449). Have you given it a try?

Is there any official Hugging Face repo for that? Also, I'm wondering whether vLLM's Pixtral implementation can read the Transformers version of Pixtral; it seems the Pixtral implementation in vLLM is designed for the official repo, which has a different file structure than the Transformers one.

KohakuBlueleaf commented 2 days ago

will try this: https://huggingface.co/leafspark/Pixtral-12B-2409-hf

mgoin commented 1 day ago

I am going to try this one: https://huggingface.co/mistral-community/pixtral-12b

suikei-wang commented 1 day ago

Thanks all for the ideas! I am also wondering whether we can load local images instead of URLs for Pixtral. I would like to run a batch evaluation locally. TIA!

KohakuBlueleaf commented 1 day ago

Thanks all for the ideas! I am also wondering whether we can load local images instead of URLs for Pixtral. I would like to run a batch evaluation locally. TIA!

You can just convert the local image to a data URI with base64.
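
A minimal sketch of what that looks like (the file path and the PNG fallback are just placeholders):

import base64
import mimetypes

def image_to_data_uri(path: str) -> str:
    # Guess the MIME type from the file extension, falling back to PNG.
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime or 'image/png'};base64,{payload}"

# Then pass it wherever an image URL is expected in the chat messages, e.g.
# {"type": "image_url", "image_url": {"url": image_to_data_uri("dubu.png")}}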

mgoin commented 1 day ago

So the issue at the moment is the mismatch between how we implemented the model in the "proper Mistral" format for vLLM and how Transformers implemented it, essentially within Llava. I think the simplest way forward would be to "duplicate" our vLLM implementation by adding support for the Transformers configuration. With that in place, we would be able to produce quantized models through LLM Compressor and load them into vLLM.
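
For reference, the flow would look roughly like LLM Compressor's documented FP8 example for a text-only model; treat the snippet below as a sketch, since making it work for Pixtral depends on the Transformers-format support described above, and the vision tower would presumably need to be added to the ignore list.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Placeholder model id; a Pixtral-specific recipe is the open question here.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8 quantization of Linear layers, keeping lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save a checkpoint that vLLM can then load with the quantization auto-detected.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)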

dahwin commented 1 day ago

Why am I getting this error?

from vllm import LLM
from vllm.sampling_params import SamplingParams
import torch
MODEL_NAME = "mistral-community/pixtral-12b"

sampling_params = SamplingParams(max_tokens=100, temperature=0.0)

context_length = 2000
num_device = 2
llm = LLM(
    model=MODEL_NAME,
    speculative_max_model_len=context_length,
    max_seq_len_to_capture=context_length,
    max_model_len=context_length,
    tensor_parallel_size=num_device,
    trust_remote_code=True,
    worker_use_ray=num_device,
    dtype=torch.float16,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.99,
    enforce_eager=True,
    max_num_batched_tokens=context_length,
)

prompt = "Describe this image in one sentence."
image_path = "/kaggle/working/dubu.png" # Update the path to dubu.png

messages = [
    {
        "role":
        "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_path  # Using the local path for dubu.png
                }
            },
        ],
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

I have already installed the latest vLLM and ran !pip install git+https://github.com/huggingface/transformers as well.

Still, I'm getting this error:

Errors

WARNING 09-18 22:39:17 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
2024-09-18 22:39:23,218 INFO util.py:124 -- Outdated packages:
  ipywidgets==7.7.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO 09-18 22:39:24 config.py:1653] Downcasting torch.float32 to torch.float16.
INFO 09-18 22:39:24 config.py:1013] Chunked prefill is enabled with max_num_batched_tokens=2000.
WARNING 09-18 22:39:24 config.py:383] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
/opt/conda/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = _posixsubprocess.fork_exec(
/opt/conda/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = _posixsubprocess.fork_exec(
2024-09-18 22:39:27,085 INFO worker.py:1786 -- Started a local Ray instance.
INFO 09-18 22:39:28 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='mistral-community/pixtral-12b', speculative_config=None, tokenizer='mistral-community/pixtral-12b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistral-community/pixtral-12b, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
INFO 09-18 22:39:29 ray_gpu_executor.py:134] use_ray_spmd_worker: False
(pid=1855) WARNING 09-18 22:39:33 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
(pid=1900) WARNING 09-18 22:39:43 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 09-18 22:39:50 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-18 22:39:50 selector.py:116] Using XFormers backend.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:50 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:50 selector.py:116] Using XFormers backend.
(RayWorkerWrapper pid=1900) /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(RayWorkerWrapper pid=1900)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(RayWorkerWrapper pid=1900) /opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(RayWorkerWrapper pid=1900)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
/opt/conda/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-18 22:39:53 utils.py:981] Found nccl from library libnccl.so.2
INFO 09-18 22:39:53 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 utils.py:981] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-18 22:39:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 09-18 22:39:53 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-18 22:39:53 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7d0b3c61a080>, local_subscribe_port=42163, remote_subscribe_port=None)
INFO 09-18 22:39:53 model_runner.py:997] Starting to load model mistral-community/pixtral-12b...
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=1900) WARNING 09-18 22:39:53 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=1900) INFO 09-18 22:39:53 model_runner.py:997] Starting to load model mistral-community/pixtral-12b...
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] Traceback (most recent call last):
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     self.model_runner.load_model()
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     self.model = get_model(model_config=self.model_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     return loader.load_model(model_config=model_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     model = _initialize_model(model_config, self.load_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     return build_model(
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     return model_class(config=hf_config,
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     self.vision_tower = _init_vision_tower(config)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464]     raise NotImplementedError(msg)
(RayWorkerWrapper pid=1900) ERROR 09-18 22:39:54 worker_base.py:464] NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
ERROR 09-18 22:39:54 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
ERROR 09-18 22:39:54 worker_base.py:464] Traceback (most recent call last):
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-18 22:39:54 worker_base.py:464]     return executor(*args, **kwargs)
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 09-18 22:39:54 worker_base.py:464]     self.model_runner.load_model()
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
ERROR 09-18 22:39:54 worker_base.py:464]     self.model = get_model(model_config=self.model_config,
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
ERROR 09-18 22:39:54 worker_base.py:464]     return loader.load_model(model_config=model_config,
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
ERROR 09-18 22:39:54 worker_base.py:464]     model = _initialize_model(model_config, self.load_config,
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
ERROR 09-18 22:39:54 worker_base.py:464]     return build_model(
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
ERROR 09-18 22:39:54 worker_base.py:464]     return model_class(config=hf_config,
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
ERROR 09-18 22:39:54 worker_base.py:464]     self.vision_tower = _init_vision_tower(config)
ERROR 09-18 22:39:54 worker_base.py:464]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
ERROR 09-18 22:39:54 worker_base.py:464]     raise NotImplementedError(msg)
ERROR 09-18 22:39:54 worker_base.py:464] NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[1], line 10
      8 context_length = 2000
      9 num_device = 2
---> 10 llm = LLM(model=MODEL_NAME, speculative_max_model_len =context_length ,max_seq_len_to_capture=context_length,max_model_len=context_length
     11 , tensor_parallel_size=num_device,trust_remote_code=True ,worker_use_ray=num_device,dtype=torch.float16
     12           , enable_chunked_prefill=True
     13         ,gpu_memory_utilization = 0.99
     14           , enforce_eager=True 
     15           ,max_num_batched_tokens=context_length
     16          ) 
     19 prompt = "Describe this image in one sentence."
     20 image_path = "/kaggle/working/dubu.png" # Update the path to dubu.png

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:178, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
    154     raise TypeError(
    155         "There is no need to pass vision-related arguments anymore.")
    156 engine_args = EngineArgs(
    157     model=model,
    158     tokenizer=tokenizer,
   (...)
    176     **kwargs,
    177 )
--> 178 self.llm_engine = LLMEngine.from_engine_args(
    179     engine_args, usage_context=UsageContext.LLM_CLASS)
    180 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:550, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    548 executor_class = cls._get_executor_cls(engine_config)
    549 # Create the LLM engine.
--> 550 engine = cls(
    551     **engine_config.to_dict(),
    552     executor_class=executor_class,
    553     log_stats=not engine_args.disable_log_stats,
    554     usage_context=usage_context,
    555     stat_loggers=stat_loggers,
    556 )
    558 return engine

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:317, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry)
    313 self.input_registry = input_registry
    314 self.input_processor = input_registry.create_input_processor(
    315     model_config)
--> 317 self.model_executor = executor_class(
    318     model_config=model_config,
    319     cache_config=cache_config,
    320     parallel_config=parallel_config,
    321     scheduler_config=scheduler_config,
    322     device_config=device_config,
    323     lora_config=lora_config,
    324     speculative_config=speculative_config,
    325     load_config=load_config,
    326     prompt_adapter_config=prompt_adapter_config,
    327     observability_config=self.observability_config,
    328 )
    330 if not self.model_config.embedding_mode:
    331     self._initialize_kv_caches()

File /opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py:26, in DistributedGPUExecutor.__init__(self, *args, **kwargs)
     22 # Updated by implementations that require additional args to be passed
     23 # to the _run_workers execute_model call
     24 self.extra_execute_model_run_workers_kwargs: Dict[str, Any] = {}
---> 26 super().__init__(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
     45 self.prompt_adapter_config = prompt_adapter_config
     46 self.observability_config = observability_config
---> 47 self._init_executor()

File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:65, in RayGPUExecutor._init_executor(self)
     62     os.environ["RAY_USAGE_STATS_ENABLED"] = "0"
     64 # Create the parallel GPU workers.
---> 65 self._init_workers_ray(placement_group)
     67 self.input_encoder = msgspec.msgpack.Encoder(enc_hook=encode_hook)
     68 self.output_decoder = msgspec.msgpack.Decoder(
     69     Optional[List[SamplerOutput]])

File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:280, in RayGPUExecutor._init_workers_ray(self, placement_group, **ray_remote_kwargs)
    277 self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
    279 self._run_workers("init_device")
--> 280 self._run_workers("load_model",
    281                   max_concurrent_workers=self.parallel_config.
    282                   max_parallel_loading_workers)
    284 if self.use_ray_spmd_worker:
    285     for pp_rank in range(self.parallel_config.pipeline_parallel_size):

File /opt/conda/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py:411, in RayGPUExecutor._run_workers(self, method, async_run_tensor_parallel_workers_only, all_args, all_kwargs, use_dummy_driver, max_concurrent_workers, *args, **kwargs)
    408 # Start the driver worker after all the ray workers.
    409 if not use_dummy_driver:
    410     driver_worker_output = [
--> 411         self.driver_worker.execute_method(method, *driver_args,
    412                                           **driver_kwargs)
    413     ]
    414 else:
    415     assert self.driver_dummy_worker is not None

File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py:465, in WorkerWrapperBase.execute_method(self, method, *args, **kwargs)
    462 msg = (f"Error executing method {method}. "
    463        "This might cause deadlock in distributed execution.")
    464 logger.exception(msg)
--> 465 raise e

File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py:456, in WorkerWrapperBase.execute_method(self, method, *args, **kwargs)
    454     target = self if self.worker is None else self.worker
    455     executor = getattr(target, method)
--> 456     return executor(*args, **kwargs)
    457 except Exception as e:
    458     # if the driver worker also execute methods,
    459     # exceptions in the rest worker may cause deadlock in rpc like ray
    460     # see https://github.com/vllm-project/vllm/issues/3455
    461     # print the error and inform the user to solve the error
    462     msg = (f"Error executing method {method}. "
    463            "This might cause deadlock in distributed execution.")

File /opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py:183, in Worker.load_model(self)
    182 def load_model(self):
--> 183     self.model_runner.load_model()

File /opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py:999, in GPUModelRunnerBase.load_model(self)
    997 logger.info("Starting to load model %s...", self.model_config.model)
    998 with CudaMemoryProfiler() as m:
--> 999     self.model = get_model(model_config=self.model_config,
   1000                            device_config=self.device_config,
   1001                            load_config=self.load_config,
   1002                            lora_config=self.lora_config,
   1003                            parallel_config=self.parallel_config,
   1004                            scheduler_config=self.scheduler_config,
   1005                            cache_config=self.cache_config)
   1007 self.model_memory_usage = m.consumed_memory
   1008 logger.info("Loading model weights took %.4f GB",
   1009             self.model_memory_usage / float(2**30))

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
     13 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
     14               device_config: DeviceConfig, parallel_config: ParallelConfig,
     15               scheduler_config: SchedulerConfig,
     16               lora_config: Optional[LoRAConfig],
     17               cache_config: CacheConfig) -> nn.Module:
     18     loader = get_model_loader(load_config)
---> 19     return loader.load_model(model_config=model_config,
     20                              device_config=device_config,
     21                              lora_config=lora_config,
     22                              parallel_config=parallel_config,
     23                              scheduler_config=scheduler_config,
     24                              cache_config=cache_config)

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:358, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
    356 with set_default_torch_dtype(model_config.dtype):
    357     with target_device:
--> 358         model = _initialize_model(model_config, self.load_config,
    359                                   lora_config, cache_config,
    360                                   scheduler_config)
    361     model.load_weights(
    362         self._get_weights_iterator(model_config.model,
    363                                    model_config.revision,
   (...)
    366                                        "fall_back_to_pt_during_load",
    367                                        True)), )
    369     for _, module in model.named_modules():

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:172, in _initialize_model(model_config, load_config, lora_config, cache_config, scheduler_config)
    169 """Initialize a model with the given configurations."""
    170 model_class, _ = get_model_architecture(model_config)
--> 172 return build_model(
    173     model_class,
    174     model_config.hf_config,
    175     cache_config=cache_config,
    176     quant_config=_get_quantization_config(model_config, load_config),
    177     lora_config=lora_config,
    178     multimodal_config=model_config.multimodal_config,
    179     scheduler_config=scheduler_config,
    180 )

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:157, in build_model(model_class, hf_config, cache_config, quant_config, lora_config, multimodal_config, scheduler_config)
    147 def build_model(model_class: Type[nn.Module], hf_config: PretrainedConfig,
    148                 cache_config: Optional[CacheConfig],
    149                 quant_config: Optional[QuantizationConfig], *,
    150                 lora_config: Optional[LoRAConfig],
    151                 multimodal_config: Optional[MultiModalConfig],
    152                 scheduler_config: Optional[SchedulerConfig]) -> nn.Module:
    153     extra_kwargs = _get_model_initialization_kwargs(model_class, lora_config,
    154                                                     multimodal_config,
    155                                                     scheduler_config)
--> 157     return model_class(config=hf_config,
    158                        cache_config=cache_config,
    159                        quant_config=quant_config,
    160                        **extra_kwargs)

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py:215, in LlavaForConditionalGeneration.__init__(self, config, multimodal_config, cache_config, quant_config)
    212 self.multimodal_config = multimodal_config
    214 # TODO: Optionally initializes this for supporting embeddings.
--> 215 self.vision_tower = _init_vision_tower(config)
    216 self.multi_modal_projector = LlavaMultiModalProjector(
    217     vision_hidden_size=config.vision_config.hidden_size,
    218     text_hidden_size=config.text_config.hidden_size,
    219     projector_hidden_act=config.projector_hidden_act)
    221 self.language_model = init_vllm_registered_model(
    222     config.text_config, cache_config, quant_config)

File /opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py:195, in _init_vision_tower(hf_config)
    189     return SiglipVisionModel(
    190         vision_config,
    191         num_hidden_layers_override=num_hidden_layers,
    192     )
    194 msg = f"Unsupported vision config: {type(vision_config)}"
--> 195 raise NotImplementedError(msg)

NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
2024-09-18 22:39:59,579 ERROR worker.py:409 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1900, ip=172.19.2.2, actor_id=7baaf6bf819cd8dd6cc28cc101000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f0217cd22c0>)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 999, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
    return build_model(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
    return model_class(config=hf_config,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 215, in __init__
    self.vision_tower = _init_vision_tower(config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 195, in _init_vision_tower
    raise NotImplementedError(msg)
NotImplementedError: Unsupported vision config: <class 'transformers.models.pixtral.configuration_pixtral.PixtralVisionConfig'>
dahwin commented 1 day ago

I have also tried with https://huggingface.co/leafspark/Pixtral-12B-2409-hf

Still getting an error, just a different one.

bet0x commented 20 hours ago

Mine is a little different:

/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in __init__
    self.engine = self._engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 257, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 317, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 999, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 358, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 172, in _initialize_model
    return build_model(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 157, in build_model
    return model_class(config=hf_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/pixtral.py", line 148, in __init__
    for key, value in self.config.vision_config.to_dict().items()
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 264, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'MistralConfig' object has no attribute 'vision_config'
ERROR 09-19 19:09:28 api_server.py:188] RPCServer process died before responding to readiness probe