vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: mistralai/Mistral-Nemo-Instruct-2407 support #6545

Closed: bjoernpl closed this issue 1 month ago

bjoernpl commented 1 month ago

🚀 The feature, motivation and pitch

Mistral-Nemo-Instruct-2407 apparently outperforms Mixtral at a smaller size, with a longer context length and multilingual support. See https://github.com/mistralai/mistral-inference/#deployment for a Dockerfile (requires updating transformers).

Currently it doesn't run with --tensor-parallel-size=2 on vllm/vllm-openai:latest:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 243, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 153, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 128, in __init__
[rank0]:     super().__init__(model_config, cache_config, parallel_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 42, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 134, in _run_workers
[rank0]:     ] + [output.get() for output in worker_outputs]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 134, in <listcomp>
[rank0]:     ] + [output.get() for output in worker_outputs]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 58, in get
[rank0]:     raise self.result.exception
[rank0]: RuntimeError: start (640) + length (640) exceeds dimension size (1024).

Can't test with tp=1.
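For reference, the same failure should also be reproducible from the offline Python API (a minimal sketch, assuming the model is pulled from the Hugging Face Hub and that the installed transformers is new enough for the Mistral-Nemo config; the prompt and sampling settings are placeholders):

    from vllm import LLM, SamplingParams

    # Load Mistral-Nemo sharded across 2 GPUs (mirrors --tensor-parallel-size=2).
    llm = LLM(
        model="mistralai/Mistral-Nemo-Instruct-2407",
        tensor_parallel_size=2,
    )

    # A trivial generation call; with the bug above, loading already fails before this point.
    print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))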

Alternatives

No response

Additional context

No response

zifeitong commented 1 month ago

Likely related: https://github.com/huggingface/transformers/pull/32050

evannorstrand-mp commented 1 month ago

@mgoin and @zifeitong - I don't think this is ready to be closed. I am trying to run it on two A100s and I'm getting memory errors.

INFO 07-18 22:03:54 api_server.py:215] vLLM API server version 0.5.2
INFO 07-18 22:03:54 api_server.py:216] args: Namespace(host=None, port=8888, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='./Mistral-Nemo-Instruct-2407/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['aiwingman'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-18 22:03:54 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='./Mistral-Nemo-Instruct-2407/', speculative_config=None, tokenizer='./Mistral-Nemo-Instruct-2407/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=aiwingman, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-18 22:04:04 model_runner.py:563] Loading model weights took 23.0574 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 289, in <module>
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 227, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 446, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 374, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 528, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 264, in __init__
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 363, in _initialize_kv_caches
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 78, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 767, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1185, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 407, in forward
[rank0]:     model_output = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 308, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 244, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 82, in forward
[rank0]:     gate_up, _ = self.gate_up_proj(x)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 314, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
[rank0]:     return F.linear(x, layer.weight, bias)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.69 GiB. GPU

mgoin commented 1 month ago

@evannorstrand-mp As noted in the PR, this model has a very large default sequence length. You can see this in the logs: max_seq_len=1024000.

Please set max_model_len to a reasonable number for your use case, such as max_model_len=4096.
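For example (a sketch, not from the original comment; the model path matches the log above and 4096 is the value suggested here; the equivalent flag for the OpenAI-compatible server is --max-model-len 4096):

    from vllm import LLM

    # Cap the context window so the KV-cache profiling run fits in GPU memory,
    # instead of using the model's 1024000-token default.
    llm = LLM(
        model="./Mistral-Nemo-Instruct-2407/",
        max_model_len=4096,
        trust_remote_code=True,
    )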

evannorstrand-mp commented 1 month ago

That would definitely do it. I expected 160 GB of VRAM to be enough to hold the full model. Thanks for the response! I missed this information in the PR.

ParisNeo commented 1 month ago

I have another problem with this model on vLLM. I don't think it is fully supported yet, since I couldn't even make it work on llama.cpp; I think the engines are not yet compatible. So for now the only way to run it is with the mistralai inference engine itself.

nightflight-dk commented 1 month ago

This model's support needs a bit more work:

[rank0]: ValueError: When using LoRA, vocab size must be 32000 >= vocab_size <= 128512 actual: "vocab_size": 131072
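Roughly, the error comes from a range check on the model's vocabulary size, which Mistral-Nemo's 131072-entry vocabulary exceeds. A sketch of the constraint as described by the message (not vLLM's actual code; the constant and function names here are made up for illustration):

    # Sketch: LoRA support in this vLLM version only accepts vocabularies
    # within a fixed range; Mistral-Nemo's 131072 entries fall outside it.
    MIN_LORA_VOCAB_SIZE = 32000
    MAX_LORA_VOCAB_SIZE = 128512

    def check_lora_vocab_size(vocab_size: int) -> None:
        if not (MIN_LORA_VOCAB_SIZE <= vocab_size <= MAX_LORA_VOCAB_SIZE):
            raise ValueError(
                f"When using LoRA, vocab size must be between "
                f"{MIN_LORA_VOCAB_SIZE} and {MAX_LORA_VOCAB_SIZE}, "
                f"got {vocab_size}")

    check_lora_vocab_size(131072)  # raises ValueError for Mistral-Nemo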