vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

serving 4-bit trained model #2311

Closed dinonovak closed 5 months ago

dinonovak commented 7 months ago

I am training a model based on "mistralai/Mistral-7B-Instruct-v0.1",

but I am unable to serve it using the docker image vllm/vllm-openai:latest.

I am executing python3 -m vllm.entrypoints.openai.api_server --model --gpu-memory-utilization 0.90

I tried rebooting the instance and now I am only getting the following error, constantly repeating:

../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8970,0,0], thread: [36,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [6527,0,0], thread: [34,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
...
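
As far as I understand it, that ScatterGatherKernel assert means some index (often a token id going into an embedding lookup) is out of range. A quick check, sketched here with a hypothetical path to the trained checkpoint, is whether the fine-tuned tokenizer has grown past the model's vocab_size:

from transformers import AutoConfig, AutoTokenizer

ckpt = "/path/to/trained-checkpoint"  # hypothetical path, substitute the real one
tokenizer = AutoTokenizer.from_pretrained(ckpt)
config = AutoConfig.from_pretrained(ckpt)

# If added special tokens pushed the tokenizer past the embedding size,
# lookups can index out of bounds and trigger asserts like the one above.
print("tokenizer size:", len(tokenizer))
print("model vocab_size:", config.vocab_size)

If the tokenizer is larger, the embeddings would need to be resized (resize_token_embeddings) before saving the model.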

training parameters below:

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant is a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with the same length
# (saves memory and speeds up training considerably)
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

dinonovak commented 7 months ago

I am trying to serve the model with

python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model=$MODEL_DIR \
    --tokenizer=hf-internal-testing/llama-tokenizer

but the error returned is:

INFO 12-31 11:44:10 api_server.py:719] args: Namespace(host='127.0.0.1', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='/mistral/mistralai-Instruct-mssql', tokenizer='hf-internal-testing/llama-tokenizer', revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 12-31 11:44:10 llm_engine.py:73] Initializing an LLM engine with config: model='/mistral/mistralai-Instruct-mssql', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
tokenizer_config.json: 100%|██████████| 700/700 [00:00<00:00, 2.81MB/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 14.6MB/s]
tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 3.59MB/s]
special_tokens_map.json: 100%|██████████| 411/411 [00:00<00:00, 2.13MB/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 146, in _init_workers
    self._run_workers(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 729, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 328, in load_weights
    param = params_dict[name]
KeyError: 'base_model.model.lm_head.base_layer.weight'

dinonovak commented 7 months ago

When I set the tokenizer to the model directory

python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model=$MODEL_DIR \
    --tokenizer=$MODEL_DIR

I am still getting the same error. I suspect that the model weights need to be loaded differently, but I am not sure how?

INFO 12-31 11:56:21 api_server.py:719] args: Namespace(host='127.0.0.1', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='/mistral/mistralai-Instruct-mssql', tokenizer='/mistral/mistralai-Instruct-mssql', revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 12-31 11:56:21 llm_engine.py:73] Initializing an LLM engine with config: model='/mistral/mistralai-Instruct-mssql', tokenizer='/mistral/mistralai-Instruct-mssql', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 146, in _init_workers
    self._run_workers(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 729, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 328, in load_weights
    param = params_dict[name]
KeyError: 'base_model.model.lm_head.base_layer.weight'
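
For what it's worth, names like base_model.model.lm_head.base_layer.weight are the naming convention PEFT uses for LoRA adapter checkpoints, and vLLM's load_weights only knows the plain base-model parameter names. If that is what is in the model directory here, a common workaround is to merge the adapter into the base model and point --model at the merged directory. A minimal sketch, assuming a standard PEFT adapter checkpoint and hypothetical paths:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "mistralai/Mistral-7B-Instruct-v0.1"
adapter_dir = "/mistral/mistralai-Instruct-mssql"         # hypothetical: the trained adapter
merged_dir = "/mistral/mistralai-Instruct-mssql-merged"   # hypothetical output directory

# Load the base model in fp16 (not 4-bit) so the merged weights can be saved as-is.
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights

model.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_name).save_pretrained(merged_dir)
# merged_dir can then be passed to --model when starting the vLLM server.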

Deanozk commented 7 months ago

Note that the file is mistral.py vs mixtral.py, which would seem to be an incorrect model selection. Here is the code that is causing the error:

        else:
            # Skip loading extra bias for GPTQ models.
            if name.endswith(".bias") and name not in params_dict:
                continue
            param = params_dict[name]  # this is the line raising the KeyError; you can debug the name, but I think it should be calling mixtral.py, not mistral.py
            weight_loader = getattr(param, "weight_loader",
                                    default_weight_loader)
            weight_loader(param, loaded_weight)

Please check that you are requesting Mixtral and not Mistral: the traceback shows mistral.py being invoked, while both mistral.py and mixtral.py exist under /models/. Be careful with the naming of the model.

Deanozk commented 7 months ago

Please check the source for mistral.py and note that mixtral.py should actually have been called instead. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mistral.py
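
As far as I can tell, vLLM picks the implementation from the architectures field in the checkpoint's config.json, so a quick way to see which file will be invoked is to print it (the path below is a placeholder for the fine-tuned checkpoint):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("/path/to/checkpoint")  # placeholder path
# e.g. ["MistralForCausalLM"] selects models/mistral.py, ["MixtralForCausalLM"] selects models/mixtral.py
print(config.architectures)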

Deanozk commented 7 months ago

python3 -m vllm.entrypoints.openai.api_server --model --gpu-memory-utilization 0.90

This does not look right: the model name is missing, and that is most likely the problem. However, the Mistral code should probably have detected this earlier and given a different response if it had the wrong name.
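
If it helps, one way to sanity-check the checkpoint path outside of the OpenAI server is to load it with vLLM's offline LLM class; a minimal sketch with a hypothetical path:

from vllm import LLM, SamplingParams

# Hypothetical path; it must point at a full (merged) Hugging Face checkpoint.
llm = LLM(model="/path/to/merged-checkpoint", gpu_memory_utilization=0.90)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)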