vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: inconsistent vocab_sizes support for draft and target workers while using Speculative Decoding #5203

Closed: ShangmingCai closed this issue 5 months ago

ShangmingCai commented 5 months ago

🚀 The feature, motivation and pitch

Currently, vLLM with speculative decoding requires the draft model and the target model to have the same vocab size. However, the target model may be larger and trained on more data, so its vocab size may also be slightly larger (consider, for example, using a 0.5B model as the draft and a 72B model as the target). In spec_decode_worker.py, line 564 asserts `all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)`, which prevents the inference pipeline from running. If this line is simply removed, the program may hit a reshape error instead. So, is support for inconsistent vocab sizes possible, for example by padding the draft model's vocab size?
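
To illustrate the padding idea, here is a minimal, hypothetical sketch (not actual vLLM code, and the function name is mine): the draft model's logits could be padded with `-inf` up to the target vocab size, so the extra token ids get zero probability and the proposal and verification tensors have matching shapes.

    import torch

    def pad_draft_logits(draft_logits: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
        """Pad draft logits [..., draft_vocab] to [..., target_vocab].

        The padded slots are filled with -inf, so they receive zero
        probability after softmax and can never be proposed.
        """
        draft_vocab_size = draft_logits.shape[-1]
        if draft_vocab_size >= target_vocab_size:
            return draft_logits
        pad = draft_logits.new_full(
            (*draft_logits.shape[:-1], target_vocab_size - draft_vocab_size),
            float("-inf"),
        )
        return torch.cat([draft_logits, pad], dim=-1)

    # Example: a draft with a 151936-token vocab padded to a 152064-token target.
    logits = torch.randn(2, 151936)
    print(pad_draft_logits(logits, 152064).shape)  # torch.Size([2, 152064])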

Alternatives

No response

Additional context

No response

ShangmingCai commented 5 months ago

Found a way to work around it.

devdev999 commented 5 months ago

How did you work around this? Facing the same issue now with llama3 and tinyllama

ShangmingCai commented 5 months ago

How did you work around this? Facing the same issue now with llama3 and tinyllama

You can modify the model file (i.e., vllm/model_executor/models/your_tinyllama_model.py) to concatenate embed_tokens.weight with a padding tensor, then change the vocab_size param in your_tinyllama_path/config.json to match your llama3 model.
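
For reference, a rough self-contained sketch of that concatenation; the helper name and the example sizes (Llama 3's 128256-entry vocab, a 32000 x 2048 TinyLlama embedding) are mine, not vLLM code:

    import torch

    def pad_embed_tokens(loaded_weight: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
        """Append zero rows to the draft's embed_tokens.weight so that its
        row count matches the target model's vocab size."""
        draft_vocab_size, hidden_size = loaded_weight.shape
        if draft_vocab_size >= target_vocab_size:
            return loaded_weight
        padding = torch.zeros(
            (target_vocab_size - draft_vocab_size, hidden_size),
            dtype=loaded_weight.dtype,
            device=loaded_weight.device,
        )
        return torch.cat([loaded_weight, padding], dim=0)

    # Example: pad a 32000-row TinyLlama embedding to Llama 3's 128256-entry vocab.
    weight = torch.zeros(32000, 2048)
    print(pad_embed_tokens(weight, 128256).shape)  # torch.Size([128256, 2048])

In the model file this would presumably be applied to the loaded embed_tokens.weight (and to an untied lm_head.weight, if present) right before it is handed to the weight loader.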

AU-CPU commented 4 months ago

Found a way to work around it.

Have you solved the problem of inconsistent vocab_size? If so, could you show me how you did it?

naturomics commented 4 months ago

I have a simple workaround.

my case:

  • vllm version: 0.5.1
  • draft model: Qwen1.5-0.5B-Chat, with vocab size 151936
  • target model: Qwen1.5-14B-Chat, with vocab size 152064

method:

  1. update the vocab_size config of draft model (Qwen1.5-0.5B-Chat/config.json) to 152064
  2. modify the following two lines to pad vocab embedding tensor in vllm/model_executor/layers/vocab_parallel_embedding.py:

line 326

    # original
    assert loaded_weight.shape[output_dim] == self.org_vocab_size

    # modified
    assert loaded_weight.shape[output_dim] <= self.org_vocab_size   # from `==`  to  `<=`

line 329

  # original
  loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

  # modified: only narrow when the checkpoint tensor has at least shard_size rows
  if shard_size <= loaded_weight.shape[0]:
      loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

That's all.
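
If useful, here is a hedged example of how the patched draft might then be launched with speculative decoding under vLLM 0.5.x; the paths are placeholders and the engine arguments are the ones I believe that version exposes:

    from vllm import LLM, SamplingParams

    # Assumed vLLM 0.5.x offline API; the draft path points at the patched
    # checkpoint, the target checkpoint is unmodified.
    llm = LLM(
        model="Qwen/Qwen1.5-14B-Chat",
        speculative_model="/path/to/patched/Qwen1.5-0.5B-Chat",
        num_speculative_tokens=5,
        use_v2_block_manager=True,  # spec decode needed the v2 block manager around this version, as I recall
    )

    outputs = llm.generate(
        ["Speculative decoding lets the target model"],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)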

LUXUS1 commented 4 months ago

I have a simple workaround.

my case:

  • vllm version: 0.5.1
  • draft model: Qwen1.5-0.5B-Chat, with vocab size 151936
  • target model: Qwen1.5-14B-Chat, with vocab size 152064

method:

  1. update the vocab_size config of draft model (Qwen1.5-0.5B-Chat/config.json) to 152064
  2. modify the following two lines to pad vocab embedding tensor in vllm/model_executor/layers/vocab_parallel_embedding.py:

line 326

    # original
    assert loaded_weight.shape[output_dim] == self.org_vocab_size

    # modified
    assert loaded_weight.shape[output_dim] <= self.org_vocab_size   # from `==`  to  `<=`

line 329

  # original
  loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

  # modified: only narrow when the checkpoint tensor has at least shard_size rows
  if shard_size <= loaded_weight.shape[0]:
      loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

That's all.

I tried the aforementioned method using Qwen2-7B-Instruct and Qwen2-0.5B-Instruct, and it works on a single GPU. However, on multiple GPUs it fails with the error shown in the attached screenshot.

ShangmingCai commented 4 months ago

I have a simple workaround. my case:

  • vllm version: 0.5.1
  • draft model: Qwen1.5-0.5B-Chat, with vocab size 151936
  • target model: Qwen1.5-14B-Chat, with vocab size 152064

method:

  1. update the vocab_size config of draft model (Qwen1.5-0.5B-Chat/config.json) to 152064
  2. modify the following two lines to pad vocab embedding tensor in vllm/model_executor/layers/vocab_parallel_embedding.py:

line 326

    # original
    assert loaded_weight.shape[output_dim] == self.org_vocab_size

    # modified
    assert loaded_weight.shape[output_dim] <= self.org_vocab_size   # from `==`  to  `<=`

line 329

  # original
  loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

  # modified: only narrow when the checkpoint tensor has at least shard_size rows
  if shard_size <= loaded_weight.shape[0]:
      loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

That's all.

I tried the aforementioned method using Qwen2-7B-Instruct and Qwen2-0.5B-Instruct, and it works on a single GPU. However, on multiple GPUs, it shows the following error

Try modifying line 400 of vllm/model_executor/models/qwen2.py from `weight_loader(param, loaded_weight)` to:

    if name == "model.embed_tokens.weight":
        if loaded_weight.shape[0] == 151936:
            # Pad the 151936-row embedding with 128 zero rows to reach the
            # target's 152064-entry vocab; the second dim (1024 here) must
            # match the draft embedding's hidden size.
            padding = torch.zeros((128, 1024), dtype=loaded_weight.dtype, device=loaded_weight.device)
            loaded_weight = torch.cat([loaded_weight, padding], dim=0)
        weight_loader(param, loaded_weight)
    else:
        weight_loader(param, loaded_weight)

You might actually have to apply the padding to the weights themselves in order to pass the many tensor-shape assertion checks.