vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The implementation of DynamicNTKScalingRotaryEmbedding may have errors. #5093

Open macheng6 opened 4 months ago

macheng6 commented 4 months ago

Your current environment

The output of `python collect_env.py`

I noticed that the implementation of dynamic NTK recomputes the `base` parameter once, from the maximum length, and uses it for all sequence lengths. Isn't this inconsistent with the transformers implementation, which recomputes `base` from the actual sequence length?

šŸ› Describe the bug

```python
def _compute_cos_sin_cache(self) -> torch.Tensor:
    # NOTE(woosuk): self.max_position_embeddings is the original
    # maximum length before applying the rope scaling.
    # Thus, the maximum length after applying the rope scaling is
    # self.max_position_embeddings * self.scaling_factor.
    max_len = self.max_position_embeddings * self.scaling_factor
    base = self.base * (
        (self.scaling_factor * max_len / self.max_position_embeddings) -
        (self.scaling_factor - 1))**(self.rotary_dim /
                                     (self.rotary_dim - 2))
    inv_freq = self._compute_inv_freq(base)
    t = torch.arange(max_len, dtype=torch.float)

    freqs = torch.einsum("i,j -> ij", t, inv_freq)
    cos = freqs.cos()
    sin = freqs.sin()
    cache = torch.cat((cos, sin), dim=-1)
    return cache
```
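For comparison, here is a minimal sketch (not vLLM code) of the *dynamic* NTK rule as used by transformers: `base` is rescaled from the actual `seq_len` whenever it exceeds the original context length, so different prompt lengths produce different frequency sets. The snippet above instead fixes `max_len = max_position_embeddings * scaling_factor` once at init, which is static NTK scaling.

```python
import torch

def dynamic_ntk_inv_freq(base: float, rotary_dim: int, seq_len: int,
                         max_position_embeddings: int,
                         scaling_factor: float) -> torch.Tensor:
    """Sketch of the dynamic NTK rule: `base` is rescaled from the
    current seq_len, not from a fixed precomputed maximum."""
    if seq_len > max_position_embeddings:
        base = base * (
            (scaling_factor * seq_len / max_position_embeddings)
            - (scaling_factor - 1)
        ) ** (rotary_dim / (rotary_dim - 2))
    return 1.0 / (base ** (
        torch.arange(0, rotary_dim, 2, dtype=torch.float) / rotary_dim))

# Two different prompt lengths yield two different frequency sets,
# which a single cache precomputed at init cannot reproduce.
f_short = dynamic_ntk_inv_freq(10000.0, 128, 2048, 2048, 2.0)
f_long = dynamic_ntk_inv_freq(10000.0, 128, 8192, 2048, 2.0)
```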
ymwangg commented 4 months ago

Yes, I noticed a similar issue. The current "dynamic" NTK scaling is actually static NTK scaling. It may be tricky and inefficient to implement true dynamic NTK in a model server that needs to handle many concurrent requests.
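To illustrate why this is tricky, a naive dynamic implementation (hypothetical sketch, not vLLM code) would have to rebuild the whole cos/sin cache every time a request exceeds the longest length seen so far, which is wasteful when requests of mixed lengths arrive concurrently:

```python
import torch

class NaiveDynamicNTKCache:
    """Hypothetical sketch: rebuilds the entire cos/sin cache whenever a
    request exceeds the longest sequence length seen so far."""

    def __init__(self, base: float, rotary_dim: int, max_pos: int,
                 scaling: float):
        self.base, self.rotary_dim = base, rotary_dim
        self.max_pos, self.scaling = max_pos, scaling
        self.cached_len = 0
        self.cache = None
        self.rebuilds = 0  # counts full recomputations

    def get(self, seq_len: int) -> torch.Tensor:
        if seq_len > self.cached_len:
            base = self.base
            if seq_len > self.max_pos:
                # Dynamic NTK: rescale base from the current length.
                base = base * ((self.scaling * seq_len / self.max_pos)
                               - (self.scaling - 1)
                               ) ** (self.rotary_dim / (self.rotary_dim - 2))
            inv_freq = 1.0 / (base ** (
                torch.arange(0, self.rotary_dim, 2,
                             dtype=torch.float) / self.rotary_dim))
            t = torch.arange(seq_len, dtype=torch.float)
            freqs = torch.einsum("i,j->ij", t, inv_freq)
            self.cache = torch.cat((freqs.cos(), freqs.sin()), dim=-1)
            self.cached_len = seq_len
            self.rebuilds += 1
        return self.cache[:seq_len]

cache = NaiveDynamicNTKCache(10000.0, 128, 2048, 2.0)
for n in (1024, 4096, 3000, 8192):  # interleaved request lengths
    _ = cache.get(n)
# Every new maximum length forces a full recomputation (3 rebuilds here).
```

Note that this sketch is itself subtly wrong: under dynamic NTK, a change of `base` alters the cos/sin values for *earlier* positions too, so reusing the cached prefix for shorter requests serves them with frequencies computed for a different length. Handling that correctly in a batched server needs per-length caches or on-the-fly computation.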

Missmiaom commented 2 months ago

+1

qiufengyuyi commented 2 months ago

+1