microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Implementation of RoPE in YOCO #1554

Closed nkkbr closed 1 month ago

nkkbr commented 1 month ago

In this file:

YOCO/yoco/models/decoder/yoco.py

RoPE is implemented as:

    def build_rel_pos(self, x, start_pos):
        if self._precomputed_freqs_cis is None:
            angle = 1.0 / (self.args.rope_theta ** torch.linspace(0, 1, self.head_dim // 2, dtype=torch.float, device=x.device))
            index = torch.arange(self.args.max_seq_len).to(angle)
            self._precomputed_freqs_cis = index[:, None] * angle

        cos = torch.cos(self._precomputed_freqs_cis[start_pos:start_pos+x.size(1)])
        sin = torch.sin(self._precomputed_freqs_cis[start_pos:start_pos+x.size(1)])
        rel_pos = (cos.to(x.dtype), sin.to(x.dtype))
        return rel_pos

I wonder if the angle should be:

    angle = 1.0 / (self.args.rope_theta ** torch.linspace(0, 1, self.head_dim // 2 + 1, dtype=torch.float, device=x.device))
    angle = angle[:-1]

sunyt32 commented 1 month ago

In practice, the performance is almost the same between these two implementations. We use torch.linspace for simplicity.
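
For readers comparing the two variants, here is a minimal sketch of how the frequency schedules differ. It uses plain Python instead of torch, and the helper names (`linspace`, `inv_freq_linspace`, `inv_freq_standard`) as well as the values `theta = 10000.0` and `head_dim = 128` are illustrative assumptions, not taken from the YOCO code or config. With `half = head_dim // 2`, the quoted `build_rel_pos` uses exponents `i / (half - 1)`, so the endpoint 1 is included, while the variant proposed in the issue uses exponents `i / half`, which is the same as the common `2i / head_dim` form.

```python
import math


def linspace(start, stop, num):
    """Evenly spaced values with both endpoints included, like torch.linspace."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def inv_freq_linspace(theta, head_dim):
    # Schedule from the quoted build_rel_pos:
    # exponents i / (half - 1), so the last exponent is 1.
    half = head_dim // 2
    return [1.0 / theta ** e for e in linspace(0.0, 1.0, half)]


def inv_freq_standard(theta, head_dim):
    # Schedule proposed in the issue:
    # exponents i / half (= 2i / head_dim), endpoint 1 dropped.
    half = head_dim // 2
    return [1.0 / theta ** e for e in linspace(0.0, 1.0, half + 1)[:-1]]


# Illustrative values only (assumed, not read from the YOCO config).
theta, head_dim = 10000.0, 128
a = inv_freq_linspace(theta, head_dim)
b = inv_freq_standard(theta, head_dim)

print("first:", a[0], b[0])      # both schedules start at frequency 1.0
print("last: ", a[-1], b[-1])    # slowest frequency of each schedule
print("ratio:", b[-1] / a[-1])   # equals theta ** (1 / half)
```

Both schedules start at 1.0 and have the same length. They differ most in the slowest component, where the ratio is `theta ** (1 / half)`, roughly 1.15 for the assumed values. The gap shrinks toward the fast end, which is consistent with the reply above that the two behave almost the same in practice.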