jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models

Linear Scaled Embedding Has a Different Implementation? #6

Closed: fahadh4ilyas closed this issue 1 year ago

fahadh4ilyas commented 1 year ago

I compared your code with The Bloke's code for the Linear Scaled Embedding. Somehow there are some differences:

  1. Your code changes the scale with `self.scale = 1/scale`, which makes it a fraction, but then divides `t` by that fractional scale (`t /= self.scale`). The Bloke's code instead multiplies `t` by the fractional scale. Which one is right?
  2. Your code's `max_position_embeddings` seems to stay at 2048, but The Bloke's code changes it according to the max context length. Or did you actually change `max_position_embeddings` in the config file?

Which one follows the implementation from kaiokendev?
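
For context, both implementations are doing kaiokendev-style linear position interpolation: the position indices are compressed so that an extended sequence maps back into the range the model was originally trained on. A minimal sketch of the idea (hypothetical variable names, not quoted from either repo), assuming PyTorch:

```python
import torch

train_len = 2048   # original max_position_embeddings
scale = 4.0        # extension factor: 2048 * 4 = 8192 tokens
seq_len = 8192

# Position indices for the extended sequence...
t = torch.arange(seq_len, dtype=torch.float32)

# ...compressed back into [0, train_len) before the RoPE angles are computed.
t_scaled = t / scale
assert t_scaled.max() < train_len
```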

bloc97 commented 1 year ago

Both are equivalent, but ours tries to follow Hugging Face's format as much as possible for drop-in future compatibility.
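
In other words, dividing the positions by the scale factor and multiplying them by its reciprocal produce identical indices, so the two styles differ only in bookkeeping. A quick check (standalone snippet, assuming PyTorch):

```python
import torch

scale = 4.0
t = torch.arange(8192, dtype=torch.float32)

# Divide by the factor directly...
t_div = t / scale
# ...or pre-invert and multiply; both compress the positions identically.
t_mul = t * (1.0 / scale)

assert torch.allclose(t_div, t_mul)
```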