I compared your code with The Bloke's code for Linear Scaled Embedding, and there are some differences:
Your code sets `self.scale = 1/scale`, which makes it a fraction, but then divides `t` by that fractional scale (`t /= self.scale`). The Bloke's code multiplies `t` by the fractional scale instead. Which one is right?
In your code, `max_position_embeddings` seems to stay at 2048, but The Bloke's code changes it according to the maximum context length. Or did you actually change `max_position_embeddings` in the config file?
Which one follows the implementation from kaiokendev?
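
To make the difference concrete, here is a minimal sketch of the two variants as I understand them. It uses plain Python instead of the real rotary embedding class, and the helper `scaled_positions` plus the 2048 and 8192 context lengths are my own assumptions for illustration, not taken from either repository.

```python
def scaled_positions(seq_len, factor, mode):
    """Return position indices 0..seq_len-1 after applying a scale factor.

    Hypothetical helper for illustration only.
    mode="multiply" computes t * factor, mode="divide" computes t / factor.
    """
    positions = [float(i) for i in range(seq_len)]
    if mode == "multiply":
        return [p * factor for p in positions]
    if mode == "divide":
        return [p / factor for p in positions]
    raise ValueError(f"unknown mode: {mode}")


# Assumed values: a model trained on 2048 positions, extended to 8192.
trained_ctx = 2048
target_ctx = 8192
fraction = trained_ctx / target_ctx  # 0.25, the "fractional" scale

multiplied = scaled_positions(target_ctx, fraction, "multiply")
divided = scaled_positions(target_ctx, fraction, "divide")

# Multiplying by the fraction keeps the largest position below 2048.
print(max(multiplied))  # 2047.75
# Dividing by the same fraction pushes the largest position far above 2048.
print(max(divided))     # 32764.0
```

If the goal is to squeeze the longer context back into the position range the model was trained on, only the multiplied version does that under these assumptions, which is why the two snippets look contradictory to me unless `scale` is already a fraction before `1/scale` is applied.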