I think the implementations are the same; the only difference is that the Nomic implementation expects the input to be batch_size x sequence_length x heads x dims, while nn.RoPE expects it to be batch_size x heads x sequence_length x dims.
For instance, the following code shows that they produce the same results:
import mlx.core as mx
import mlx.nn as nn

class NomicRoPE(nn.Module):
    ...

x = mx.ones((100, 100))
rr = nn.RoPE(100)
nr = NomicRoPE(100)
# x[None, None] has shape (1, 1, 100, 100): batch x heads x sequence x dims for nn.RoPE;
# x[None, :, None] has shape (1, 100, 1, 100): batch x sequence x heads x dims for NomicRoPE.
assert (rr(x[None, None]).squeeze() - nr(x[None, :, None]).squeeze()).abs().max() < 1e-4
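For what it's worth, here is a minimal sketch (example sizes are arbitrary, and the transpose calls are my own addition) of applying nn.RoPE to activations stored in the Nomic batch x sequence x heads x dims layout by swapping the sequence and heads axes before and after the call:

import mlx.core as mx
import mlx.nn as nn

B, L, H, D = 2, 16, 4, 64           # arbitrary example sizes
x = mx.random.normal((B, L, H, D))  # batch x sequence x heads x dims

rope = nn.RoPE(D)

# nn.RoPE rotates along the second-to-last axis, so move the sequence axis there first.
y = rope(mx.transpose(x, (0, 2, 1, 3)))   # batch x heads x sequence x dims

# Swap back to the original batch x sequence x heads x dims layout.
y = mx.transpose(y, (0, 2, 1, 3))
print(y.shape)  # (2, 16, 4, 64)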
Let me know if I made a mistake, otherwise feel free to close the issue :-)
Ah! That's definitely it, thank you. Is it possible to add something like this to the documentation? It wasn't obvious from looking here or at the code that it needs a different input shape (although I should have checked in hindsight :D).
Describe the bug
Hi all, I wasn't able to get the same outputs using the nn.RoPE provided in MLX for the Nomic Embed models. I tried using both traditional=True and traditional=False, but it seems like there's a slight difference in the implementations. I ended up rewriting the Nomic version of RoPE in Python. Let me know if there are better ways to contribute a Metal kernel; it would be great to see that added! IIRC our RoPE version is very similar to Flash Attention's and GPT-NeoX's.
I tried looking at the C++ code quickly but wasn't able to find where the implementations really differ. Let me know how I can help!
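(In hindsight, a rough sketch of why both flags looked off, assuming the input was kept in the Nomic batch x sequence x heads x dims layout: nn.RoPE takes positions along the second-to-last axis, so in that layout the rotation lands on the heads axis regardless of the traditional setting. Shapes below are arbitrary.)

import mlx.core as mx
import mlx.nn as nn

x = mx.random.normal((1, 32, 4, 16))  # batch x sequence x heads x dims
for traditional in (False, True):
    rope = nn.RoPE(16, traditional=traditional)
    # Positions are taken along the heads axis here, not the sequence axis,
    # so neither setting reproduces the Nomic outputs for this layout.
    y = rope(x)
    print(traditional, y.shape)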
To Reproduce
Code snippet is in the PR here: https://github.com/taylorai/mlx_embedding_models/pull/4