Closed: johnma2006 closed this 11 months ago
Thanks @johnma2006, I didn't check the gradient when indexing with a list/tensor (fixed now).
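For context, the usual way to backprop through integer-array indexing is a scatter-add over the indices, so gradients at repeated positions accumulate. A minimal sketch in plain NumPy (assuming a NumPy-backed autograd; the names here are illustrative, not this repo's actual implementation):

import numpy as np

def index_backward(grad_output, index, input_shape):
    # Scatter-add the upstream gradient back into the input's shape.
    # np.add.at accumulates correctly when the same index appears more than once,
    # which is exactly the case a plain `grad[index] += grad_output` gets wrong.
    grad_input = np.zeros(input_shape, dtype=grad_output.dtype)
    np.add.at(grad_input, index, grad_output)
    return grad_input

# Example: a 2-element parameter indexed with [0, 0]
print(index_backward(np.ones(2), np.array([0, 0]), (2,)))   # [2. 0.]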
btw, how did you find this?
I found it by inspection. I'm working on a similar project, and I remembered that slicing was annoying to get right, so I paid closer attention to how you were doing it.
NumPy and PyTorch slicing don't seem consistent with one another; for instance, here is an example (try it in both NumPy and PyTorch):
embedding = Parameter(rand(50257, 512))   # Parameter and rand come from this library
token_batch = [[0, 1], [3, 4]]            # raw nested Python list used as an index
tok_emb = embedding[token_batch]
print(tok_emb.shape)                      # this is (2, 2, 512) in numpy, (2,) in torch
tok_emb.sum().backward()
print(embedding.grad.shape)
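One way to sidestep the discrepancy is to wrap the indices in an explicit integer array/tensor rather than a raw nested Python list. A small sketch (behaviour with raw nested lists can vary between NumPy/PyTorch versions):

import numpy as np
import torch

emb_np = np.random.rand(50257, 512)
emb_pt = torch.rand(50257, 512)
idx = [[0, 1], [3, 4]]

# With an explicit (2, 2) integer index, both libraries agree:
print(emb_np[np.array(idx)].shape)       # (2, 2, 512)
print(emb_pt[torch.tensor(idx)].shape)   # torch.Size([2, 2, 512])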
Hi,
Great work on this library. I enjoyed reading through it. There is a small bug in the tensor slicing backprop implementation. Here is a minimal repro example:
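A minimal sketch of this sort of repro, assuming the library's Parameter, .backward(), and .grad API as used elsewhere in this thread (names illustrative):

x = Parameter([1.0, 2.0])   # assumes Parameter accepts a Python list; a 2-element parameter
y = x[[0, 0]]               # index position 0 twice
y.sum().backward()
print(x.grad)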
which gives

[1, 0]

whereas

[2, 0]

is correct. This affects the embedding layers in GPT.

Best,
John