Possible bug in gradient size of mul_backward_kernel()?

Hi guys,

First of all, congratulations on this great work and thank you for making the code open-source.

While going through the implementation details, I noticed that the Grad type in the mul_backward_kernel() method in https://github.com/princeton-vl/lietorch/blob/0fa9ce8ffca86d985eca9e189a99690d6f3d4df6/lietorch/src/lietorch_cpu.cpp#L124 is accessed from the memory with the dimension N which appears to be the embedding size (or the representation size). Shouldn't this instead be accessed with dimension K https://github.com/princeton-vl/lietorch/blob/0fa9ce8ffca86d985eca9e189a99690d6f3d4df6/lietorch/src/lietorch_cpu.cpp#L119 which is the dimension of the tangent space that satisfies the chain rule described in Eq. (14) of the paper.

Please let me know if this observation is correct or I have misunderstood this as a bug? Thanks in advance. Keep up the great work :)

princeton-vl / lietorch