[Feature request] L_p distances in meshgrid on CUDA

Suppose I have two huge matrices A and B of dimensions a x d and b x d. Suppose further that a tensor of dimensions a x b x d does not fit in my 16 GB of GPU memory, but an a x b matrix does. If I compute matmul(A, B.t()) on the GPU, everything works nicely, since no intermediate a x b x d tensor has to be stored. But if I want to compute something other than A@B.t(), say the L_p distance between each row of A and each row of B (that is, each column of B.t()), I need to keep an a x b x d intermediate tensor on the GPU.

It would be nice to have this implemented in CUDA in the same manner as matrix multiplication, so that the a x b result is produced directly. Any pointers on how to do this are welcome.
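
To make the memory problem concrete, here is a minimal sketch of a possible workaround, not the fused CUDA kernel being requested. It assumes PyTorch, and the function name `pairwise_lp_chunked` and its `chunk_rows` parameter are hypothetical, made up for illustration. It processes A in blocks of rows, so the largest temporary is chunk_rows x b x d instead of a x b x d:

```python
import torch


def pairwise_lp_chunked(A, B, p=2.0, chunk_rows=1024):
    """L_p distance between every row of A (a x d) and every row of B (b x d).

    Hypothetical helper: A is processed in blocks of `chunk_rows` rows, so the
    largest temporary is chunk_rows x b x d rather than a x b x d.
    """
    a, b = A.shape[0], B.shape[0]
    out = torch.empty(a, b, dtype=A.dtype, device=A.device)
    for start in range(0, a, chunk_rows):
        stop = min(start + chunk_rows, a)
        # (chunk, 1, d) - (1, b, d) broadcasts to (chunk, b, d)
        diff = A[start:stop, None, :] - B[None, :, :]
        # Reduce over d right away so only the (chunk, b) slice is kept
        out[start:stop] = diff.abs().pow(p).sum(dim=-1).pow(1.0 / p)
    return out


# Small sizes so the sketch also runs on a CPU-only machine
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(300, 16, device=device)
B = torch.randn(200, 16, device=device)
D = pairwise_lp_chunked(A, B, p=3.0, chunk_rows=64)
```

Smaller values of `chunk_rows` lower peak memory at the cost of more kernel launches, so this is a trade-off rather than a replacement for a single fused kernel. Recent PyTorch releases also provide `torch.cdist(x1, x2, p=...)` for pairwise p-norm distances. I have not checked its memory behaviour for this use case.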

torch / torch7
