keroro824 / HashingDeepLearning

Codebase for "SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems"

Porting to Cython #7

Open mritunjaymusale opened 4 years ago

mritunjaymusale commented 4 years ago

Is it possible for you to port this to Cython and release it as a PyPI package? It would make it easy for existing DL users (TensorFlow and PyTorch users) to use it natively in their code.

keroro824 commented 4 years ago

Thanks for the suggestion. Let's make this a priority! We'll @ you when it's done.

rahulunair commented 4 years ago

Thank you @keroro824 !

wrathematics commented 4 years ago

Sort of related, but I've been building R bindings.

keroro824 commented 4 years ago

@wrathematics Thanks for contributing 👍

its-sandy commented 3 years ago

Hi, are there any updates on this?

nomadbl commented 2 years ago

> Is it possible for you to port this to Cython and release it as a PyPI package? It would make it easy for existing DL users (TensorFlow and PyTorch users) to use it natively in their code.

I'm also interested in implementing such a thing. But it seems to me the way to do this would be to implement custom layers instead of built-in ones. This could be added to the main codebase once it is tested, rather than released as a separate package.

For example, in PyTorch you would first subclass `torch.autograd.Function` to implement forward and backward operations that perform the hashing and take the active set into account in both forward and back propagation. Cython might not even be needed: you might be able to use Numba and get better performance more easily.
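
A minimal sketch of what such a custom layer could look like, assuming the set of active neurons is handed to the layer explicitly (in SLIDE it would come from LSH lookups over the rows of the weight matrix). The names here (`SparseLinearFunction`, `active_idx`) are placeholders, not anything from this repo:

```python
import torch

class SparseLinearFunction(torch.autograd.Function):
    """Dense layer that only touches a subset of output units ("active
    neurons"). The active set is passed in explicitly here; a real port
    would compute it from hash tables built over the rows of `weight`."""

    @staticmethod
    def forward(ctx, input, weight, bias, active_idx):
        # Use only the active rows of the weight matrix.
        w_active = weight[active_idx]                 # (k, in_features)
        output = input @ w_active.t() + bias[active_idx]
        ctx.save_for_backward(input, weight, bias, active_idx)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias, active_idx = ctx.saved_tensors
        # Gradient w.r.t. the input only involves the active rows.
        grad_input = grad_output @ weight[active_idx]
        # Inactive rows receive no update, mirroring SLIDE's scheme.
        grad_weight = torch.zeros_like(weight)
        grad_weight[active_idx] = grad_output.t() @ input
        grad_bias = torch.zeros_like(bias)
        grad_bias[active_idx] = grad_output.sum(dim=0)
        return grad_input, grad_weight, grad_bias, None

# Hypothetical usage; in practice `active` would come from the LSH tables.
x = torch.randn(8, 128)
W = torch.randn(1024, 128, requires_grad=True)
b = torch.zeros(1024, requires_grad=True)
active = torch.randint(0, 1024, (64,)).unique()
y = SparseLinearFunction.apply(x, W, b, active)
y.sum().backward()
```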

@keroro824 I've actually started doing what I described. I have a question: do you have some justification for only propagating the gradient to the active neurons? It's not obvious to me why this would be a good approximation of the true gradient.

There is another method the math would suggest. The gradient w.r.t. the input of a linear layer is (repeated indices indicate a sum):

y_i = W_{ij} x_j + b_i
dx_k = dy_i W_{ik}

So we can use LSH for calculating the backprop, but we would need more hash tables than the paper suggests: the multiplications in the backprop are by columns of the weight matrix, while the forward prop multiplies by rows of the weight matrix. Did you try something like this?

It would be very interesting to me to implement this.
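
To make the two-sided idea concrete, here is a rough sketch using simple random-hyperplane (SimHash) codes purely for illustration, not necessarily the hash family SLIDE uses: one set of tables over the rows of W drives the forward pass, and a second set over the columns of W selects which entries of dx_k = dy_i W_{ik} to compute in the backward pass. All names are hypothetical.

```python
import torch

def simhash_codes(vectors, planes):
    # Sign pattern of each vector against a fixed set of random hyperplanes.
    return (vectors @ planes > 0).to(torch.int64)

in_features, out_features, n_bits = 128, 1024, 8
W = torch.randn(out_features, in_features)

planes_fwd = torch.randn(in_features, n_bits)    # hashes rows of W and the input x
planes_bwd = torch.randn(out_features, n_bits)   # hashes columns of W and the gradient dy

row_codes = simhash_codes(W, planes_fwd)         # (out_features, n_bits)
col_codes = simhash_codes(W.t(), planes_bwd)     # (in_features, n_bits)

# Forward pass: output units whose row code collides with the code of x are active.
x = torch.randn(in_features)
x_code = simhash_codes(x.unsqueeze(0), planes_fwd)
active_out = (row_codes == x_code).all(dim=1).nonzero().squeeze(1)

# Backward pass: input components whose column code collides with the code of dy
# are the ones whose gradient dx_k = dy_i W_{ik} is estimated to be large.
dy = torch.randn(out_features)
dy_code = simhash_codes(dy.unsqueeze(0), planes_bwd)
active_in = (col_codes == dy_code).all(dim=1).nonzero().squeeze(1)

dx_partial = dy @ W[:, active_in]                # only the selected gradient entries
```

Maintaining the second family of tables as W changes would be the extra cost relative to the scheme in the paper, which I suppose is exactly the trade-off in question.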