khakhulin / compressed-transformer

Compression of NMT transformer model with tensor methods

inference time #6

Open 2020zyc opened 5 years ago

2020zyc commented 5 years ago

Hi, I am puzzled about the inference time of the compressed model. Why is the compressed model more time-consuming? Shouldn't it be faster with fewer parameters (about half of the original)?

thx

saparina commented 5 years ago

Hi! My explanation is that tensor decomposition methods require more mathematical operations: instead of one (highly optimized in PyTorch) matrix multiplication, we have several. I think it is possible to optimize our Tensor Train and Tucker code and make it faster, but it is not obvious how to do that efficiently.
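To make that concrete, here is a minimal sketch (not the repository's actual code) of a TT-factorized linear layer applied core by core: each core contributes a separate small contraction, whereas the dense baseline is a single matmul. The `tt_linear` helper, the core layout `(r_k, n_k, m_k, r_{k+1})`, and all sizes are assumptions for illustration.

```python
import torch

def tt_linear(x, cores):
    """Apply a TT-factorized weight to a batch of inputs, one core at a time.

    x     : (batch, n_1, ..., n_d)  -- input split into its TT modes
    cores : list of d tensors, cores[k] of shape (r_k, n_k, m_k, r_{k+1}), r_0 = r_d = 1
    returns (batch, m_1 * ... * m_d)
    """
    batch = x.shape[0]
    # state: (batch, current rank, remaining input modes, already-produced output modes)
    res = x.reshape(batch, 1, -1, 1)
    for core in cores:                                  # d separate contractions ...
        r_in, n_k, m_k, r_out = core.shape
        b, r, rest, m_prev = res.shape
        res = res.reshape(b, r, n_k, rest // n_k, m_prev)
        # ... instead of one big GEMM: contract the current rank and one input mode
        res = torch.einsum('brnsp,rnmq->bqspm', res, core)
        res = res.reshape(b, r_out, rest // n_k, m_prev * m_k)
    return res.reshape(batch, -1)

# tiny d = 2 check against the dense layer it replaces
G1 = torch.randn(1, 2, 3, 2)          # (r0, n1, m1, r1)
G2 = torch.randn(2, 3, 2, 1)          # (r1, n2, m2, r2)
x  = torch.randn(5, 2, 3)             # batch of 5, input modes (n1, n2)

y_tt = tt_linear(x, [G1, G2])                                   # several small ops
W = torch.einsum('anbr,rcds->bdnc', G1, G2).reshape(6, 6)       # full (out, in) matrix
y_dense = x.reshape(5, -1) @ W.t()                              # one optimized matmul
print(torch.allclose(y_tt, y_dense, atol=1e-5))                 # True
```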

saareliad commented 5 years ago

It could be implemented in one operation with einsum; however, PyTorch does not fully support broadcasting for einsum (it did work for me in numpy, though).

However, I assume that torch.einsum calls many matmul operations "behind the scenes" (like it does in TensorFlow), so it won't be much better.

I also thought about implementing it as a numba kernel (but found that numba does not support einsum either).
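For reference, a minimal numpy sketch of the "one einsum" idea with ellipsis broadcasting over the leading batch/sequence dimensions; the core shapes here are hypothetical (d = 2, boundary ranks already squeezed out) and are not the code discussed in this thread.

```python
import numpy as np

# toy shapes: arbitrary leading (broadcast) dimensions, two input modes (n1, n2)
x  = np.random.randn(8, 10, 4, 6)   # e.g. (batch, seq, n1, n2) -- hypothetical
G1 = np.random.randn(4, 5, 3)       # (n1, m1, r1)  -- first core
G2 = np.random.randn(3, 6, 7)       # (r1, n2, m2)  -- second core

# one einsum over the whole factorized layer; '...' broadcasts the leading dims
y = np.einsum('...ab,acr,rbd->...cd', x, G1, G2)
print(y.shape)                      # (8, 10, 5, 7), i.e. output modes (m1, m2)
```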

2020zyc commented 5 years ago

thanks @saareliad

Can einsum accelerate the many matmul operations produced by the TT/Tucker decomposition?

> It could be implemented in one operation with einsum

And how would you implement it with einsum in one operation?

saareliad commented 5 years ago

> Can einsum accelerate the many matmul operations produced by the TT/Tucker decomposition?
>
> And how would you implement it with einsum in one operation?

Most of the einsum code runs in C++, so it should be faster; I didn't check extensively. I believe that for fully optimized code one would have to rewrite the C++/CUDA kernels.

I compared memory consumption against a Python loop of tensordots (the tt-pytorch implementation), and einsum is better.

I can't publish the full code yet because it's under active research. We changed the TT implementation quite a lot compared to the public GitHub repos and used 4-dimensional tensors as tt.cores. (Note that in this repo the authors "squeezed" the cores into 2-dimensional tensors to use simple matmuls.)

Something like `torch.einsum('adcbr,rdxk,kcym,mbzn->axyzn', x, *tt.cores)` does the job for d=3. Note that it depends on the shape of tt.cores; something similar can be implemented for 2-d cores.
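To make the index pattern above concrete, here is a small self-contained sketch with hypothetical mode sizes and TT-ranks. The exact layout of tt.cores is not given in the thread, so the shapes below are assumptions chosen only so that the equation string type-checks (the boundary ranks are kept as explicit size-1 dimensions on x and on the last core, as the string suggests).

```python
import torch

# hypothetical sizes, only to make the d = 3 contraction concrete
B, n1, n2, n3 = 8, 4, 4, 4              # batch and the three input modes
m1, m2, m3 = 5, 5, 5                    # the three output modes
r0, r1, r2, r3 = 1, 3, 3, 1             # TT-ranks, boundary ranks kept explicit

x = torch.randn(B, n1, n2, n3, r0)      # indexed as 'adcbr'
cores = [
    torch.randn(r0, n1, m1, r1),        # 'rdxk'
    torch.randn(r1, n2, m2, r2),        # 'kcym'
    torch.randn(r2, n3, m3, r3),        # 'mbzn'
]

# one call instead of a Python loop of small matmuls
y = torch.einsum('adcbr,rdxk,kcym,mbzn->axyzn', x, *cores)
print(y.shape)                          # torch.Size([8, 5, 5, 5, 1]); reshape to (B, m1*m2*m3)
```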

I hope that when the research is done it will be published as part of a paper or integrated into nlp-architect.