hszhao / SAN

Exploring Self-attention for Image Recognition, CVPR2020.
MIT License

Customized subtraction and aggregation implementation is much slower than the PyTorch implementation #13

Open mzy97 opened 4 years ago

mzy97 commented 4 years ago

Environment: PyTorch 1.5.1, CUDA 10.1; tested on a small input tensor (2, 8, 5, 5).

When using the test method in lib/sa/functions to measure speed, I found that the corresponding implementation using the PyTorch API is much faster than your CUDA code in backward propagation (about 50x faster). The forward times of the two are close, with the customized API slightly faster than the torch API.

[screenshot: forward/backward timing comparison]

So why did you choose to implement the operation yourself?
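For reference, the pure-PyTorch comparison can be expressed with `F.unfold`. Below is a minimal sketch; the function name `aggregation_unfold` and the exact channel-grouping layout are assumptions for illustration, not code copied from `lib/sa/functions`:

```python
import torch
import torch.nn.functional as F

def aggregation_unfold(x, w, kernel_size, stride, padding, dilation):
    # Footprint-weighted aggregation built only from standard PyTorch ops (sketch).
    # x: (N, C, H, W) value features.
    # w: (N, C_w, kernel_size**2, H_out * W_out) attention weights,
    #    assumed to be shared across C // C_w channel groups.
    n, c, h, width = x.shape
    c_w = w.shape[1]
    h_out = (h + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    w_out = (width + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    # (N, C * k*k, L): the k*k neighbours of every output position, per channel.
    cols = F.unfold(x, kernel_size, dilation=dilation, padding=padding, stride=stride)
    cols = cols.view(n, c // c_w, c_w, kernel_size ** 2, h_out * w_out)
    # Broadcast the shared weights over the channel groups and sum over the footprint.
    out = (cols * w.unsqueeze(1)).sum(-2)
    return out.view(n, c, h_out, w_out)
```

Timing this against the CUDA extension (with `torch.cuda.synchronize()` around the forward and backward calls) gives the kind of numbers quoted above.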

luogen1996 commented 3 years ago


That's interesting. Aggregation is an irregular operation; I have tried to implement it using torch.einsum, but the time and memory costs are much higher than with the customized API. I'm interested in your implementation. Could you provide the corresponding PyTorch code?
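For illustration, an einsum formulation of the aggregation might look like the sketch below (a hypothetical example, not necessarily the code that was actually tried). It has to materialize the full unfolded neighbour tensor before the contraction, which is where the extra memory and time go compared with a fused CUDA kernel:

```python
import torch
import torch.nn.functional as F

def aggregation_einsum(x, w, kernel_size, stride, padding, dilation):
    # Same aggregation written with torch.einsum (illustrative sketch).
    n, c, h, width = x.shape
    c_w = w.shape[1]
    h_out = (h + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    w_out = (width + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    # Materialize all k*k neighbours: this intermediate is kernel_size**2 times the size of x.
    cols = F.unfold(x, kernel_size, dilation=dilation, padding=padding, stride=stride)
    cols = cols.view(n, c // c_w, c_w, kernel_size ** 2, h_out * w_out)
    # g: channel groups sharing one weight map, k: footprint positions, l: output pixels.
    out = torch.einsum('ngckl,nckl->ngcl', cols, w)
    return out.reshape(n, c, h_out, w_out)
```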