lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in PyTorch

use performer for image detection #12

Closed · madurner closed this issue 3 years ago

madurner commented 3 years ago

Hello @lucidrains, thanks for this great work!

I am currently working on image detection with the DETR transformer and am having trouble training it from scratch (mainly because of limited GPU resources^^). So I was looking for ways to improve efficiency and found the "Rethinking Attention with Performers" paper. At the moment I am working through the paper and think I understand the main concept :P So I was wondering: is it possible to exchange the attention layers of DETR with Performer layers? Do you think this could solve my problem of training the DETR transformer from scratch?

lucidrains commented 3 years ago

@madurner I think it is worth trying, as long as you don't expect performance to be as good as with full attention
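
For reference, a minimal sketch of the kind of swap being discussed, using the SelfAttention module from this repo as a stand-in for DETR's nn.MultiheadAttention blocks. The DETR-specific wiring (positional embeddings, padding masks, decoder cross-attention) is omitted and the shapes are only illustrative:

```python
import torch
from performer_pytorch import SelfAttention

# Performer self-attention block; attention cost is linear in sequence length,
# which is what makes DETR's long flattened image-feature sequences cheaper.
attn = SelfAttention(
    dim = 256,      # DETR's default hidden size
    heads = 8,
    causal = False  # bidirectional, encoder-style attention
)

x = torch.randn(2, 1024, 256)  # (batch, flattened H*W feature tokens, dim)
out = attn(x)                  # (2, 1024, 256)
```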

lucidrains commented 3 years ago

@madurner the other thing you could try is reversible networks, from the Reformer paper. i've used it in so many projects and can attest that it works well. it will save you a depth multiple of the memory being used, at the cost of at least a 2x slowdown in training
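
A rough sketch of what that looks like with this repo's Performer wrapper, assuming its reversible flag (activations are recomputed in the backward pass instead of being stored per layer):

```python
import torch
from performer_pytorch import Performer

# Performer stack with reversible residual blocks (Reformer-style):
# activation memory no longer scales with depth, at the cost of
# recomputing activations during the backward pass.
model = Performer(
    dim = 256,
    depth = 6,
    heads = 8,
    causal = False,
    reversible = True
)

x = torch.randn(1, 1024, 256)
out = model(x)  # (1, 1024, 256)
```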

lucidrains commented 3 years ago

@madurner also give these two attention-like architectures a try https://github.com/lucidrains/global-self-attention-network and https://github.com/lucidrains/lambda-networks

both use performer-like linear attention
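
For example, a usage sketch for lambda-networks on image feature maps (argument names as I recall them from that repo's README; treat the exact values as placeholders):

```python
import torch
from lambda_networks import LambdaLayer

# Lambda layer operating on a (batch, channels, height, width) feature map,
# with a local positional context of r x r.
layer = LambdaLayer(
    dim = 32,      # input channels
    dim_out = 32,  # output channels
    r = 23,        # local receptive field for positional lambdas
    dim_k = 16,    # key / query dimension
    heads = 4,     # number of heads
    dim_u = 4      # intra-depth dimension
)

x = torch.randn(1, 32, 64, 64)
out = layer(x)  # (1, 32, 64, 64)
```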

madurner commented 3 years ago

> @madurner I think it is worth trying, as long as you don't expect performance to be as good as with full attention

Hey, thanks for your reply! From the paper it seems that, with the right training, they achieve results similar to the full transformer. Has your experience been different?

> @madurner the other thing you could try is reversible networks, from the Reformer paper. i've used it in so many projects and can attest that it works well. it will save you a depth multiple of the memory being used, at the cost of at least a 2x slowdown in training

Thx for the advice :) I'll check out the paper. In your experience, does the reversible network approach improve training efficiency?

I was also reading the Deformable DETR paper. What do you think about it?

lucidrains commented 3 years ago

@madurner there are a lot of benchmarks lacking in the paper. i think you should just try training it and report back with your findings

reversibility is good! that's one great takeaway from the Reformer paper, and i use it everywhere. it'll save you a ton of memory, at the cost of being slower in training

yeah, DETR is the way to go. attention is all you need :)

madurner commented 3 years ago

@lucidrains sorry for bothering you again 🙈

> reversibility is good! that's one great takeaway from the Reformer paper, and i use it everywhere. it'll save you a ton of memory, at the cost of being slower in training

By slower training, do you mean the time per run or the number of runs needed? That is the big bottleneck I see in DETR.

lucidrains commented 3 years ago

@madurner oh sorry, brain fart :D yea, i think you'll be perfectly satisfied with simply switching from full attention to linear (performer) attention

reversibility is for when you want to trade extra compute for lower memory usage, which isn't what you need here