Closed: madurner closed this issue 3 years ago
@madurner I think it is worth trying, as long as you do not expect performance to be as good as with full attention
@madurner the other thing you could try is reversible networks, from the Reformer paper. i've used it for so many projects and can attest that it works well. that will save you a depth multiple of the memory being used, at the cost of at least 2x the speed
@madurner also give these two attention-like architectures a try https://github.com/lucidrains/global-self-attention-network and https://github.com/lucidrains/lambda-networks
both use Performer-like linear attention
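To make the "linear attention" idea above concrete, here is a minimal NumPy sketch of the mechanics: applying a positive feature map to queries and keys and then reassociating the matrix products so the cost is linear in sequence length. Note this uses a simple ReLU feature map as a stand-in; the actual Performer uses FAVOR+ random features, so this only illustrates the reassociation trick, not the real kernel.

```python
import numpy as np

def linear_attention(q, k, v):
    # Performer-style linear attention sketch.
    # A positive feature map phi is applied to q and k; the softmax
    # kernel is replaced by phi(q) @ phi(k).T, which lets us compute
    # (phi(q) @ phi(k).T) @ v as phi(q) @ (phi(k).T @ v),
    # avoiding the n x n attention matrix entirely.
    # ReLU + epsilon is a simplified stand-in for FAVOR+ features.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6
    qp, kp = phi(q), phi(k)               # (n, d) each
    kv = kp.T @ v                         # (d, d_v): cost O(n * d * d_v)
    z = qp @ kp.sum(axis=0)               # (n,): per-query normalizer
    return (qp @ kv) / z[:, None]         # (n, d_v)

n, d = 128, 16
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
out = linear_attention(q, k, v)
print(out.shape)  # (128, 16)
```

The key point for memory: nothing of shape (n, n) is ever materialized, which is why this scales to the long token sequences that make full attention in DETR expensive.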
> @madurner I think it is worth trying, only if you do not expect performances to be as good as with full attention
Hey, thanks for your reply! From the paper, it seems that with the right training they achieve results similar to full attention. Has your experience been different?
> @madurner the other thing you could try is reversible networks, from the Reformer paper. i've used it for so many projects and can attest that it works well. that will save you a depth multiple of the memory being used, at the cost of at least 2x the speed
Thx for this advice :) I'll check out the paper. In your experience, does the reversible-network approach also improve training efficiency?
I was also reading the Deformable DETR paper. What do you think about that?
@madurner there are a lot of benchmarks lacking in the paper. i think you should just try training it and report back your findings
reversibility is good! that's one great takeaway from the Reformer paper, and i use it everywhere. it'll save you a ton of memory, at the cost of being slower in training
yeah, DETR is the way to go. attention is all you need :)
@lucidrains sorry for bothering you again 🙈
> reversibility is good! that's one great takeaway from the Reformer paper, and i use it everywhere. it'll save you a ton of memory, at the cost of being slower in training
By slower training, do you mean wall-clock time or the number of runs? This is the big bottleneck I see in DETR.
@madurner oh sorry, brain fart :D yea, i think you'll be perfectly satisfied with simply switching from full attention to linear (performer) attention
reversibility is when you want to trade off less memory for more compute, which you don't need
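For reference, the memory-for-compute trade-off of reversible layers described above can be sketched in a few lines. This is a minimal NumPy illustration of the Reformer-style reversible residual block, with small tanh maps standing in for the attention and feed-forward sub-blocks; the point is that layer inputs can be recomputed exactly from layer outputs, so activations need not be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.normal(size=(d, d)) * 0.1
W_g = rng.normal(size=(d, d)) * 0.1

def F(x):  # stand-in for an attention sub-block
    return np.tanh(x @ W_f)

def G(x):  # stand-in for a feed-forward sub-block
    return np.tanh(x @ W_g)

def rev_forward(x1, x2):
    # reversible residual layer (RevNet / Reformer style):
    # the input is split into two streams, each updated in turn
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_backward(y1, y2):
    # invert the layer: recompute inputs from outputs, trading
    # extra compute (re-running F and G) for not storing activations
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=(2, 4, d))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_backward(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```

Since F and G are recomputed during the backward pass, each reversible layer costs roughly one extra forward evaluation, which is the "at least 2x the speed" cost mentioned above in exchange for memory that no longer grows with depth.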
Hello @lucidrains, thanks for this great work!
I am currently working on object detection with the DETR transformer and have issues training it from scratch (mainly because of limited GPU resources^^). So I was looking around for ways to improve efficiency and found the "Rethinking Attention with Performers" paper. At the moment I am getting into the paper and think I understand the main concept :P So I was wondering whether it is possible to exchange the attention layers of DETR with Performer layers. Do you think this is possible, and could it solve my problem of training the DETR transformer from scratch?