Open kweonwooj opened 5 years ago
As I understand it, we have a GLU that expands the dimension to n x 2d, and then the conv rescales it back to n x d, right? I still don't understand how the softmax is applied in LightweightConv. Do we softmax all of the conv layer's kernel weights?
Moreover, I'm not clear on the authors' weight sharing, since I'm trying to re-implement this architecture.
Could you give me some explanation?
Thank you so much.
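For what it's worth, the softmax in the paper normalizes each kernel only along its width (the k temporal taps), not over all of the conv layer's weights, and one normalized kernel is shared by every d/H channels of a head. Here is a minimal pure-Python sketch of that reading (the names `glu`, `lightweight_conv`, and `softmax` are illustrative helpers, not fairseq's actual modules):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def glu(x2d):
    # gated linear unit for one timestep: 2d features -> d features,
    # first half gated by sigmoid of the second half
    d = len(x2d) // 2
    a, b = x2d[:d], x2d[d:]
    return [ai * (1.0 / (1.0 + math.exp(-bi))) for ai, bi in zip(a, b)]

def lightweight_conv(x, weights, H):
    # x: n x d input (list of timesteps); weights: H x k raw kernels.
    # Each kernel is softmax-normalized over its k taps, and all d // H
    # channels of a head share the same normalized kernel (weight sharing).
    n, d = len(x), len(x[0])
    k = len(weights[0])
    norm = [softmax(w) for w in weights]
    out = []
    for t in range(n):
        row = []
        for c in range(d):
            w = norm[c // (d // H)]  # one kernel per head, shared by channels
            acc = 0.0
            for j in range(k):
                ti = t - (k - 1) + j  # causal window ending at timestep t
                if 0 <= ti < n:
                    acc += w[j] * x[ti][c]
            row.append(acc)
        out.append(row)
    return out
```

Because each kernel is softmax-normalized, the convolution computes a convex (attention-like) average over the last k timesteps of each channel.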
Abstract
lightweight convolution and dynamic convolution, a convolution whose kernel is a function of the timestep; it is lightweight, its cost is linear in the input length, and it performs better than or on par with self-attention in machine translation, summarization and language modeling

Details
Background
Light-weight Convolution

weight sharing across channels (H = 16 heads) + softmax-normalized depth-wise convolution

Dynamic Convolution
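Dynamic convolution extends the lightweight variant by predicting the kernel from the current timestep's input with a learned linear layer, instead of using a fixed learned kernel. A minimal single-head pure-Python sketch of that idea (the names `dynamic_conv`, `Wq`, and `softmax` are my own illustrations, not fairseq's DynamicConv module):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dynamic_conv(x, Wq, k):
    # x: n x d input; Wq: k x d linear map that predicts the k kernel
    # taps from the current timestep's features (one head, for brevity).
    n, d = len(x), len(x[0])
    out = []
    for t in range(n):
        # kernel depends on x[t]: raw[j] = sum_c Wq[j][c] * x[t][c]
        raw = [sum(Wq[j][c] * x[t][c] for c in range(d)) for j in range(k)]
        w = softmax(raw)  # normalized over the k taps, as in lightweight conv
        row = []
        for c in range(d):
            acc = 0.0
            for j in range(k):
                ti = t - (k - 1) + j  # causal window ending at timestep t
                if 0 <= ti < n:
                    acc += w[j] * x[ti][c]
            row.append(acc)
        out.append(row)
    return out
```

Since the kernel is recomputed at every position but only looks at the current timestep, the cost stays linear in the input length, unlike self-attention's quadratic pairwise comparisons.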
Overall Structure
Results
Personal Thoughts
Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al., 2018