lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
MIT License

('full', 'axial_row', 'axial_col', 'conv_like') works, but uses more memory than just 'full' #213

Closed: afiaka87 closed this issue 3 years ago

afiaka87 commented 3 years ago

# Original post by @kobiso https://github.com/lucidrains/DALLE-pytorch/discussions/131#discussioncomment-640446

Go give them a rocket emoji!

Attention type ('full', 'axial_row', 'axial_col', 'conv_like') works

Experimental setting

Computational cost

Training log

[training log image]

Results

[results image]

Originally posted by @kobiso in https://github.com/lucidrains/DALLE-pytorch/discussions/131#discussioncomment-640446
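For context, the attention types under discussion are selected through the `attn_types` keyword of the `DALLE` constructor, with the tuple cycled across the transformer's depth. A minimal sketch, roughly following the repo README (hyperparameters here are placeholders and the exact arguments may vary between library versions):

```python
import torch
from dalle_pytorch import DiscreteVAE, DALLE

# Train (or load) a discrete VAE first; values below are placeholders.
vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,
    num_tokens = 8192,
    codebook_dim = 512,
    hidden_dim = 64
)

# Mix sparse and full attention per layer; the tuple is cycled over the depth.
dalle = DALLE(
    dim = 512,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 16,
    heads = 8,
    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')
)

text = torch.randint(0, 10000, (1, 256))
images = torch.randn(1, 3, 256, 256)

loss = dalle(text, images, return_loss = True)
loss.backward()
```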

lucidrains commented 3 years ago

@afiaka87 nice, glad to hear it is working! :)

afiaka87 commented 3 years ago

@kobiso @lucidrains @janEbert can any of you speak to why the axial and conv-like attention types seem to require so much more memory than using 'full' on its own?

My understanding was that these layers operate on smaller, more efficient attention maps, but I may need to revisit the topic (a rough sketch of the expected sizes is below).
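For intuition only (this is not the library's implementation), a back-of-envelope count of attention-matrix entries per layer shows why the sparse variants are expected to be lighter, assuming a 32x32 grid of image tokens and ignoring the text tokens:

```python
# Rough count of attention-matrix entries per layer (illustrative only).
# Assumes a 32x32 = 1024-token image grid; text tokens are ignored.
image_side = 32
n = image_side * image_side                 # 1024 image tokens

full_entries  = n * n                       # every token attends to every token
axial_entries = n * image_side              # each token attends only within its row (or column)

print(f"full attention entries : {full_entries:,}")    # 1,048,576
print(f"axial attention entries: {axial_entries:,}")   # 32,768
```

If that expectation holds, the extra memory observed above would point to overhead in the implementation rather than the attention pattern itself, which is consistent with the fix promised in the next comment.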

lucidrains commented 3 years ago

ohh I understand why, I'll get it fixed by Friday!

TheodoreGalanos commented 3 years ago

> ohh I understand why, I'll get it fixed by Friday!

Does that mean I should wait before training a new model (i.e., will there be breaking changes?), or is it safe to go ahead? :)

lucidrains commented 3 years ago

@TheodoreGalanos it won't be breaking! It'll simply make the attention more memory efficient, so train away :)

afiaka87 commented 3 years ago

Great work @lucidrains