google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0

I've added bigbird's attention to my model, but not seeing a decrease in memory #33

Open Currie32 opened 2 years ago

Currie32 commented 2 years ago

I've replaced the attention layers in Enformer with BigBird's, but the memory usage reported by tf.config.experimental.get_memory_info is still basically the same (within 1%). I'm wondering whether I also need to include code from the encoder or decoder to see a decrease in memory usage?
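
For reference, I'm measuring it roughly like this (a simplified sketch: the tiny Dense model and the input shape below are just stand-ins for the real Enformer setup):

```python
import tensorflow as tf

# Stand-in for the real model (Enformer with BigBird block_sparse attention);
# the measurement itself works the same way for any Keras model.
model = tf.keras.Sequential([tf.keras.layers.Dense(8)])
inputs = tf.random.normal([1, 1536, 96])  # dummy batch, not the real input shape

tf.config.experimental.reset_memory_stats('GPU:0')   # clear the peak counter
_ = model(inputs, training=False)                     # one forward pass
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"current: {info['current'] / 1e6:.1f} MB, peak: {info['peak'] / 1e6:.1f} MB")
```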

Thanks!

ppham27 commented 2 years ago

To clarify, you are using https://github.com/google-research/bigbird/blob/5f2a5aa7fbab23e32e0e0b41c5f0192f0c023e05/bigbird/core/attention.py#L637 with attention_type = 'block_sparse'?

What's your sequence length?

Currie32 commented 2 years ago

Correct, I'm using that class with block_sparse attention. When the sequence enters the attention layer, its length is 1536.

ppham27 commented 2 years ago

I see. Does the memory used change with sequence length?
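
For intuition, here is a rough back-of-envelope on the attention-score tensors alone, assuming the defaults of block size 64, a 3-block sliding window, 3 random blocks, and 2 global blocks (everything outside the score tensors is ignored):

```python
def attention_score_entries(seq_len, block_size=64, window_blocks=3,
                            rand_blocks=3, global_blocks=2):
    """Approximate score-tensor entries per head, per example (edge blocks ignored)."""
    full = seq_len * seq_len
    keys_per_query = (window_blocks + rand_blocks + global_blocks) * block_size
    block_sparse = seq_len * keys_per_query
    return full, block_sparse

for n in (1536, 4096, 8192):
    full, sparse = attention_score_entries(n)
    print(f"seq_len={n}: full={full:,}  block_sparse~{sparse:,}  ratio~{full / sparse:.0f}x")
```

At 1536 that's only about a 3x saving on the score tensors themselves, so if most of the model's memory sits outside the attention scores, the end-to-end difference could easily be small; the gap grows quickly at longer lengths.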

I don't suppose you are using XLA? BigBird can be as much as 30% faster with tf.function(jit_compile=True). It also produces better memory profiles, which make debugging easier.
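
Something like this (the toy model and shapes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8)])  # stand-in for the real model

@tf.function(jit_compile=True)  # compile the forward pass with XLA
def forward(x):
    return model(x, training=False)

x = tf.random.normal([1, 1536, 96])  # dummy input
_ = forward(x)  # the first call triggers XLA compilation; later calls reuse it
```

Recent TF releases also accept jit_compile=True directly in Model.compile, if the training loop goes through Keras.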

Currie32 commented 2 years ago

Yes, the memory used increases with sequence length.

I'm not using XLA, and thanks for the tip!

ppham27 commented 2 years ago

The TensorFlow profiler's memory profile tool (https://www.tensorflow.org/guide/profiler#memory_profile_tool) may also be useful. The XLA memory viewer (https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#memory_viewer) is better, but both are worth a look.
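
Capturing a profile programmatically is just a start/stop pair around a few steps (the toy model, input, and logdir below are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8)])  # stand-in model
x = tf.random.normal([1, 1536, 96])                      # dummy input

tf.profiler.experimental.start('logs/profile')           # writes to this logdir
for step in range(5):
    with tf.profiler.experimental.Trace('forward', step_num=step, _r=1):
        _ = model(x, training=False)
tf.profiler.experimental.stop()
# Then: tensorboard --logdir logs/profile -> Profile tab -> memory_profile
```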