Significant speedups - Githubissues

michaelfeil commented 4 months ago

@iliaschalkidis This might be interesting to you:

Running newer models e.g. BAAI/bge-m3 and limiting the tokens to be >= 2048 tokens reduces the memory footprint a lot. You have previously assumed otherwise, but looking at the flash-attn paper, its mostly about sequence lengths > 512 tokens.

In my case, roberta-large with 2048 ctx length:

Mem: 37000MiB -> 12810MiB.
Speed: 108s -> 89s

iliaschalkidis commented 4 months ago

Hi @michaelfeil, that's really interesting. Could you provide some logs from the developed scripts or updated versions that you have? Is this a drastic difference of the Flash Attention 2 release?

michaelfeil commented 4 months ago

I am using even non-nvidia version of flash-attention https://github.com/ROCm/flash-attention\ Let me instuct you how to reproduce it on your machine:

Swap out roberta-base to https://huggingface.co/BAAI/bge-m3-unsupervised/blob/main/config.json (8192 context length, but no modeling head). Modeling head will be created on the fly from random weights. Filter to token lengths in C4 as you would like ( 512 < x < 1024) should give between 512 and ~4k tokens. In my case I patch the tokenization max_length Run the benchmark (have some vRAM available to run the non-flash-version).

iliaschalkidis / flash-roberta

Significant speedups #2