huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Longformer, BigBird take same time to run in sparse mode as well as full-mode #18234

Closed · allohvk closed this issue 2 years ago

allohvk commented 2 years ago

System Info

Transformers: 4.20.1
Python: 3.8.12
Pretrained models & tokenizer from HF: "allenai/longformer-base-4096" and "google/bigbird-roberta-base"

Longformer: Takes the same time to train (fine-tune) a pretrained model for sliding-window sizes of 256, 512, 1024, or 2048, whereas one would expect training to be faster at smaller sliding-window sizes.

BigBird: Same problem as above. BigBird even has a simple switch to change from sparse attention to full attention, yet the training time in both modes is roughly the same, which seems to point to an issue.

Small but complete source code to simulate: https://colab.research.google.com/drive/1nm7a-qJseNSCkAB5_3QNkVSrHc8zePAV?usp=sharing
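For reference, the two "switches" mentioned above are plain config options in 🤗 Transformers. A minimal sketch, assuming the stock pretrained checkpoints (the window size and attention type shown are illustrative, not the exact Colab settings):

```python
from transformers import (
    LongformerConfig, LongformerForSequenceClassification,
    BigBirdConfig, BigBirdForSequenceClassification,
)

# Longformer: the sliding-window size is controlled by `attention_window`
# (a single value, or one value per layer).
lf_config = LongformerConfig.from_pretrained("allenai/longformer-base-4096", attention_window=256)
lf_model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", config=lf_config
)

# BigBird: `attention_type` switches between sparse ("block_sparse")
# and full quadratic ("original_full") attention.
bb_config = BigBirdConfig.from_pretrained("google/bigbird-roberta-base", attention_type="block_sparse")
bb_model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", config=bb_config
)
```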

Who can help?

@ydshieh

Information

Tasks

Reproduction

https://colab.research.google.com/drive/1nm7a-qJseNSCkAB5_3QNkVSrHc8zePAV?usp=sharing

Expected behavior

Longformer: Training (fine-tuning) a pretrained model should take different amounts of time for different sliding-window sizes of 256, 512, 1024, or 2048; at smaller sliding-window sizes the training time should be lower.

BigBird: Likewise, BigBird's simple switch between sparse attention and full attention should produce noticeably different training times; currently both modes take roughly the same time, which seems to point to an issue.

allohvk commented 2 years ago

@ydshieh It gets even weirder. Today I tried using the original Longformer implementation directly, bypassing Hugging Face.

It needed minor changes to the above code. The link is here: https://colab.research.google.com/drive/1R5uDsbl3ZmUIccZtefVNBs3CXU_vcDZd?usp=sharing

The observations continue to be perplexing:

Case 1: ATT_MODE = 'sliding_chunks'; 100% local attention, i.e. attention_mask = 1 for all tokens.
- SLIDE_WIN_SIZE = 256 (default) takes 9-10 hours to train.
- SLIDE_WIN_SIZE = 1024 also takes 9-10 hours to train.
- Observation: sparse attention with a 256-token window should not take the same fine-tuning time as a 1024-token window.

Case 2: ATT_MODE = 'sliding_chunks'; no attention, i.e. attention_mask = 0 for all tokens. SLIDE_WIN_SIZE is immaterial.
- Observation: even if no tokens attend to each other, training takes the same 9-10 hours as Case 1, which should not be the case.

Case 3: ATT_MODE = 'sliding_chunks'; 100% global attention, i.e. attention_mask = 2 for all tokens. SLIDE_WIN_SIZE is immaterial.
- Observation: with 100% global attention every token attends to every other token, and training takes 16-17 hours. This should be similar to Case 4, which is NOT the case.

Case 4: This is the most bizarre. ATT_MODE = 'n2', i.e. regular quadratic attention. Theoretically this should take the same training time as Case 3 (where all tokens are marked global).
- Observation: n2 attention takes the lowest training time of only ~2 hours, which is the exact opposite of what Longformer is supposed to do!
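For reference, a minimal sketch of how these cases map onto the standalone allenai/longformer implementation, following that repo's README pattern (the checkpoint path, window size, and mask values are illustrative, not copied from the Colab):

```python
import torch
from longformer.longformer import Longformer, LongformerConfig

config = LongformerConfig.from_pretrained('longformer-base-4096/')  # local checkpoint dir from the allenai release
config.attention_mode = 'sliding_chunks'                    # Cases 1-3; 'n2' gives regular quadratic attention (Case 4)
config.attention_window = [256] * config.num_hidden_layers  # SLIDE_WIN_SIZE, one entry per layer

model = Longformer.from_pretrained('longformer-base-4096/', config=config)

input_ids = torch.randint(0, config.vocab_size, (1, 4096))
# attention_mask semantics in the allenai repo:
#   0 = no attention, 1 = local (sliding-window) attention, 2 = global attention
attention_mask = torch.ones_like(input_ids)        # Case 1: 100% local
# attention_mask = torch.zeros_like(input_ids)     # Case 2: no attention
# attention_mask = torch.full_like(input_ids, 2)   # Case 3: 100% global

outputs = model(input_ids, attention_mask=attention_mask)
```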

Should I open a bug directly on the Longformer GitHub repo?

ydshieh commented 2 years ago

Hi @allohvk

After doing some experiments, I think we need really long sequences and large attention windows before the effect of the attention window size becomes visible. Here is the main summary, which comes from the 2 tables below:

Summary

Model sizes: Tiny / Base / Large (the columns in the tables below).

⚠️ Be careful with it/s versus s/it in the tables below; a conversion note follows the tables.

CPU (256G RAM)

| CPU | Tiny | Base | Large |
|---|---|---|---|
| max_len 2048, attn_win 512 | 19.74 it/s | 1.02 s/it | 5.92 s/it |
| max_len 2048, attn_win 1024 | 14.42 it/s | 1.25 s/it | 6.47 s/it |
| max_len 2048, attn_win 2048 | 13.25 it/s | 1.48 s/it | 6.69 s/it |
| max_len 4096, attn_win 512 | 16.55 it/s | 1.61 s/it | 10.31 s/it |
| max_len 4096, attn_win 1024 | 10.00 it/s | 2.20 s/it | 11.29 s/it |
| max_len 4096, attn_win 2048 | 4.84 it/s | 3.85 s/it | 13.47 s/it |
| max_len 4096, attn_win 4096 | 3.18 it/s | 6.15 s/it | 15.49 s/it |
| max_len 16384, attn_win 512 | 3.51 it/s | 5.61 s/it | 42.33 s/it |
| max_len 16384, attn_win 1024 | 2.03 it/s | 8.08 s/it | 48.13 s/it |
| max_len 16384, attn_win 2048 | 1.12 it/s | 12.03 s/it | 56.93 s/it |
| max_len 16384, attn_win 4096 | 1.62 s/it | 20.22 s/it | 87.87 s/it |
| max_len 16384, attn_win 8192 | 3.02 s/it | 34.67 s/it | 131.81 s/it |
| max_len 16384, attn_win 16384 | 5.00 s/it | 56.79 s/it | 187.91 s/it |

GPU (A100)

| GPU | Tiny | Base | Large |
|---|---|---|---|
| max_len 2048, attn_win 512 | 25.48 it/s | 5.15 it/s | 2.57 it/s |
| max_len 2048, attn_win 1024 | 26.33 it/s | 5.10 it/s | 2.42 it/s |
| max_len 2048, attn_win 2048 | 26.52 it/s | 5.09 it/s | 2.10 it/s |
| max_len 4096, attn_win 512 | 25.55 it/s | 5.26 it/s | 2.32 it/s |
| max_len 4096, attn_win 1024 | 25.73 it/s | 5.10 it/s | 2.01 it/s |
| max_len 4096, attn_win 2048 | 24.23 it/s | 4.63 it/s | 1.52 it/s |
| max_len 4096, attn_win 4096 | 21.30 it/s | 3.76 it/s | 1.05 it/s |
| max_len 16384, attn_win 512 | 7.39 it/s | 4.24 it/s | 1.07 it/s |
| max_len 16384, attn_win 1024 | 13.30 it/s | 3.37 it/s | 1.25 s/it |
| max_len 16384, attn_win 2048 | 20.17 it/s | 2.33 it/s | 1.88 s/it |
| max_len 16384, attn_win 4096 | 16.50 it/s | 1.44 it/s | N/A |
| max_len 16384, attn_win 8192 | 13.46 it/s | 1.21 s/it | N/A |
| max_len 16384, attn_win 16384 | 9.04 it/s | 2.16 s/it | N/A |
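One caveat when reading the tables: the units flip between it/s (iterations per second, higher is faster) and s/it (seconds per iteration, higher is slower). A small hypothetical helper to put everything on the same scale:

```python
def to_seconds_per_iter(value: float, unit: str) -> float:
    """Normalize a tqdm throughput reading to seconds per training step."""
    if unit == "it/s":
        return 1.0 / value
    if unit == "s/it":
        return value
    raise ValueError(f"unknown unit: {unit}")

# Example with two GPU/Base readings at max_len 16384 from the table above:
print(to_seconds_per_iter(4.24, "it/s"))  # attn_win 512  -> ~0.24 s per step
print(to_seconds_per_iter(1.21, "s/it"))  # attn_win 8192 ->  1.21 s per step
```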
ydshieh commented 2 years ago

For the record, here are the 2 scripts I used to measure running time (copied from yours, with some modifications):

python run.py

run.py

```python
import os
import json

def run(attention_window, steps, batch_size, max_length):
    # Launch debug.py with the given settings and capture all of its output.
    os.system("rm -rf output.txt")
    os.system(f"python debug.py {attention_window} {steps} {batch_size} {max_length} > output.txt 2>&1")

    # Return the tqdm progress line of the final training step, which carries
    # the it/s (or s/it) reading reported in the tables above.
    with open("output.txt") as fp:
        for line in fp:
            if f"{steps - 1}/{steps}" in line:
                line = line.strip()
                idx = line.find(f"{steps - 1}/{steps}")
                line = line[idx:]
                # Trim the "Initializing global attention..." log message if it ends up on the same line.
                if "Initializing global" in line:
                    idx = line.find("Initializing global")
                    line = line[:idx]
                    line = line.strip()
                return line

res = {}
steps = 10

for batch_size in [1]:
    for max_length in [2048, 4096, 16384]:
        for attention_window in [512, 1024, 2048, 4096, 8192, 16384]:
            if attention_window > max_length:
                continue
            r = run(attention_window=attention_window, steps=steps, batch_size=batch_size, max_length=max_length)
            print(f"(attn_win: {attention_window}, batch_size: {batch_size}, max_len: {max_length}) --> {r}")
            print("=" * 40)

            res[f"(attn_win: {attention_window}, batch_size: {batch_size}, max_len: {max_length})"] = r

            with open("results.json", "w") as fp:
                json.dump(res, fp, indent=4, ensure_ascii=False)

```

debug.py

```python
import sys

import torch
import datasets
import transformers
from transformers import BigBirdForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, AutoModel
from transformers.models.longformer.modeling_longformer import LongformerForSequenceClassification, LongformerConfig
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset

import logging
# logging.disable(logging.INFO)

def measure(attention_window, steps, batch_size, max_length):

    SLIDE_WIN_SIZE = attention_window

    STEPS = steps
    BATCH_SIZE = batch_size
    GRAD_ACCUMULATION_STEPS = 1
    LEN = max_length

    MODEL = 'allenai/longformer-base-4096'
    LONGFORMER = True
    CACHE_ROOT = "./"

    train_data, test_data = datasets.load_dataset('imdb', split=['train', 'test'], cache_dir=f'{CACHE_ROOT}/data')

    config = LongformerConfig.from_pretrained(MODEL, num_labels=2, return_dict=True)

    # Shrink the architecture to a small configuration so each benchmark run stays cheap.
    config.num_hidden_layers = 12
    config.hidden_size = 256
    config.num_attention_heads = 1
    config.intermediate_size = 1024

    config.attention_window = SLIDE_WIN_SIZE

    model = LongformerForSequenceClassification(config=config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL, max_length=LEN, cache_dir=f'{CACHE_ROOT}/data')

    print("DEFAULT - Sliding window width across layers", model.config.attention_window)
    model.config.attention_window = SLIDE_WIN_SIZE
    print("UPDATED - Sliding window width across layers", model.config.attention_window)

    def tokenization(batched_text):
        return tokenizer(batched_text['text'], padding = 'max_length', truncation=True, max_length = LEN)

    train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
    test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))

    train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        acc = accuracy_score(labels, preds)
        return {'accuracy': acc}

    training_args = TrainingArguments(
        output_dir=f'{CACHE_ROOT}/results',
        # num_train_epochs=1,
        per_device_train_batch_size=BATCH_SIZE,
        max_steps=STEPS,
        gradient_accumulation_steps=GRAD_ACCUMULATION_STEPS,
        warmup_steps=160,
        weight_decay=0.01,
        learning_rate=2e-5,
        fp16=False,  # True,
        dataloader_num_workers=2,
        logging_strategy="steps",
        logging_steps=1,
    )

    trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics, train_dataset=train_data)
    trainer.train()

if __name__ == "__main__":

    data = sys.argv[1:]
    print(data)
    data = [int(x) for x in data]
    measure(*data)
```
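A single configuration can also be run by hand; the positional arguments are attention_window, steps, batch_size, max_length, e.g.:

python debug.py 512 10 1 4096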
allohvk commented 2 years ago

Thank you so much @ydshieh, the observations are fascinating. One would think they would be part of the actual BigBird and Longformer papers, but they are not. The benefits of changing hyperparameters like the sliding window and global tokens only manifest at really long sequence lengths (not 2048 or even 4096, but more like 8000 or 16000). Because I was testing on a GPU at a sequence length of 2048, I could hardly see any difference. Thank you for the detailed testing and observations.
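A back-of-the-envelope view of why the benefit only shows up at long sequences: sliding-window attention costs roughly O(seq_len × window) versus O(seq_len²) for full attention, so the attention-only saving is about window / seq_len, and at short sequences that saving is small and easily swamped by the FFN layers and the less GPU-friendly sparse kernels. A tiny sketch of the theoretical ratio (attention FLOPs only; it ignores global tokens, FFN cost, and kernel efficiency):

```python
def attention_flop_ratio(seq_len: int, window: int) -> float:
    # Sliding-window attention ~ seq_len * window, full attention ~ seq_len ** 2.
    return (seq_len * window) / (seq_len ** 2)

for seq_len in (2048, 4096, 16384):
    print(f"seq_len={seq_len}, window=512 -> {attention_flop_ratio(seq_len, 512):.3f} of full attention")
# seq_len=2048, window=512 -> 0.250 of full attention
# seq_len=4096, window=512 -> 0.125 of full attention
# seq_len=16384, window=512 -> 0.031 of full attention
```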

In fact, this means there is a gap for a couple of new transformer models/papers that specifically address the 512-4096 max-sequence-length range in a non-quadratic way, such that it makes a meaningful difference in training time. Hope someone comes out with such a model soon :)

omarwbassam commented 1 year ago

No good

krstp commented 1 week ago

Great post, thanks for sharing all the insights. Would you be able to share the size of the input dataset you were using, in particular the number of rows and the length of the texts (tokens/words)?

I was trying something similar on ~1 million inputs with a large number of labels, initially with a 4096 window, and 15 hours was not enough to train even a single epoch.

kiranvholla commented 1 day ago

> Great post, thanks for sharing all the insights. Would you be able to share the size of the input dataset you were using, in particular the number of rows and the length of the texts (tokens/words)?
>
> I was trying something similar on ~1 million inputs with a large number of labels, initially with a 4096 window, and 15 hours was not enough to train even a single epoch.

Apologies for the delayed response; this is an almost 2-year-old thread. I believe the dataset was from the Kaggle AI4Code challenge: https://www.kaggle.com/competitions/AI4Code. You can look at my write-up for details: https://www.kaggle.com/competitions/AI4Code/discussion/367140