@ydshieh It gets a bit weirder. Today I tried to use the Longformer directly, bypassing Hugging Face. It needed minor changes to the above code. The link is here: https://colab.research.google.com/drive/1R5uDsbl3ZmUIccZtefVNBs3CXU_vcDZd?usp=sharing
The observations continue to be perplexing:

CASE 1: ATT_MODE = 'sliding_chunks'; 100% local attention, i.e. attention_mask = 1 for all tokens. SLIDE_WIN_SIZE = 256 (default) takes 9-10 hours to train; SLIDE_WIN_SIZE = 1024 also takes 9-10 hours to train. Observation: sparse attention with a 256-token window should not take the same fine-tuning time as a 1024-token window.

CASE 2: ATT_MODE = 'sliding_chunks'; no attention, i.e. attention_mask = 0 for all tokens. SLIDE_WIN_SIZE is immaterial. Observation: even when no tokens attend to each other, training takes the same 9-10 hours as Case 1, which should not be the case.

CASE 3: ATT_MODE = 'sliding_chunks'; 100% global attention, i.e. attention_mask = 2 for all tokens. SLIDE_WIN_SIZE is immaterial. Observation: with 100% global attention, every token attends to every other token, and training takes 16-17 hours. This training time should be similar to Case 4, which is NOT the case.

CASE 4: This is the most bizarre. ATT_MODE = 'n2'. We can simply set the attention mode to 'n2', which is regular quadratic attention. Theoretically this should take the same training time as Case 3 (where all tokens are marked global). Observation: n2 attention takes the lowest training time of only about 2 hours, which is the exact opposite of what Longformer is supposed to do!
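For reference, a minimal sketch of how I understand those four cases are expressed with the standalone allenai/longformer conventions (attention_mask: 0 = not attended, 1 = local sliding-window attention, 2 = global attention; attention_mode switches between the sparse kernel and regular quadratic attention). The names here are illustrative, not taken from the notebook:

```python
import torch

seq_len = 4096
# Dummy batch just to illustrate the mask shapes; real inputs come from the tokenizer.
input_ids = torch.randint(0, 50000, (1, seq_len))

# Case 1: 100% local attention -> every position gets mask value 1
mask_local = torch.ones(1, seq_len, dtype=torch.long)

# Case 2: no attention at all -> every position gets mask value 0
mask_none = torch.zeros(1, seq_len, dtype=torch.long)

# Case 3: 100% global attention -> every position gets mask value 2
mask_global = torch.full((1, seq_len), 2, dtype=torch.long)

# Case 4: switch the whole model to regular quadratic attention via the config,
# e.g. on a hypothetical `config` object:
#   config.attention_mode = 'n2'              # full quadratic attention
#   config.attention_mode = 'sliding_chunks'  # the sparse sliding-window kernel
```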
Should I open a bug directly on the Longformer GitHub?
Hi @allohvk
After doing some experiments, I think we need really long sequences (and large attention windows) before the effect of the attention window size becomes visible. Here is the main summary from the 2 tables below: the speed gap between different window sizes only becomes significant as max_len gets larger. ⚠️ Be careful with the mix of it/s and s/it below.
| CPU | Tiny | Base | Large |
|---|---|---|---|
| max_len 2048, attn_win 512 | 19.74 it/s | 1.02 s/it | 5.92 s/it |
| max_len 2048, attn_win 1024 | 14.42 it/s | 1.25 s/it | 6.47 s/it |
| max_len 2048, attn_win 2048 | 13.25 it/s | 1.48 s/it | 6.69 s/it |
| max_len 4096, attn_win 512 | 16.55 it/s | 1.61 s/it | 10.31 s/it |
| max_len 4096, attn_win 1024 | 10.00 it/s | 2.20 s/it | 11.29 s/it |
| max_len 4096, attn_win 2048 | 4.84 it/s | 3.85 s/it | 13.47 s/it |
| max_len 4096, attn_win 4096 | 3.18 it/s | 6.15 s/it | 15.49 s/it |
| max_len 16384, attn_win 512 | 3.51 it/s | 5.61 s/it | 42.33 s/it |
| max_len 16384, attn_win 1024 | 2.03 it/s | 8.08 s/it | 48.13 s/it |
| max_len 16384, attn_win 2048 | 1.12 it/s | 12.03 s/it | 56.93 s/it |
| max_len 16384, attn_win 4096 | 1.62 s/it | 20.22 s/it | 87.87 s/it |
| max_len 16384, attn_win 8192 | 3.02 s/it | 34.67 s/it | 131.81 s/it |
| max_len 16384, attn_win 16384 | 5.00 s/it | 56.79 s/it | 187.91 s/it |
| GPU | Tiny | Base | Large |
|---|---|---|---|
| max_len 2048, attn_win 512 | 25.48 it/s | 5.15 it/s | 2.57 it/s |
| max_len 2048, attn_win 1024 | 26.33 it/s | 5.10 it/s | 2.42 it/s |
| max_len 2048, attn_win 2048 | 26.52 it/s | 5.09 it/s | 2.10 it/s |
| max_len 4096, attn_win 512 | 25.55 it/s | 5.26 it/s | 2.32 it/s |
| max_len 4096, attn_win 1024 | 25.73 it/s | 5.10 it/s | 2.01 it/s |
| max_len 4096, attn_win 2048 | 24.23 it/s | 4.63 it/s | 1.52 it/s |
| max_len 4096, attn_win 4096 | 21.30 it/s | 3.76 it/s | 1.05 it/s |
| max_len 16384, attn_win 512 | 7.39 it/s | 4.24 it/s | 1.07 it/s |
| max_len 16384, attn_win 1024 | 13.30 it/s | 3.37 it/s | 1.25 s/it |
| max_len 16384, attn_win 2048 | 20.17 it/s | 2.33 it/s | 1.88 s/it |
| max_len 16384, attn_win 4096 | 16.50 it/s | 1.44 it/s | N/A |
| max_len 16384, attn_win 8192 | 13.46 it/s | 1.21 s/it | N/A |
| max_len 16384, attn_win 16384 | 9.04 it/s | 2.16 s/it | N/A |
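A rough way to see why the window size only matters at long sequences (a back-of-envelope sketch, ignoring global tokens, the classification head, and all fixed overheads): local attention scores roughly max_len × attn_win token pairs, while full attention scores max_len², so the relative saving is just attn_win / max_len.

```python
# Back-of-envelope only: relative cost of sliding-window attention vs. full
# attention, ignoring global tokens and every constant overhead.
def relative_cost(max_len: int, attn_win: int) -> float:
    local = max_len * attn_win   # ~ pairs scored by a sliding window
    full = max_len ** 2          # ~ pairs scored by full (n^2) attention
    return local / full

for max_len in (2048, 4096, 16384):
    for attn_win in (512, 1024, 2048):
        if attn_win <= max_len:
            print(f"max_len={max_len:5d} attn_win={attn_win:4d} -> {relative_cost(max_len, attn_win):.3f}")

# At max_len=2048 a 512 window is already 25% of full attention, so the saving
# is easily swamped by everything else; at max_len=16384 it drops to ~3%.
```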
For the record, here are the 2 scripts I used to measure the running times (copied from yours, with modifications).

`run.py` (the sweep driver, launched with `python run.py`):
```python
import os
import json


def run(attention_window, steps, batch_size, max_length):
    # Launch one training run (debug.py) and capture its console output.
    os.system("rm -rf output.txt")
    os.system(f"python debug.py {attention_window} {steps} {batch_size} {max_length} > output.txt 2>&1")

    # Grab the tqdm progress line for the last step, which contains the it/s or s/it figure.
    with open("output.txt") as fp:
        for line in fp:
            if f"{steps - 1}/{steps}" in line:
                line = line.strip()
                idx = line.find(f"{steps - 1}/{steps}")
                line = line[idx:]
                if "Initializing global" in line:
                    idx = line.find("Initializing global")
                    line = line[:idx]
                line = line.strip()
                return line


res = {}
steps = 10

for batch_size in [1]:
    for max_length in [2048, 4096, 16384]:
        for attention_window in [512, 1024, 2048, 4096, 8192, 16384]:
            if attention_window > max_length:
                continue
            r = run(attention_window=attention_window, steps=steps, batch_size=batch_size, max_length=max_length)
            print(f"(attn_win: {attention_window}, batch_size: {batch_size}, max_len: {max_length}) --> {r}")
            print("=" * 40)
            res[f"(attn_win: {attention_window}, batch_size: {batch_size}, max_len: {max_length})"] = r

with open("results.json", "w") as fp:
    json.dump(res, fp, indent=4, ensure_ascii=False)
```
And the measurement script itself (saved as `debug.py`, since `run.py` shells out to it):

```python
import sys
import torch
import datasets
import transformers
from transformers import BigBirdForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, AutoModel
from transformers.models.longformer.modeling_longformer import LongformerForSequenceClassification, LongformerConfig
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset
import logging

# logging.disable(logging.INFO)


def measure(attention_window, steps, batch_size, max_length):
    SLIDE_WIN_SIZE = attention_window
    STEPS = steps
    BATCH_SIZE = batch_size
    GRAD_ACCUMULATION_STEPS = 1
    LEN = max_length
    MODEL = 'allenai/longformer-base-4096'
    LONGFORMER = True
    CACHE_ROOT = "./"

    train_data, test_data = datasets.load_dataset('imdb', split=['train', 'test'], cache_dir=f'{CACHE_ROOT}/data')

    # A deliberately small ("tiny") Longformer so the timing sweep finishes quickly.
    config = LongformerConfig.from_pretrained(MODEL, num_labels=2, return_dict=True)
    config.num_hidden_layers = 12
    config.hidden_size = 256
    config.num_attention_heads = 1
    config.intermediate_size = 1024
    config.attention_window = SLIDE_WIN_SIZE

    model = LongformerForSequenceClassification(config=config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL, max_length=LEN, cache_dir=f'{CACHE_ROOT}/data')

    print("DEFAULT - Sliding window width across layers", model.config.attention_window)
    model.config.attention_window = SLIDE_WIN_SIZE
    print("UPDATED - Sliding window width across layers", model.config.attention_window)

    def tokenization(batched_text):
        return tokenizer(batched_text['text'], padding='max_length', truncation=True, max_length=LEN)

    train_data = train_data.map(tokenization, batched=True, batch_size=len(train_data))
    test_data = test_data.map(tokenization, batched=True, batch_size=len(test_data))

    train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        acc = accuracy_score(labels, preds)
        return {'accuracy': acc}

    training_args = TrainingArguments(
        output_dir=f'{CACHE_ROOT}/results',
        # num_train_epochs=1,
        per_device_train_batch_size=BATCH_SIZE,
        max_steps=STEPS,
        gradient_accumulation_steps=GRAD_ACCUMULATION_STEPS,
        warmup_steps=160,
        weight_decay=0.01,
        learning_rate=2e-5,
        fp16=False,  # True,
        dataloader_num_workers=2,
        logging_strategy="steps",
        logging_steps=1,
    )

    trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics, train_dataset=train_data)
    trainer.train()


if __name__ == "__main__":
    data = sys.argv[1:]
    print(data)
    data = [int(x) for x in data]
    measure(*data)
```
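For completeness, `run.py` passes the four values as positional integers; an equivalent single-configuration call (assuming the second script is saved as `debug.py` next to it) would be:

```python
# Single configuration, equivalent to one iteration of the sweep in run.py:
# positional args are attention_window, steps, batch_size, max_length.
import subprocess

subprocess.run(["python", "debug.py", "512", "10", "1", "4096"], check=True)
```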
Thank you so much @ydshieh. The observations are fascinating. One would think they would be part of the actual BigBird and Longformer papers, but they are not. The benefits of changing hyperparameters like the sliding window, global tokens, etc. only manifest at really long sequence lengths (not 2048 or even 4096, but more like 8000 or 16000). Because I was testing on a GPU at a sequence length of 2048, I could hardly see any difference. Thank you for your detailed testing and observations.
In fact, this means there is a gap to squeeze in a couple of new transformer models/papers that specifically address the 512-4096 max sequence length range in a non-quadratic way, such that it makes a meaningful difference in training time. Hope someone comes out with a new model soon :)
Great post, thanks for sharing all the insights. Would you be able to share the size of the input dataset you were using, in particular the number of rows and the size of the texts (tokens/words)?
I was trying something similar on ~1M inputs with a large number of labels, initially with a 4096 window, and 15 hours did not suffice to train even a single epoch.
Apologies for the delayed response; this is an almost two-year-old thread. I believe the dataset was from the Kaggle AI4Code challenge here - https://www.kaggle.com/competitions/AI4Code. You can look at my write-up for details: https://www.kaggle.com/competitions/AI4Code/discussion/367140
System Info
Transformers: 4.20.1
Python: 3.8.12
Pretrained models & tokenizer from HF: "allenai/longformer-base-4096" and "google/bigbird-roberta-base"
Longformer: takes the same time to train (fine-tune) a pretrained model for different sliding window sizes of 256, 512, 1024 or 2048. One would expect the training time to be lower at smaller sliding window sizes.
BigBird: same problem as above. In fact, BigBird has a simple switch to change from sparse attention to full attention. The training time in both cases is roughly the same, which seems to point to an issue.
Small but complete source code to simulate: https://colab.research.google.com/drive/1nm7a-qJseNSCkAB5_3QNkVSrHc8zePAV?usp=sharing
Who can help?
@ydshieh
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
https://colab.research.google.com/drive/1nm7a-qJseNSCkAB5_3QNkVSrHc8zePAV?usp=sharing
Expected behavior
Longformer: take different amounts of time to train (fine-tune) a pretrained model for different sliding window sizes of 256, 512, 1024 or 2048. One would expect the training time to be lower at smaller sliding window sizes.
BigBird: same expectation as above. In fact, BigBird has a simple switch to change from sparse attention to full attention; the training times in the two modes should differ, yet they are roughly the same, which seems to point to an issue.
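For what it's worth, a minimal sketch of the BigBird "switch" referred to above, using the `attention_type` config option (`block_sparse` vs `original_full`); the model name and sequence length here are just illustrative:

```python
import torch
from transformers import BigBirdConfig, BigBirdForSequenceClassification

# Two configs that differ only in the attention implementation.
config_sparse = BigBirdConfig.from_pretrained(
    "google/bigbird-roberta-base", attention_type="block_sparse", num_labels=2
)
config_full = BigBirdConfig.from_pretrained(
    "google/bigbird-roberta-base", attention_type="original_full", num_labels=2
)

model_sparse = BigBirdForSequenceClassification(config_sparse)
model_full = BigBirdForSequenceClassification(config_full)

# The expectation behind this issue: as the sequence length grows, a pass with
# block_sparse attention should become noticeably cheaper than original_full.
dummy_ids = torch.randint(0, 50000, (1, 4096))
with torch.no_grad():
    _ = model_sparse(dummy_ids)
    _ = model_full(dummy_ids)
```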