pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Encountering out-of-memory errors despite using modest model and batch sizes. #6948

Open seanswyi opened 6 months ago

seanswyi commented 6 months ago

❓ Questions and Help

I'm trying to run a simple text classification task using HuggingFace Transformers and BERT. My background is in NLP, but I wanted to run a simple tutorial to get used to working with TPUs rather than GPUs. The tutorial is this: Fine-tune a pretrained model.

The code fits in a single script:

import evaluate
import torch
import torch_xla.core.xla_model as xm
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm, trange
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_scheduler,
)

def main():
    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["text"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")

    train_dataset = tokenized_datasets["train"]
    test_dataset = tokenized_datasets["test"]

    train_dataloader = DataLoader(
        dataset=train_dataset,
        batch_size=6,
        shuffle=True,
    )
    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=6,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )
    optimizer = AdamW(model.parameters(), lr=5e-5)

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    device = xm.xla_device()

    model = model.to(device)

    epoch_pbar = trange(
        num_epochs,
        desc="Epochs",
        total=num_epochs,
    )
    for epoch in epoch_pbar:
        model.train()

        train_pbar = tqdm(
            iterable=train_dataloader,
            desc="Training",
            total=len(train_dataloader),
        )
        for batch in train_pbar:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()

            xm.optimizer_step(optimizer)
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()

        metric = evaluate.load("accuracy")
        eval_pbar = tqdm(
            iterable=test_dataloader,
            desc="Evaluating",
            total=len(test_dataloader),
        )
        for batch in eval_pbar:
            batch = {k: v.to(device) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            predictions = predictions.detach().cpu().numpy()
            metric.add_batch(predictions=predictions, references=batch["labels"])

        metric.compute()

if __name__ == "__main__":
    main()

The main part of the error message looks like this:

/home/johndoe_gmail_com/.cache/pypoetry/virtualenvs/xla-test-fbDoqoif-py3.8/lib/python3.8/site-packages/torch/autograd/__init__.py:266: UserWarning: aten::reshape: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
2024-04-17 14:58:26.214042: F ./torch_xla/csrc/runtime/debug_macros.h:20] Non-OK-status: status.status() status: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 31.37G of 15.48G hbm. Exceeded hbm capacity by 15.88G.

Total hbm usage >= 31.88G:
    reserved        530.00M
    program          31.37G
    arguments            0B

Output size 0B; shares 0B with arguments.

Program hbm requirement 31.37G:
    scoped           10.44M
    HLO temp         31.36G (99.9% utilization: Unpadded (31.26G) Padded (31.28G), 0.2% fragmentation (74.36M))

What I don't understand is that bert-base-cased with a batch size of 6 usually doesn't even take up 10GB of memory on a GPU. Am I doing something wrong when adapting my code for TPU usage?
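
For reference, one way to inspect HBM usage directly from the script (a minimal sketch; xm.get_memory_info is assumed to be available in this torch_xla version, and the exact fields it returns vary between releases):

import torch_xla.core.xla_model as xm

device = xm.xla_device()
# Force any pending lazy operations to execute so the numbers are up to date.
xm.mark_step()
# Print the device memory statistics reported by the runtime, which can be
# compared against the ~15.48G of HBM shown in the error above.
print(xm.get_memory_info(device))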

seanswyi commented 6 months ago

After doing some research, it seems that xm.optimizer_step(optimizer) is only meant for multi-device settings; if I only want to use one device (as I'm doing now), then I have to use xm.mark_step().
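
For reference, a minimal sketch of the adjusted single-device loop (based on the training loop in the original script; the only change is calling the plain optimizer.step() followed by xm.mark_step() instead of xm.optimizer_step):

for batch in train_pbar:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()    # plain optimizer step on a single XLA device
    lr_scheduler.step()
    optimizer.zero_grad()
    xm.mark_step()      # cut and execute the lazily traced graph for this step

Without that per-step barrier, the lazy tensors could keep accumulating into one ever-growing graph, which would be consistent with the large "HLO temp" number in the error above.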

I'm still curious why there's such a huge difference in memory usage, though.

JackCaoG commented 6 months ago

Can you follow https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#pytorchxla--dynamo-debugging-tool to do a quick debug run with PT_XLA_DEBUG=1? What we expect is that the HLO captures only a single step of your training loop. If you have tried adding mark_step after optimizer.step but still see this error, and PT_XLA_DEBUG=1 is not too helpful, you can try to dump the IR or HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations and share it with us.
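
For reference, a minimal sketch of how those debugging variables might be set (the variable names come from the troubleshooting doc linked above; the dump path is hypothetical, and they need to be set before torch_xla is imported):

import os

# Set the debugging variables before torch_xla is imported (hypothetical dump path).
os.environ["PT_XLA_DEBUG"] = "1"                           # print the compilation/execution analysis
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_dump.txt"  # dump the lazy-tensor IR graphs to this file
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"                 # dump HLO instead of the default text IR

import torch_xla.core.xla_model as xm  # import after the variables are set

Equivalently, exporting them in the shell before launching the script (e.g. PT_XLA_DEBUG=1 python script.py) avoids touching the script at all.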