huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Model is not compiled when using `torch_compile=True` on a machine with multiple GPUs #25152

Closed Β· eawer closed this 11 months ago

eawer commented 1 year ago

System Info

Who can help?

@sgugger

Information

Tasks

Reproduction

Run this code:

import torch
import evaluate
import numpy as np
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

def preprocess_function(examples):
  return tokenizer(examples["text"], truncation=True, padding=True, return_tensors='pt').to(device="cuda:0")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, model_max_length=512)

dataset = load_dataset('banking77', split=['train[:2048]', 'test[:512]'])
dataset = DatasetDict({'train': dataset[0], 'test': dataset[1]})
dataset = dataset.map(preprocess_function, batched=True)

labels = dataset["train"].features["label"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

metric = evaluate.load("f1")

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
)
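# torch_compile=True below is the setting under investigation; when set, the Trainer is
# expected to wrap the model with torch.compile before training.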

training_args = TrainingArguments(
    output_dir="./temp",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=5e-5,
    num_train_epochs=3,
    torch_compile=True,
    optim="adamw_torch_fused",
    logging_steps=1,
    logging_strategy="steps",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Expected behavior

This code runs as expected on a machine with a single GPU: the model is compiled (the logs report that layers are being optimized), and training speeds up significantly (not for this small example model/data combination, but for the production one). Compilation-related output:

[2023-07-27 16:50:43,003] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
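For context, a minimal sketch of what torch_compile=True roughly amounts to (the Trainer wraps the model with torch.compile before training); the toy model and input below are stand-ins, assuming torch >= 2.0 and a CUDA device:

```python
import torch
import torch.nn as nn

# Stand-in model, just to show where compilation kicks in.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 77)).cuda()

# Roughly what torch_compile=True does: wrap the model once, then train with the wrapped module.
compiled_model = torch.compile(model)  # default backend is "inductor"

x = torch.randn(8, 768, device="cuda")
out = compiled_model(x)  # the first call triggers compilation; inductor log output shows up here
```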

But if I run the very same code on a machine with multiple GPUs, there are no signs of model compilation (no additional output in the logs) and the training speed does not improve (a quick check of how many GPUs the Trainer sees is sketched after the table below). nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   30C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   33C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   30C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   31C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
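As a quick sanity check (a sketch, using the training_args object from the script above), the n_gpu property of TrainingArguments reports how many devices the Trainer will pick up when the script is started with plain python:

```python
# 4 on this machine, 1 when launched with CUDA_VISIBLE_DEVICES=0
print(training_args.n_gpu)
```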
sgugger commented 1 year ago

cc @muellerzr

muellerzr commented 1 year ago

When running on my machine from main I do get the torch._inductor warning, meaning that compilation is happening (and verified by looking at accelerate). I'm running on two T4s so I may not see the direct speed impact we'd expect, but I got 45s with compilation and 15s without. @sgugger any thoughts on why it might not be faster?

sgugger commented 1 year ago

I haven't tried torch.compile on multiple GPUs, as it wasn't ready when I was first experimenting.

ghost commented 1 year ago

I gave it another try, and torch_compile=True actually gives a minor additional speedup (~10%), but there are still no signs of model compilation in the logs.

Logs:

```bash
(2.0.1) root@pytorch-2-0-0-gpu-p-ml-g5-12xlarge-de3ad04ae65352b8044617ab4259:~# python test_comppiled.py
/root/.cache/huggingface/modules/datasets_modules/datasets/banking77/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4/banking77.py:59: FutureWarning: Dataset 'banking77' is deprecated and will be deleted. Use 'PolyAI/banking77' instead.
  warnings.warn(
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/50 [00:00
```

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   29C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   28C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   29C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   29C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
muellerzr commented 1 year ago

@eugene-kostrov can you report your versions of transformers and accelerate? Again, when I was running this I could see the logs when building both from github main :)

ghost commented 1 year ago

@muellerzr accelerate==0.21.0

I tried both of the following transformers versions; the logs were the same:
transformers==4.31.0
transformers==4.32.0.dev2
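A small sketch to double-check which versions are actually imported at runtime (helpful when a pip install and a git install coexist in the same environment):

```python
import accelerate
import torch
import transformers

# Print the versions the interpreter actually resolves.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("torch:", torch.__version__)
```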

muellerzr commented 1 year ago

Can you try installing via:

pip install git+https://github.com/huggingface/accelerate git+https://github.com/huggingface/transformers

Thanks @eawer!

muellerzr commented 1 year ago

Interesting, as I definitely see the logs here.

accelerate launch test.py
/home/zach_mueller_huggingface_co/.cache/huggingface/modules/datasets_modules/datasets/banking77/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4/banking77.py:59: FutureWarning: Dataset 'banking77' is deprecated and will be deleted. Use 'PolyAI/banking77' instead.
  warnings.warn(
/home/zach_mueller_huggingface_co/.cache/huggingface/modules/datasets_modules/datasets/banking77/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4/banking77.py:59: FutureWarning: Dataset 'banking77' is deprecated and will be deleted. Use 'PolyAI/banking77' instead.
  warnings.warn(
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
  0%|                                                                                                                                          | 0/24 [00:00<?, ?it/s]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[2023-08-16 18:49:33,566] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:33,587] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:34,954] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:34,993] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:36,700] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:36,731] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:38,004] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:38,031] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:39,525] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:39,529] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:40,767] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:40,777] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:42,030] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:42,071] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:43,559] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:43,603] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:44,823] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:44,878] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:46,085] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:46,171] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:47,346] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:47,440] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:48,851] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:48,952] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,139] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,254] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,801] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,932] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 4.3287, 'learning_rate': 4.791666666666667e-05, 'epoch': 0.12}                                                                                               
{'loss': 4.2453, 'learning_rate': 4.5833333333333334e-05, 'epoch': 0.25}                                                                                              
{'loss': 4.1773, 'learning_rate': 4.375e-05, 'epoch': 0.38}                                                                                                           
{'loss': 4.0474, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.5}                                                                                                
{'loss': 3.9611, 'learning_rate': 3.958333333333333e-05, 'epoch': 0.62}                                                                                               
{'loss': 3.9228, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.75}                                                                                              
{'loss': 3.8479, 'learning_rate': 3.541666666666667e-05, 'epoch': 0.88}                                                                                               
{'loss': 3.7447, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}                                                                                               
{'eval_loss': 3.690765380859375, 'eval_f1': 0.26359295290537466, 'eval_runtime': 0.3893, 'eval_samples_per_second': 1315.035, 'eval_steps_per_second': 5.137, 'epoch': 1.0}                                                                                                                                                                 
{'loss': 3.6982, 'learning_rate': 3.125e-05, 'epoch': 1.12}                                                                                                           
{'loss': 3.6525, 'learning_rate': 2.916666666666667e-05, 'epoch': 1.25}                                                                                               
{'loss': 3.5546, 'learning_rate': 2.7083333333333332e-05, 'epoch': 1.38}                                                                                              
{'loss': 3.5015, 'learning_rate': 2.5e-05, 'epoch': 1.5}                                                                                                              
{'loss': 3.4782, 'learning_rate': 2.2916666666666667e-05, 'epoch': 1.62}                                                                                              
{'loss': 3.4152, 'learning_rate': 2.0833333333333336e-05, 'epoch': 1.75}                                                                                              
{'loss': 3.3385, 'learning_rate': 1.8750000000000002e-05, 'epoch': 1.88}                                                                                              
{'loss': 3.3378, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}                                                                                               
{'eval_loss': 3.2540321350097656, 'eval_f1': 0.49614175520769455, 'eval_runtime': 0.2964, 'eval_samples_per_second': 1727.241, 'eval_steps_per_second': 6.747, 'epoch': 2.0}                                                                                                                                                                
{'loss': 3.2948, 'learning_rate': 1.4583333333333335e-05, 'epoch': 2.12}                                                                                              
{'loss': 3.2471, 'learning_rate': 1.25e-05, 'epoch': 2.25}                                                                                                            
{'loss': 3.2197, 'learning_rate': 1.0416666666666668e-05, 'epoch': 2.38}                                                                                              
{'loss': 3.1782, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.5}                                                                                                
{'loss': 3.1959, 'learning_rate': 6.25e-06, 'epoch': 2.62}                                                                                                            
{'loss': 3.1684, 'learning_rate': 4.166666666666667e-06, 'epoch': 2.75}                                                                                               
{'loss': 3.1546, 'learning_rate': 2.0833333333333334e-06, 'epoch': 2.88}                                                                                              
{'loss': 3.1194, 'learning_rate': 0.0, 'epoch': 3.0}                                                                                                                  
{'eval_loss': 3.0891151428222656, 'eval_f1': 0.604639735844818, 'eval_runtime': 0.2952, 'eval_samples_per_second': 1734.197, 'eval_steps_per_second': 6.774, 'epoch': 3.0}                                                                                                                                                                  
{'train_runtime': 43.3381, 'train_samples_per_second': 141.769, 'train_steps_per_second': 0.554, 'train_loss': 3.576234668493271, 'epoch': 3.0}                       
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 24/24 [00:43<00:00,  1.81s/it]

Using the PyPI versions of accelerate and transformers on torch 2.0.1.

eawer commented 1 year ago

@muellerzr sure

(test-comp) root@pytorch-2-0-0-gpu-p-ml-g5-12xlarge-abc:~# pip freeze | grep "transformers\|torch\|accelerate"
accelerate @ git+https://github.com/huggingface/accelerate@d087be01566477d99b660526adb7da4ec31abf1d
torch==2.0.1
transformers @ git+https://github.com/huggingface/transformers@1982dd3b15867c46e1c20645901b0de469fd935f

Here are the results of this command for a single GPU (compilation works, ~42k lines of output):

CUDA_VISIBLE_DEVICES=0 TRANSFORMERS_VERBOSITY=debug ACCELERATE_VERBOCITY=debug TORCH_COMPILE_DEBUG=1 TORCH_LOGS=dynamo,inductor,guards python test_comppiled.py 2>&1 | tee visible_devices_0.txt

visible_devices_0.txt

And here are the results for 4 GPUs (compilation does not happen, ~400 lines of output):

CUDA_VISIBLE_DEVICES=0,1,2,3 TRANSFORMERS_VERBOSITY=debug ACCELERATE_VERBOCITY=debug TORCH_COMPILE_DEBUG=1 TORCH_LOGS=dynamo,inductor,guards python test_comppiled.py 2>&1 | tee visible_devices_0123.txt

visible_devices_0123.txt

muellerzr commented 1 year ago

@eawer the issue here is that the Trainer doesn't support model parallelism with torch compile yet. If you use DDP instead (for example by launching with accelerate launch), it will run and log exactly as we expect. cc @SunMarc
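For reference, a minimal sketch of such a DDP launch (the script name and process count are placeholders for this setup):

```bash
# One process per GPU (DDP); torch.compile is then applied inside each process.
accelerate launch --multi_gpu --num_processes 4 test_compiled.py
# or, equivalently, with torchrun:
torchrun --nproc_per_node 4 test_compiled.py
```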

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.