huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer does not call torch.compile when torch_compile=True in TrainingArguments #34656

Open · singularity-s0 opened this issue 2 weeks ago

singularity-s0 commented 2 weeks ago

System Info

Who can help?

@muellerzr @SunMa

Information

Tasks

Reproduction

Using the following test script:

from transformers import LlamaConfig, LlamaForCausalLM, Trainer, TrainingArguments
import torch
import tempfile
import logging

device = "cuda" if torch.cuda.is_available() else "cpu"

# Minimal map-style dataset that returns the same sample for every index.
class RepeatDataset:
    def __init__(self, x, length=64):
        self.x = x
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        return {"input_ids": self.x, "labels": self.x}

config = LlamaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=3, num_attention_heads=4)

x = torch.randint(0, 100, (128,))
train_dataset = RepeatDataset(x)

def test_torch_compile_hf_trainer():
    tiny_llama = LlamaForCausalLM(config).to(device)
    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(
            tmp_dir,
            per_device_train_batch_size=2,
            torch_compile=True,
            max_steps=1,  # compile happens on the first step
        )
        trainer = Trainer(model=tiny_llama, args=args, train_dataset=train_dataset)  # noqa
        trainer.train()

def test_torch_compile():
    tiny_llama = LlamaForCausalLM(config).to(device)
    tiny_llama = torch.compile(tiny_llama, mode="max-autotune")

    input_ids = train_dataset[0]["input_ids"].unsqueeze(0).to(device)
    tiny_llama(input_ids=input_ids)  # the first forward pass triggers compilation

torch._logging.set_logs(dynamo=logging.INFO)  # enable TorchDynamo logging

# Run one test at a time and compare the dynamo logs:
# test_torch_compile()
# test_torch_compile_hf_trainer()
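
The same logging can also be enabled without modifying the script via the TORCH_LOGS environment variable, which torch reads at import time (my understanding is that the unprefixed component name maps to INFO level):

import os
os.environ["TORCH_LOGS"] = "dynamo"  # must be set before torch is imported
import torch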

When running test_torch_compile(), there are many log lines showing TorchDynamo's compilation process, and at the end a summary like this:

I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] TorchDynamo compilation metrics:
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] Function                                  Runtimes (s)
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] --------------------------------------  --------------
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] _compile.compile_inner                         69.1241
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] OutputGraph.call_user_compiler                 68.5498
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] create_aot_dispatcher_function                 68.5111
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] compile_fx.<locals>.fw_compiler_base           67.0193
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] compile_fx_inner                               67.019
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] GraphLowering.run                              44.2316
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] GraphLowering.compile_to_module                18.0673
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] Scheduler.__init__                             14.038
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] CachingAutotuner.benchmark_all_configs          1.3724
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] Scheduler.codegen                               0.306
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] WrapperCodeGen.generate                         0.0055
I1108 10:10:27.721000 1968843 site-packages/torch/_dynamo/utils.py:399] cudagraphify                                    0.0001

When running test_torch_compile_hf_trainer(), however, there are no TorchDynamo logs at all, and the final summary is also empty:

I1108 10:08:54.393000 1968523 site-packages/torch/_dynamo/utils.py:399] TorchDynamo compilation metrics:
I1108 10:08:54.393000 1968523 site-packages/torch/_dynamo/utils.py:399] Function    Runtimes (s)
I1108 10:08:54.393000 1968523 site-packages/torch/_dynamo/utils.py:399] ----------  --------------

This indicates that the model is not being compiled at all.
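
As an extra sanity check, torch._dynamo keeps compilation counters that should be non-empty after a successful compile; a minimal sketch (assuming PyTorch 2.x):

from torch._dynamo.utils import counters

# A non-empty "frames" counter after training means TorchDynamo actually
# compiled at least one frame; it stays empty when nothing was compiled.
print(dict(counters["frames"]))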

Expected behavior

Setting torch_compile=True in TrainingArguments should make Trainer compile the model properly.

singularity-s0 commented 2 weeks ago

This seems to be related to multiple GPUs: the issue does not occur when only one GPU is visible via CUDA_VISIBLE_DEVICES.
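
For anyone reproducing this, the single-GPU case can be forced at the top of the script, before CUDA is initialized (a minimal sketch; the device index is arbitrary):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before the first CUDA call
import torch
assert torch.cuda.device_count() == 1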

LysandreJik commented 1 week ago

This seems like a potential issue with the logs rather than with torch.compile itself.

cc @MekkCyber, if you have the bandwidth, could you take a look at this?

MekkCyber commented 4 days ago

Hi @singularity-s0, in a multi-GPU setting you need to launch the script with accelerate launch script.py.
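
If you'd rather stay in plain Python than use the CLI, Accelerate's notebook_launcher can spawn the same worker processes; a minimal sketch, assuming two visible GPUs:

from accelerate import notebook_launcher

# Spawns num_processes workers, mirroring what accelerate launch does.
notebook_launcher(test_torch_compile_hf_trainer, num_processes=2)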

singularity-s0 commented 3 days ago

OK, thanks. Would you be kind enough to point out where torch.compile is called in the code, so that I can better analyze the logic?
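
For reference, my working assumption is that torch_compile=True should behave roughly like compiling the model manually before passing it to Trainer; a sketch of that baseline, reusing config, train_dataset, and device from the script above (the output directory name is arbitrary, and the actual call site inside Trainer/Accelerate may differ across versions):

# Manual-compile baseline, assuming the default "inductor" backend.
model = torch.compile(LlamaForCausalLM(config).to(device), backend="inductor")
args = TrainingArguments("manual-compile-out", per_device_train_batch_size=2, max_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()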