huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

trainer.evaluate infinite loop problem #20366

Closed jp1924 closed 1 year ago

jp1924 commented 1 year ago

System Info

OS: Ubuntu 18.04.6 LTS
GPUs: RTX 3090 * 2
CUDA: 11.1
Python: 3.8
transformers: 4.23.1
PyTorch: 1.10.1+cu111
NCCL: 2.10.3+cuda11.1

Who can help?

@sgugger @patrickvonplaten

Reproduction

from dataclasses import dataclass

from transformers import TrainingArguments, Trainer, BertTokenizerFast, HfArgumentParser
from transformers.utils import ModelOutput
from transformers.trainer_utils import is_main_process
from datasets import load_dataset, Dataset

import torch
import torch.nn as nn
import torch.distributed as dist

# ModelOutput subclasses are expected to be dataclasses.
@dataclass
class DummyModeloutput(ModelOutput):
    loss: torch.FloatTensor = None
    logits: torch.FloatTensor = None

class DummyModel(nn.Module):
    def __init__(self) -> None:
        super(DummyModel, self).__init__()
        self.dummy_layer = nn.Linear(10, 10)
        self.count = 0

    def forward(self, input_ids, labels, *args, **kwargs):

        rank = dist.get_rank()
        device = torch.device(rank)
        if is_main_process(rank):
            logits = torch.zeros((2, 512, 42 + self.count, 111), device=device)
        else:
            logits = torch.ones((2, 231, 70 + self.count, 111), device=device)

        loss = torch.tensor([0.5], device=device)

        self.count += 1

        return DummyModeloutput(loss=loss, logits=logits)

def main(parser: HfArgumentParser) -> None:
    args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)

    def imdb_preprocesser(dataset: Dataset) -> dict:
        text = dataset["text"]
        label = dataset["label"]

        encoded_data = tokenizer(text, return_attention_mask=False)
        encoded_data["label"] = label

        return encoded_data

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = DummyModel()

    imdb_data = load_dataset("imdb")

    train_data = imdb_data["train"].train_test_split(0.02)["test"]
    valid_data = imdb_data["test"]

    train_data = train_data.map(imdb_preprocesser, num_proc=3)
    valid_data = valid_data.map(imdb_preprocesser, num_proc=3)

    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_data,
        eval_dataset=valid_data,
        args=args,
        compute_metrics=lambda x: x,
    )

    trainer.evaluate(eval_dataset=valid_data)

if __name__ == "__main__":
    parser = HfArgumentParser([TrainingArguments])
    main(parser)
"""
for vscode user
launch.json
        {
            "name": "Python: infinite_loop",
            "type": "python",
            "request": "launch",
            "module": "torch.distributed.launch",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "CUDA_VISIBLE_DEVICES": "0, 2",
                "WANDB_DISABLED": "true",
                "TORCH_CPP_LOG_LEVEL": "DEBUG",
                "NCCL_DEBUG": "INFO",
                "NCCL_DEBUG_SUBSYS": "COLL",
                // "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
            },
            "args": [
                "--standalone",
                "--nnodes=1",
                "--nproc_per_node=2",
                "",
                "--output_dir=",
                "--do_train=true",
                "--do_eval=true",
                "--per_device_train_batch_size=2",
                "--learning_rate=1e-5",
                "--evaluation_strategy=steps",
                "--eval_steps=2",
                "--save_strategy=no"
            ]
        },
"""

Expected behavior


I ran into this issue while implementing a streaming model called Transformer-Transducer on top of Hugging Face Transformers.

Before explaining the issue, it helps to know the loss this model uses: the RNN-T loss provided by torchaudio. Unlike CTC loss, RNN-T loss takes logits as a 4-dimensional tensor, like this:

>>> logits.shape
(batch, max seq length, max target length + 1, class)

Because mel_seq and max_target_length depend on the input data, the logits shape differs between devices, for example:

[cuda:0] output_logits shape: (4, 512, 42, 111)
[cuda:1] output_logits shape: (4, 286, 32, 111)

The model takes log-Mel spectrograms as training input.
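To make the shape formula above concrete, here is a minimal sketch of how the per-batch logits shape follows from the sample lengths. The function name `rnnt_logits_shape` and the per-sample lengths are hypothetical, chosen only so that the results match the two example shapes from the report; only the batch maxima matter.

```python
from typing import Sequence, Tuple


def rnnt_logits_shape(
    frame_lengths: Sequence[int],
    target_lengths: Sequence[int],
    num_classes: int,
) -> Tuple[int, int, int, int]:
    """Return (batch, max seq length, max target length + 1, class) for one batch.

    Hypothetical helper for illustration; it only does the shape arithmetic.
    """
    if len(frame_lengths) != len(target_lengths):
        raise ValueError("one frame length and one target length per sample")
    return (
        len(frame_lengths),       # batch
        max(frame_lengths),       # longest input sequence in this batch
        max(target_lengths) + 1,  # longest target in this batch, plus one
        num_classes,              # vocabulary size
    )


# Made-up per-sample lengths for two devices that each see a different batch.
rank0_shape = rnnt_logits_shape([512, 400, 380, 300], [41, 30, 25, 12], 111)
rank1_shape = rnnt_logits_shape([286, 200, 150, 90], [31, 20, 10, 5], 111)

# The two ranks disagree on dims 1 and 2, which is what all_gather cannot handle.
mismatched_dims = [i for i, (a, b) in enumerate(zip(rank0_shape, rank1_shape)) if a != b]
```

Since each rank draws a different batch, dims 1 and 2 will almost never agree across ranks unless the outputs are padded to a common size first.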


This issue occurs in evaluation_loop when training with single-node DDP in the Trainer.

When I evaluate this model, the following error occurs:

Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective:
CollectiveFingerPrint(
    OpType=ALLGATHER,
    TensorShape=[1, 279, 44, 72],
    TensorDtypes=Float,
    TensorDeviceTypes=TensorOptions(
        dtype=float (default),
        device=cuda,
        layout=Strided (default),
        requires_grad=false (default),
        pinned_memory=false (default),
        memory_format=(nullopt)
    )
)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2003, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 212, in distributed_concat
    dist.all_gather(output_tensors, tensor)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 3101, in _nested_gather
    tensors = distributed_concat(tensors)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2987, in evaluation_loop
    logits = self._nested_gather(logits)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2774, in evaluate
    output = eval_loop(

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2052, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 1819, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 1500, in train
    return inner_training_loop(

  File "[My_folder_path]/transformer-transducer/main.py", line 115, in train
    outputs = trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)

  File "[My_folder_path]/transformer-transducer/main.py", line 96, in main
    train(trainer, train_args)

  File "[My_folder_path]/transformer-transducer/main.py", line 160, in <module>
    main(parser)

This issue arises from the all_gather collective used under DDP.

all_gather receives a tensor from every device in the group. The problem occurs while those tensors are being collected:

from transformers.trainer_utils import is_main_process
import torch.distributed as dist
import torch
import os

def main() -> None:
    dist.init_process_group("nccl")

    rank = dist.get_rank()
    device = torch.device(rank)
    if is_main_process(rank):
        tensor = torch.zeros((2, 100, 100), device=device)
    else:
        tensor = torch.ones((2, 100, 70), device=device)

    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)

if __name__ == "__main__":
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "DEBUG"
    main()

When the buffers in "output_tensors" are smaller than the "tensor" sent by another rank, the same "mismatch between collectives" problem occurs as above.

In the code above, "TORCH_DISTRIBUTED_DEBUG" is set to "DETAIL". Without that setting, no error is printed, and all_gather simply leaves "output_tensors" set to None.

evaluation_loop, however, takes the "output_tensors" returned by all_gather and concatenates them with the existing tensor using "torch.concat". I found that when "output_tensors" is in this None state, the "torch.concat" step prints no error and gets stuck in an infinite loop.
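One possible way around the shape mismatch is to exchange shapes first, pad every tensor to the element-wise maximum, gather, and then trim each result back to its true shape. The sketch below is only an illustration under assumptions: `max_shape`, `pad_widths`, and `gather_padded` are hypothetical names, the process group is assumed to be initialized already, and all ranks are assumed to hold tensors with the same number of dimensions and the same dtype. It is not what the Trainer does internally.

```python
from typing import List, Sequence, Tuple


def max_shape(shapes: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Element-wise maximum over per-rank shapes with the same number of dims."""
    return tuple(max(dims) for dims in zip(*shapes))


def pad_widths(shape: Sequence[int], target: Sequence[int]) -> List[int]:
    """Right-padding amounts in the order torch.nn.functional.pad expects.

    F.pad takes (left, right) pairs starting from the LAST dimension.
    """
    widths: List[int] = []
    for size, goal in zip(reversed(shape), reversed(target)):
        widths += [0, goal - size]
    return widths


def gather_padded(tensor, pad_value: float = 0.0):
    """Gather tensors whose shapes differ across ranks (hypothetical helper).

    Returns a list with one tensor per rank, each trimmed to its original shape.
    torch is imported lazily so the pure-Python helpers above work without it.
    """
    import torch
    import torch.distributed as dist
    import torch.nn.functional as F

    world_size = dist.get_world_size()

    # 1) Exchange shapes. The shape tensor has the same size on every rank,
    #    so this all_gather is well defined.
    local_shape = torch.tensor(tensor.shape, device=tensor.device)
    all_shapes = [torch.zeros_like(local_shape) for _ in range(world_size)]
    dist.all_gather(all_shapes, local_shape)
    shapes = [tuple(s.tolist()) for s in all_shapes]

    # 2) Pad the local tensor up to the common maximum shape.
    target = max_shape(shapes)
    padded = F.pad(tensor, pad_widths(tensor.shape, target), value=pad_value)

    # 3) Every rank now sends and receives tensors of identical shape.
    buffers = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(buffers, padded)

    # 4) Cut the padding off again using the shapes collected in step 1.
    return [
        buf[tuple(slice(0, n) for n in shape)]
        for buf, shape in zip(buffers, shapes)
    ]
```

With the dummy shapes from the reproduction script, `max_shape` yields `(2, 512, 70, 111)`, so the second rank would pad 281 positions along dim 1 and the first rank 28 positions along dim 2. Note that the gathered buffers can be large, so trimming or reducing them early may matter for memory.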


I know that Transformer-Transducer is not a model supported by Hugging Face, and that this problem comes from using a model that does not fit the Hugging Face Trainer.

Still, I think it would be cool to add a streaming ASR model such as Transformer-Transducer to Hugging Face, and this is an issue I found while experimenting. If there is any way or idea to solve this problem, I'd like to hear it.

sgugger commented 1 year ago

The evaluation loop in the Trainer indeed does not support un-padded outputs, as this doesn't occur with any model of the library in our examples. Fixing it would be quite involved, so I'd recommend using the Accelerate library, which provides a method to pad across processes to evaluate such models.