trainer.evaluate infinite loop problem

System Info

OS: Ubuntu 18.04.6 LTS
GPUs: RTX 3090 * 2
CUDA: 11.1
python: 3.8
transformers: 4.23.1
pytorch: 1.10.1+cu111
NCCL: 2.10.3+cuda11.1

Who can help?

@sgugger @patrickvonplaten

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

from transformers import TrainingArguments, Trainer, BertTokenizerFast, HfArgumentParser
from transformers.utils import ModelOutput
from transformers.trainer_utils import is_main_process
from datasets import load_dataset, Dataset

import torch
import torch.nn as nn
import torch.distributed as dist

class DummyModeloutput(ModelOutput):
    loss: torch.FloatTensor = None
    logits: torch.FloatTensor = None

class DummyModel(nn.Module):
    def __init__(self) -> None:
        super(DummyModel, self).__init__()
        self.dummy_layer = nn.Linear(10, 10)
        self.count = 0

    def forward(self, input_ids, labels, *args, **kwargs):

        rank = dist.get_rank()
        device = torch.device(rank)
        if is_main_process(rank):
            logits = torch.zeros((2, 512, 42 + self.count, 111), device=device)
        else:
            logits = torch.ones((2, 231, 70 + self.count, 111), device=device)

        loss = torch.tensor([0.5], device=device)

        self.count += 1

        return DummyModeloutput(loss=loss, logits=logits)

def main(parser: HfArgumentParser) -> None:
    args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)

    def imdb_preprocesser(dataset: Dataset) -> dict:
        text = dataset["text"]
        label = dataset["label"]

        encoded_data = tokenizer(text, return_attention_mask=False)
        encoded_data["label"] = label

        return encoded_data

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = DummyModel()

    imdb_data = load_dataset("imdb")

    train_data = imdb_data["train"].train_test_split(0.02)["test"]
    valid_data = imdb_data["test"]

    train_data = train_data.map(imdb_preprocesser, num_proc=3)
    valid_data = valid_data.map(imdb_preprocesser, num_proc=3)

    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_data,
        eval_dataset=valid_data,
        args=args,
        compute_metrics=lambda x: x,
    )

    trainer.evaluate(eval_dataset=valid_data)

if __name__ == "__main__":
    parser = HfArgumentParser([TrainingArguments])
    main(parser)
"""
for vscode user
launch.json
        {
            "name": "Python: infinite_loop",
            "type": "python",
            "request": "launch",
            "module": "torch.distributed.launch",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "CUDA_VISIBLE_DEVICES": "0, 2",
                "WANDB_DISABLED": "true",
                "TORCH_CPP_LOG_LEVEL": "DEBUG",
                "NCCL_DEBUG": "INFO",
                "NCCL_DEBUG_SUBSYS": "COLL",
                // "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
            },
            "args": [
                "--standalone",
                "--nnodes=1",
                "--nproc_per_node=2",
                "",
                "--output_dir=",
                "--do_train=true",
                "--do_eval=true",
                "--per_device_train_batch_size=2",
                "--learning_rate=1e-5",
                "--evaluation_strategy=steps",
                "--eval_steps=2",
                "--save_strategy=no"
            ]
        },
"""

Expected behavior

I ran into this issue while implementing a streaming model called Transformer-Transducer on top of Hugging Face Transformers.

Before explaining the issue, it helps to know the loss this model uses: the RNN-T loss provided by torchaudio. Unlike CTC loss, RNN-T loss takes logits as a 4-dimensional tensor, like this:

>>> logits.shape
(batch, max seq length, max target length + 1, class)

Depending on the input data, the max sequence length (mel_seq) and max_target_length vary from device to device, for example:
[cuda:0] output_logits shape: (4, 512, 42, 111)
[cuda:1] output_logits shape: (4, 286, 32, 111)
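To make the mismatch concrete, here is a small pure-Python sketch that compares the two example shapes above and reports which dimensions differ, plus the smallest shape both tensors could be padded to. The helper names `mismatched_dims` and `common_padded_shape` are hypothetical and are not part of transformers or PyTorch.

```python
from typing import List, Sequence, Tuple


def mismatched_dims(shapes: Sequence[Tuple[int, ...]]) -> List[int]:
    """Return the indices of the dimensions whose size differs between ranks."""
    return [i for i, dims in enumerate(zip(*shapes)) if len(set(dims)) > 1]


def common_padded_shape(shapes: Sequence[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Return the element-wise maximum shape, i.e. the smallest shape that
    every rank's tensor could be padded to."""
    if len({len(shape) for shape in shapes}) != 1:
        raise ValueError("all shapes must have the same number of dimensions")
    return tuple(max(dims) for dims in zip(*shapes))


# Shapes taken from the example above (one per GPU).
shapes = [(4, 512, 42, 111), (4, 286, 32, 111)]
print(mismatched_dims(shapes))      # [1, 2]
print(common_padded_shape(shapes))  # (4, 512, 42, 111)
```

Two dimensions differ here (sequence length and target length), so padding along a single dimension would not be enough to make the shapes agree.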

The model uses log-Mel spectrograms as training data.

This issue occurs in evaluation_loop when training with single-node DDP in the Trainer.

When I evaluated the model, the following error occurred:

Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective:
CollectiveFingerPrint(
    OpType=ALLGATHER,
    TensorShape=[1, 279, 44, 72],
    TensorDtypes=Float,
    TensorDeviceTypes=TensorOptions(
        dtype=float (default),
        device=cuda,
        layout=Strided (default),
        requires_grad=false (default),
        pinned_memory=false (default),
        memory_format=(nullopt)
    )
)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2003, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 212, in distributed_concat
    dist.all_gather(output_tensors, tensor)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 3101, in _nested_gather
    tensors = distributed_concat(tensors)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2987, in evaluation_loop
    logits = self._nested_gather(logits)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2774, in evaluate
    output = eval_loop(

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 2052, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 1819, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)

  File "[My_folder_path]/venv_for_transducer/lib/python3.8/site-packages/transformers/trainer.py", line 1500, in train
    return inner_training_loop(

  File "[My_folder_path]/transformer-transducer/main.py", line 115, in train
    outputs = trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)

  File "[My_folder_path]/transformer-transducer/main.py", line 96, in main
    train(trainer, train_args)

  File "[My_folder_path]/transformer-transducer/main.py", line 160, in <module>
    main(parser)

This issue arises from the all_gather collective used with DDP.

all_gather receives a tensor from every device that belongs to the group. The problem occurs while those tensors are being received, as the following script shows:

from transformers.trainer_utils import is_main_process
import torch.distributed as dist
import torch
import os

def main() -> None:
    dist.init_process_group("nccl")

    rank = dist.get_rank()
    device = torch.device(rank)
    if is_main_process(rank):
        tensor = torch.zeros((2, 100, 100), device=device)
    else:
        tensor = torch.ones((2, 100, 70), device=device)

    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)

if __name__ == "__main__":
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "DEBUG"
    main()

When the buffers in "output_tensors" are smaller than the "tensor" arriving from another rank, the same "mismatch between collectives" problem occurs as above.

In the code above, "TORCH_DISTRIBUTED_DEBUG" is set to "DETAIL". Without that setting, no error is printed; all_gather just leaves "output_tensors" as None.

In evaluation_loop, however, the "output_tensors" returned by all_gather are then passed to "torch.concat" together with the existing tensor. I found that concatenating "output_tensors" in this None state with an existing tensor raises no error and instead ends up in an infinite loop.
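One possible way to sidestep the shape mismatch is to exchange shapes first, pad every tensor to a common shape, gather, and then trim the results. The following is only a sketch under assumptions: the helpers `pad_to_shape` and `all_gather_variable_shape` are hypothetical names (not Trainer or PyTorch APIs), the pad value of -100 is an arbitrary choice, and the code assumes the process group is already initialized and that every rank holds a tensor with the same number of dimensions and the same dtype.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def pad_to_shape(tensor, target_shape, pad_value=-100.0):
    """Right-pad every dimension of `tensor` up to `target_shape`."""
    pad = []
    # F.pad expects (left, right) pairs starting from the LAST dimension.
    for size, target in zip(reversed(tensor.shape), reversed(target_shape)):
        pad.extend([0, target - size])
    return F.pad(tensor, pad, value=pad_value)


def all_gather_variable_shape(tensor, pad_value=-100.0):
    """Gather tensors whose shapes differ across ranks.

    Assumes dist.init_process_group() has already been called.
    """
    world_size = dist.get_world_size()

    # Step 1: exchange shapes. These small tensors have the same shape on
    # every rank, so this all_gather is always consistent.
    local_shape = torch.tensor(list(tensor.shape), device=tensor.device, dtype=torch.long)
    all_shapes = [torch.zeros_like(local_shape) for _ in range(world_size)]
    dist.all_gather(all_shapes, local_shape)

    # Step 2: pad the local tensor to the element-wise maximum shape.
    max_shape = torch.stack(all_shapes).max(dim=0).values.tolist()
    padded = pad_to_shape(tensor, max_shape, pad_value)

    # Step 3: every rank now sends and receives buffers of one common shape.
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # Step 4: trim each result back to the shape its rank reported.
    return [
        g[tuple(slice(0, n) for n in shape.tolist())]
        for g, shape in zip(gathered, all_shapes)
    ]
```

In the reproduction script above, calling `all_gather_variable_shape(tensor)` in place of the raw `dist.all_gather` call would return one tensor per rank, each with its original shape. The extra shape exchange costs one additional small collective per gather.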

I know that Transformer-Transducer is not a model supported by Hugging Face, and that this problem comes from using a model that does not fit the Hugging Face Trainer.

But I think it would be cool to add a streaming ASR model such as Transformer-Transducer to Hugging Face, and this is an issue I found while experimenting. If there is any way or idea to solve this problem, please let me know.
