The evaluation loop in the Trainer does indeed not support un-padded outputs, as this doesn't occur with any model of the library in our examples. Fixing it would be quite involved, so I'd recommend using the Accelerate library, which provides a method to pad across processes in order to evaluate such models.
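For reference, a minimal sketch of that workaround, assuming an `Accelerator` has already been set up and `logits` is the ragged joiner output described in the report below (the helper name is illustrative, not an existing API):

```python
from accelerate import Accelerator

def gather_rnnt_logits(accelerator: Accelerator, logits):
    # Pad the ragged time (dim=1) and target (dim=2) axes to the max length
    # found on any process, then gather; without this the per-rank shapes differ.
    logits = accelerator.pad_across_processes(logits, dim=1, pad_index=0)
    logits = accelerator.pad_across_processes(logits, dim=2, pad_index=0)
    return accelerator.gather(logits)
```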
System Info
- OS: Ubuntu 18.04.6 LTS
- GPUs: RTX 3090 * 2
- CUDA: 11.1
- Python: 3.8
- transformers: 4.23.1
- PyTorch: 1.10.1+cu111
- NCCL: 2.10.3+cuda11.1
Who can help?
@sgugger @patrickvonplaten
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
This issue occurred while implementing a streaming model called Transformer-Transducer with Hugging Face.
Before explaining the issue, it is necessary to know the loss this model uses. The model uses the RNN-T loss provided by torchaudio. Unlike CTC loss, RNN-T loss takes logits as a 4-dimensional tensor of shape (batch, mel_seq, max_target_length, vocab_size).
Depending on the input data, mel_seq and max_target_length vary, e.g.
[cuda:0] output_logits shape: (4, 512, 42, 111)
[cuda:1] output_logits shape: (4, 286, 32, 111)
The model uses log-Mel spectrograms as its training data.
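For illustration, a minimal sketch of such a 4-D logits tensor fed to torchaudio's RNN-T loss (shapes taken from the rank-0 example above; the tensor contents are random placeholders, not the model's real outputs):

```python
import torch
import torchaudio

batch, mel_seq, max_target_length, vocab = 4, 512, 42, 111
# Joiner output: (batch, time, target, vocab) -- 4-D, unlike the 3-D logits of CTC
logits = torch.randn(batch, mel_seq, max_target_length, vocab)
targets = torch.randint(1, vocab, (batch, max_target_length - 1), dtype=torch.int32)
logit_lengths = torch.full((batch,), mel_seq, dtype=torch.int32)
target_lengths = torch.full((batch,), max_target_length - 1, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
```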
This issue occurs in evaluation_loop when training with single-node DDP in the Trainer.
When evaluating this model, the issue below occurred.
This is an issue that arises from DDP's all_gather. all_gather collects a tensor from every device in the group into a list of output tensors. However, while collecting those tensors, if the size of the "output_tensors" is smaller than the size of the "tensors" being gathered, the same "mismatch between collectives" problem as above occurs.
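A hedged sketch of the pattern that fails (the names are illustrative, not the actual Trainer code): each rank allocates its gather buffers from its own logits, so when the per-rank shapes differ the collective mismatches.

```python
import torch
import torch.distributed as dist

def naive_gather(logits: torch.Tensor) -> torch.Tensor:
    # Buffers are sized from THIS rank's logits, e.g. (4, 512, 42, 111) on
    # cuda:0 but (4, 286, 32, 111) on cuda:1 -- the collective then mismatches.
    output_tensors = [torch.empty_like(logits) for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, logits)
    return torch.cat(output_tensors, dim=0)
```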
In the code above, "TORCH_DISTRIBUTED_DEBUG" is set to "DETAIL"; if it is not, no error is printed and all_gather simply returns "output_tensors" as None.
But in evaluation_loop, the "output_tensors" returned by all_gather are concatenated (torch.concat) with the existing tensor. In particular, when torch.concat combines "output_tensors" in this None state with an existing tensor, I found that no error is printed and the process falls into an infinite loop.
I know that Transformer-Transducer is not a model supported by Hugging Face, and that this problem arises from using a model that does not fit the Hugging Face Trainer.
But I think it would be great to add a streaming ASR model such as Transformer-Transducer to Hugging Face, so this is an issue I found during that experiment. If there is any way or idea to solve this problem, I'd like to know.