JackCai1206 opened this issue 3 months ago
Interesting! Would you like to open a PR? (maybe torch.pad would work better?)
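Roughly, the idea would be to pad the mask on the torch side instead of going through tokenizer.pad. An untested sketch (it assumes per-example masks of shape (1, seq_len, seq_len) where 0 means masked out):

```python
import torch
import torch.nn.functional as F

def pad_4d_masks(masks, max_len):
    # masks: list of per-example tensors of shape (1, seq_len, seq_len)
    padded = []
    for mask in masks:
        diff = max_len - mask.shape[-1]
        # pad the last two dims on the right/bottom with 0 (= masked out)
        padded.append(F.pad(mask, (0, diff, 0, diff), value=0))
    return torch.stack(padded)  # (batch, 1, max_len, max_len)
```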
Hello @JackCai1206, have you tried using a custom collator and passing it to the trainer via the data_collator parameter? I also have an issue with custom 4D masks during training (using a custom collator), but mine is related to OOM...
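For reference, here is the rough shape of the collator I mean (not tested here). It only handles input_ids plus a square per-example attention_mask of shape (1, seq_len, seq_len), and pads the mask with torch instead of letting tokenizer.pad touch it:

```python
from dataclasses import dataclass

import torch
import torch.nn.functional as F
from transformers import PreTrainedTokenizerBase


@dataclass
class Custom4dMaskCollator:
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Pull the 4D masks out so tokenizer.pad never sees them.
        masks = [torch.as_tensor(f.pop("attention_mask")) for f in features]
        batch = self.tokenizer.pad(features, padding=True, return_tensors="pt")

        # Pad each (1, seq_len, seq_len) mask to the batch max length and stack.
        max_len = batch["input_ids"].shape[1]
        batch["attention_mask"] = torch.stack(
            [F.pad(m, (0, max_len - m.shape[-1], 0, max_len - m.shape[-1]), value=0) for m in masks]
        )  # (batch, 1, max_len, max_len)
        return batch
```

It would then be passed as Trainer(..., data_collator=Custom4dMaskCollator(tokenizer)).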
Hi @ArthurZucker, I am interested in working on this issue, could I take it up?
Hey! sure, opening a PR is the way to go 🤗
System Info
transformers 4.41.0
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Since the profiler trace is really long, I only included the first few lines. I am running a small llama model on some dummy data; the only difference between the two datasets is that the slow version outputs 4D attention masks, a feature recently added in #27539. I am running both trainers for 1 iteration.
As you can see, the slow run takes 340s while the fast one finishes in 16s.
The slow version of the trainer is many times slower than the fast version. The problem probably lies in the default collator DataCollatorWithPadding (used when there is a pretrained tokenizer), which calls tokenizer.pad on the 4D attention masks. When you take away either 1) the pretrained tokenizer or 2) the 4D attention mask, the trainer runs much faster.
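To look at the collator in isolation (outside of the Trainer), something like the sketch below can be used. The tokenizer checkpoint, batch size, and sequence length are placeholders, so the timings will not match the numbers above; it just times DataCollatorWithPadding on a batch with 2D masks vs. the same batch with 4D masks:

```python
import time

import torch
from transformers import AutoTokenizer, DataCollatorWithPadding

# Placeholder tokenizer; any tokenizer works as long as a pad token is set.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token

batch_size, seq_len = 8, 512
ids = [1] * seq_len

# Batch with the usual 2D (per-token) attention masks.
features_2d = [{"input_ids": ids, "attention_mask": [1] * seq_len} for _ in range(batch_size)]

# Same batch, but each example carries a (1, seq_len, seq_len) mask as allowed since #27539.
mask_4d = torch.ones(1, seq_len, seq_len, dtype=torch.long).tolist()
features_4d = [{"input_ids": ids, "attention_mask": mask_4d} for _ in range(batch_size)]

collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

for name, feats in [("2d mask", features_2d), ("4d mask", features_4d)]:
    start = time.perf_counter()
    collator(feats)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```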