Open tomtseng opened 1 month ago
@MekkCyber is looking into that!
System Info

transformers version: 4.44.2

Who can help?

No response

Information

Tasks

- examples folder (such as GLUE/SQuAD, ...)

Reproduction
This is a duplicate of #24098 and #25695, but I figured it'd still be useful to resubmit this issue since (1) I have a code example, and (2) I paste a different error message I get with mixed precision, which may increase visibility for other people who run into this problem and search for existing GitHub issues.
When I do multi-GPU training (launched with `accelerate launch --num_processes=2`) using `Trainer` with a small dataset size and `gradient_accumulation_steps > 2`, I often repeatedly get the following error:

If FP16 mixed precision is enabled, then the error looks like this instead:
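The error messages themselves aren't included in this copy, but the underlying mismatch can be illustrated with some back-of-the-envelope arithmetic. The numbers below (dataset size 8, per-device batch size 2) are hypothetical and chosen only to show the problem shape:

```python
import math

# Hypothetical numbers for illustration; not taken from the original report.
dataset_size = 8
num_processes = 2
per_device_batch_size = 2
gradient_accumulation_steps = 4

# Each process sees roughly 1/num_processes of the data per epoch.
examples_per_process = dataset_size // num_processes
batches_per_process = math.ceil(examples_per_process / per_device_batch_size)

# Number of full accumulation windows (i.e. optimizer steps) per epoch.
full_windows = batches_per_process // gradient_accumulation_steps

print(batches_per_process, full_windows)  # prints: 2 0
```

With zero full accumulation windows per epoch, every epoch ends on a partial window, and that edge case seems to be where these crashes come from.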
Here's a minimal example; run the following with `accelerate launch --config_file=accelerate_config.yaml --num_processes=2 program.py`:
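The `accelerate_config.yaml` referenced by the command isn't included in this copy; a hypothetical config matching the setup described above (two processes, optionally fp16 mixed precision) might look roughly like:

```yaml
# Hypothetical accelerate config; the original file is not included in this copy.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16  # or "no" to reproduce the non-fp16 error
num_processes: 2
```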
My use case for this was a codebase where we had added some end-to-end tests. We used a very small dataset size since we wanted the tests to stay reasonably fast, but then we ran into these exceptions and were confused.
Expected behavior
I think I expect this to just work without crashing. But maybe it's not really a sensible setup to have such a small training set; in #24098, commenters suggested that the training set size may simply be too small for this configuration.
In that case it would be nice to have an error message saying that this is the problem.
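A guard along these lines is roughly what I have in mind for that error message. This is a hypothetical sketch, not part of transformers, and the threshold (one full effective batch) is my assumption:

```python
def check_dataset_size(num_examples: int,
                       per_device_batch_size: int,
                       gradient_accumulation_steps: int,
                       num_processes: int) -> None:
    """Fail early if the dataset can't fill one full accumulation window.

    Hypothetical helper; the threshold is an assumption, not transformers'
    actual rule.
    """
    effective_batch = (per_device_batch_size
                       * gradient_accumulation_steps
                       * num_processes)
    if num_examples < effective_batch:
        raise ValueError(
            f"Training set has {num_examples} examples but the effective "
            f"batch size is {effective_batch} ({per_device_batch_size} per "
            f"device x {gradient_accumulation_steps} accumulation steps x "
            f"{num_processes} processes). Consider lowering "
            "gradient_accumulation_steps or the batch size."
        )

# Example: 8 examples cannot fill an effective batch of 2 * 4 * 2 = 16.
try:
    check_dataset_size(8, per_device_batch_size=2,
                       gradient_accumulation_steps=4, num_processes=2)
except ValueError as e:
    print("raised:", e)
```

An upfront check like this would have turned our confusing distributed-training crash into a one-line explanation.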