MoritzLaurer closed this issue 3 weeks ago.
A user reported that they also had the issue with `torchrun --nproc-per-node 1` vs. `torchrun --nproc-per-node 4` and an iterable dataset, so it might actually not be linked to accelerate but to datasets, and specifically to the behaviour of iterable datasets on multiple GPUs. Could you maybe have a look at this @lhoestq?
I'm guessing this has to do with the length of the `IterableDatasetShard` rounding up to a multiple of the batch size: https://github.com/huggingface/accelerate/blob/v0.33.0/src/accelerate/data_loader.py#L320

Then in the `DistributedTensorGatherer`, where we would usually truncate from a multiple of the batch size down to num_examples, in this case the provided num_examples is already the (rounded-up) length of the `IterableDatasetShard`, so there is nothing to truncate and the extra samples added to the last batch are kept: https://github.com/huggingface/transformers/blob/e8401273704e550b38879eef2f92f0e4866636b8/src/transformers/trainer_pt_utils.py#L541

The duplication seems to be expected behaviour for the `IterableDatasetShard` then: it keeps the batch size equal across GPUs by either dropping the last batch or duplicating samples into it.
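As a rough illustration, here is a minimal sketch of that rounding with assumed numbers (a per-device eval batch size of 8 on 2 GPUs; the actual batch size is not given in the report):

```python
import math

# Assumed numbers, not taken from the report: MRPC validation set,
# 2 processes, per-device eval batch size of 8.
num_examples = 408
num_processes = 2
per_device_batch_size = 8
global_batch_size = per_device_batch_size * num_processes  # 16

# IterableDatasetShard with drop_last=False pads the last incomplete global
# batch by re-using earlier samples, so the total number of yielded samples
# is rounded up to a multiple of the global batch size.
padded_length = math.ceil(num_examples / global_batch_size) * global_batch_size
print(padded_length)                  # 416
print(padded_length - num_examples)   # 8 duplicated samples
```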
The `gather_for_metrics` method described here could be useful for creating workarounds.
How would you recommend addressing this issue to ensure evaluation without duplicates, @muellerzr? Or maybe @lhoestq might know how best to address the duplication issue to get evaluations with the correct number of labels with streaming datasets in multi-GPU settings?
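For reference, a minimal sketch of how `gather_for_metrics` is typically used in a custom accelerate evaluation loop (the toy model and in-memory dataset below are placeholders, not the script from the report; with a map-style dataset like this the dataloader length is known, which is exactly the condition that streaming breaks):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins: a linear "model" over random features and a small labelled set
# the same size as the MRPC validation split.
features = torch.randn(408, 16)
labels = torch.randint(0, 2, (408,))
eval_dataloader = DataLoader(TensorDataset(features, labels), batch_size=8)
model = torch.nn.Linear(16, 2)

accelerator = Accelerator()
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds, all_labels = [], []
model.eval()
for batch_features, batch_labels in eval_dataloader:
    with torch.no_grad():
        preds = model(batch_features).argmax(dim=-1)
    # gather_for_metrics gathers across processes and drops the samples added
    # to pad the last batch -- but it can only do that when the dataloader
    # length is known, which is not the case for a streaming/iterable dataset.
    preds_g, labels_g = accelerator.gather_for_metrics((preds, batch_labels))
    all_preds.append(preds_g)
    all_labels.append(labels_g)

all_preds = torch.cat(all_preds)
all_labels = torch.cat(all_labels)
print(all_labels.shape)  # (408,) when the length is known
```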
By default the `Trainer` should be using `gather_for_metrics`. This requires knowing the full length of the dataloader, which is probably why it can't be used here. We do warn that when it can't be used, users should manually drop the extra samples themselves. If we have a method to help us figure out when we're on the last batch and/or the total size, then we can do something. Otherwise it's a manual process.
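A sketch of that manual process (assuming the true number of evaluation examples is known ahead of time, and that the duplicated samples end up at the tail of the gathered arrays, which is an assumption rather than a guarantee):

```python
import numpy as np

# Known ahead of time for the MRPC validation split; with a streaming dataset
# this number has to come from somewhere else (dataset card, a one-off count, ...).
NUM_EVAL_EXAMPLES = 408

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    # Drop the samples appended to fill the last batch. This only removes the
    # right rows if the duplicates sit at the end of the gathered arrays.
    predictions = predictions[:NUM_EVAL_EXAMPLES]
    labels = labels[:NUM_EVAL_EXAMPLES]
    return {"accuracy": float((predictions == labels).mean())}
```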
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Using a slightly modified version of the run_glue.py script.

When I run the following code with `streaming=False` on two GPUs (`--num_process 2`), the shape of the labels returned in `compute_metrics` is `(408,)` (which is the correct number of rows in the MRPC validation set). When I run the same script with `streaming=True` on two GPUs, the shape of the labels returned in `compute_metrics` is `(416,)`, which means that some labels were duplicated. When I run the same with `--num_process 1` and `streaming=True`, the shape is correct again at 408. It seems like using streaming / an iterable dataset on multiple GPUs with accelerate leads to duplication of some data in `compute_metrics`.
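The data-loading change behind the streaming flag would look roughly like this (a sketch only; the actual modifications to run_glue.py are not included in the report, and the dataset/split names below are taken from the description above):

```python
from datasets import load_dataset

streaming = True  # the flag toggled in the report; how it is wired into run_glue.py is assumed

# GLUE/MRPC has 408 validation examples. With streaming=True this returns an
# IterableDataset, which is what triggers the duplication on multiple GPUs.
raw_datasets = load_dataset("glue", "mrpc", streaming=streaming)
eval_dataset = raw_datasets["validation"]
```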
Expected behavior

Get the correct number of labels in `compute_metrics` for correct metric calculation.
@muellerzr @ssharpe42 See this internal conversation for context.