huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer having issues with DataLoaderShard when running with torchrun #31457

Open mohummedalee opened 3 months ago

mohummedalee commented 3 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus), launching the training script with torchrun for distributed training (the exact command is below). My code also relies on private-transformers, but as the stack trace below shows, the error happens inside Hugging Face's Trainer. I have made a quick fix inside the Trainer source code (shown below) to get my code working, but I am opening an issue here to check whether this is a general problem that needs fixing.
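For context, here is a minimal sketch (placeholder dataset and sizes, not my actual setup) of why a dataloader's batch_size can end up as None once accelerate prepares it: the prepared DataLoaderShard is built around a batch sampler, and a PyTorch DataLoader constructed with an explicit batch_sampler stores batch_size = None.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder data; any map-style dataset shows the same behavior.
loader = DataLoader(TensorDataset(torch.zeros(128, 4)), batch_size=64)
prepared = Accelerator().prepare(loader)

print(type(prepared).__name__)    # DataLoaderShard
print(prepared.batch_size)        # None: the batch size now lives on the batch sampler
print(prepared.total_batch_size)  # 64: effective batch size across all processes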

Traceback (most recent call last):
  File "/work/fairness-privacy/src/train.py", line 335, in <module>
    train_helper(args, dataset['train'], dataset['validation'])
  File "/work/fairness-privacy/src/train.py", line 300, in train_helper
    model_ft = train_private(args, train_data_tok, val_data_tok)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/src/train.py", line 160, in train_private
    trainer.train(model_path=None, dev_objective="eval_accuracy")
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 401, in train
    logging_loss_scalar = self.evaluate_and_log(
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 586, in evaluate_and_log
    output = self.evaluate()
             ^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 569, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/condaenv/lib/python3.11/site-packages/transformers/trainer.py", line 3862, in prediction_loop
    losses = loss.repeat(batch_size)
             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints, but found element of type NoneType at pos 0
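
The failing line can be reproduced in isolation; passing None as the repeat count raises exactly this TypeError:

import torch

loss = torch.tensor(0.5)    # stand-in for the evaluation loss
batch_size = None           # what the prepared dataloader reports here
losses = loss.repeat(batch_size)
# TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints,
# but found element of type NoneType at pos 0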

I am executing this script using:

EPOCHS=1
BATCH_SIZE=64
EPSILON=8
MODEL_OUT="models/roberta-priv-eps_${EPSILON}_epochs_${EPOCHS}-bs_${BATCH_SIZE}"
N_GPUS=1

torchrun --nnodes=1 --nproc-per-node=${N_GPUS} src/train.py \
    --train-mode private \
    --data-path /work/fairness-privacy/twitteraae-sentiment-data-split/ \
    --epochs $EPOCHS \
    --model-out-path $MODEL_OUT \
    --tracking-interval 5000 \
    --priv-epsilon $EPSILON \
    --priv-max-grad-norm 0.1 \
    --do-eval

I am able to avoid this error by adding the following hack inside prediction_loop:

from accelerate.data_loader import DataLoaderShard

# A DataLoaderShard keeps its effective batch size on total_batch_size;
# its inherited batch_size attribute is None because it wraps a batch sampler.
if isinstance(dataloader, DataLoaderShard):
    batch_size = dataloader.total_batch_size
else:
    batch_size = dataloader.batch_size
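
A more defensive variant of the same idea, which avoids importing accelerate internals, would be to fall back through the attributes (my suggestion, assuming total_batch_size is set whenever batch_size is missing):

# Hypothetical alternative: attribute fallback instead of a type check.
batch_size = getattr(dataloader, "total_batch_size", None) or dataloader.batch_size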

Expected behavior

prediction_loop should run normally, so that its caller (evaluate_and_log) can log evaluation results during training. More concretely, batch_size should be a scalar rather than None, so that losses = loss.repeat(batch_size) inside prediction_loop can execute.
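
With a scalar batch size, the same line works as intended, broadcasting the per-batch loss across the batch (illustrative values):

import torch

loss = torch.tensor(0.5)    # stand-in for the evaluation loss
losses = loss.repeat(64)    # tensor of shape (64,), every entry 0.5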

amyeroberts commented 2 months ago

Gentle ping @SunMarc @muellerzr

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.