huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer having issues with DataLoaderShard when running with torchrun #31457

Open mohummedalee opened 3 months ago

mohummedalee commented 3 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus), launching the training script with torchrun for distributed training (the exact command is below). My code also relies on private-transformers, but as the stack trace below shows, the error happens inside Hugging Face's Trainer. I have made a quick fix inside the Trainer source code (shown below) to get my code working, but I am opening an issue here to check whether this is a general problem that needs fixing.
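For context, here is a minimal sketch (placeholder dataset and sizes, not my actual setup) of why a dataloader's batch_size can end up as None once accelerate prepares it: the prepared DataLoaderShard is built around a batch sampler, and a PyTorch DataLoader constructed with an explicit batch_sampler stores batch_size = None.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder data; any map-style dataset shows the same behavior.
loader = DataLoader(TensorDataset(torch.zeros(128, 4)), batch_size=64)
prepared = Accelerator().prepare(loader)

print(type(prepared).__name__)    # DataLoaderShard
print(prepared.batch_size)        # None: the batch size now lives on the batch sampler
print(prepared.total_batch_size)  # 64: effective batch size across all processes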

Traceback (most recent call last):
  File "/work/fairness-privacy/src/train.py", line 335, in <module>
    train_helper(args, dataset['train'], dataset['validation'])
  File "/work/fairness-privacy/src/train.py", line 300, in train_helper
    model_ft = train_private(args, train_data_tok, val_data_tok)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/src/train.py", line 160, in train_private
    trainer.train(model_path=None, dev_objective="eval_accuracy")
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 401, in train
    logging_loss_scalar = self.evaluate_and_log(
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 586, in evaluate_and_log
    output = self.evaluate()
             ^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/private-transformers/examples/classification/src/trainer.py", line 569, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/fairness-privacy/condaenv/lib/python3.11/site-packages/transformers/trainer.py", line 3862, in prediction_loop
    losses = loss.repeat(batch_size)
             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints, but found element of type NoneType at pos 0
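
The failing line can be reproduced in isolation; passing None as the repeat count raises exactly this TypeError:

import torch

loss = torch.tensor(0.5)    # stand-in for the evaluation loss
batch_size = None           # what the prepared dataloader reports here
losses = loss.repeat(batch_size)
# TypeError: repeat(): argument 'repeats' (position 1) must be tuple of ints,
# but found element of type NoneType at pos 0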

I am executing this script using:

EPOCHS=1
BATCH_SIZE=64
EPSILON=8
MODEL_OUT="models/roberta-priv-eps_${EPSILON}_epochs_${EPOCHS}-bs_${BATCH_SIZE}"
N_GPUS=1

torchrun --nnodes=1 --nproc-per-node=${N_GPUS} src/train.py \
    --train-mode private \
    --data-path /work/fairness-privacy/twitteraae-sentiment-data-split/ \
    --epochs $EPOCHS \
    --model-out-path $MODEL_OUT \
    --tracking-interval 5000 \
    --priv-epsilon $EPSILON \
    --priv-max-grad-norm 0.1 \
    --do-eval

I am able to avoid this error by adding the following hack inside prediction_loop:

from accelerate.data_loader import DataLoaderShard

# A DataLoaderShard keeps its effective batch size on total_batch_size;
# its inherited batch_size attribute is None because it wraps a batch sampler.
if isinstance(dataloader, DataLoaderShard):
    batch_size = dataloader.total_batch_size
else:
    batch_size = dataloader.batch_size
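
A more defensive variant of the same idea, which avoids importing accelerate internals, would be to fall back through the attributes (my suggestion, assuming total_batch_size is set whenever batch_size is missing):

# Hypothetical alternative: attribute fallback instead of a type check.
batch_size = getattr(dataloader, "total_batch_size", None) or dataloader.batch_size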

Expected behavior

prediction_loop should run normally, so that its caller (evaluate_and_log) can log evaluation results during training. More concretely, batch_size should be a scalar rather than None, so that losses = loss.repeat(batch_size) inside prediction_loop can execute.
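
With a scalar batch size, the same line works as intended, broadcasting the per-batch loss across the batch (illustrative values):

import torch

loss = torch.tensor(0.5)    # stand-in for the evaluation loss
losses = loss.repeat(64)    # tensor of shape (64,), every entry 0.5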

amyeroberts commented 2 months ago

Gentle ping @SunMarc @muellerzr

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.