[Open] mohummedalee opened this issue 3 months ago
Gentle ping @SunMarc @muellerzr
System Info

`transformers` version: 4.37.2
Who can help?
@muellerzr @SunMarc
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I am fine-tuning a RoBERTa model with differential privacy (using PyTorch's Opacus). This is the specific script I'm running, using `torchrun` for distributed training. My code also relies on `private-transformers`, but as you can see in the stacktrace below, the error happens inside Hugging Face's `Trainer`. I have made a quick fix inside the `Trainer` source code (shown below) to make my code work; however, I am opening an issue here to see whether this is a general issue that needs fixing.

I am executing the script with `torchrun --nnodes=1 --nproc-per-node=${N_GPUS}`.
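For context, my assumption about the root cause: PyTorch's `DataLoader` sets its `batch_size` attribute to `None` whenever a `batch_sampler` is supplied, which is how Opacus-style loaders are typically constructed. A standalone sketch (not the actual training code):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

# When a batch_sampler is supplied, DataLoader.batch_size is None --
# the per-step size is only known from the batches themselves.
dataset = TensorDataset(torch.randn(8, 4))
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
loader = DataLoader(dataset, batch_sampler=sampler)

print(loader.batch_size)  # None

for (batch,) in loader:
    loss = torch.tensor(0.5)        # stand-in for a per-batch scalar loss
    observed = batch.shape[0]       # infer the batch size from the batch itself
    losses = loss.repeat(observed)  # works; loss.repeat(None) would not
```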
I am able to avoid this error when I make the following hack inside `prediction_loop`:
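A minimal, hypothetical sketch of the kind of guard described (not the exact patch; it assumes `prediction_loop` reads `dataloader.batch_size` before calling `loss.repeat(batch_size)`):

```python
import torch

def repeat_loss(loss, dataloader_batch_size, batch):
    """Hypothetical guard: fall back to the observed batch size when the
    DataLoader reports None (as it does when a batch_sampler is used,
    e.g. by Opacus)."""
    batch_size = dataloader_batch_size
    if batch_size is None:
        # Infer the per-step batch size from the batch actually received
        batch_size = batch.shape[0]
    return loss.repeat(batch_size)

loss = torch.tensor(0.25)   # scalar per-batch loss
batch = torch.zeros(4, 8)   # a batch of 4 examples
print(repeat_loss(loss, None, batch).shape)  # torch.Size([4])
```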
Expected behavior
The expected behavior is that `prediction_loop` runs normally and that its caller (`evaluate_and_log`) can log the evaluation results during training. At a more fine-grained level, `batch_size` should be a scalar and not `None` (as is happening in this case), so that `losses = loss.repeat(batch_size)` inside `prediction_loop` can run.
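To make the last point concrete (a standalone sketch, independent of the `Trainer` internals): `torch.Tensor.repeat` rejects a `None` size, which is why a `None` `batch_size` crashes at this line.

```python
import torch

loss = torch.tensor(1.0)

# A None batch_size fails immediately:
try:
    loss.repeat(None)
except TypeError as err:
    print("TypeError:", err)

# A scalar batch_size behaves as prediction_loop expects:
print(loss.repeat(3))  # tensor([1., 1., 1.])
```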