Closed rgxb2807 closed 10 months ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @pacman100
System Info
Information
Tasks
- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Steps to reproduce the behavior:

1. Build an image from `nvidia/cuda:11.7.1-devel-ubuntu22.04` and install the repo requirements: `pip install audiolm-pytorch==1.6.7`
2. Start a container: `docker run -v /path/to/repo/:/audio --ipc=host --gpus=all -it --entrypoint=/bin/bash audio -i`
3. Run `accelerate config`; when prompted whether to use DeepSpeed, select Yes and follow the prompts to configure DeepSpeed Stage 2
4. Run `accelerate launch soundstream_train.py`

The launch fails with:

```
RuntimeError: Tensor must have a storage_offset divisible by 2
```
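For reference, a Stage 2 configuration written by `accelerate config` typically looks something like the sketch below (the exact values here are illustrative, not the configuration used in this report):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero_stage: 2
mixed_precision: 'no'
num_machines: 1
num_processes: 2
```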
Expected behavior
Without DeepSpeed enabled, the model trains successfully across multiple GPUs. The trainer is not explicitly wrapped in the `Accelerator` class because it is instantiated under the hood here.
When DeepSpeed (Stage 1 or Stage 2) is enabled, the error above is raised. Expected behavior: training runs correctly, with offload applied as specified in the accelerate configuration.
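For context on what the error message refers to: a PyTorch tensor view can start at an arbitrary element offset into its underlying storage, and `storage_offset()` reports that offset. The minimal sketch below only illustrates the concept in plain PyTorch; it is not the failing DeepSpeed code path, which presumably produces such an odd-offset view out of its flattened parameter buffers.

```python
import torch

# A view created by slicing shares storage with its base tensor;
# its first element sits at storage_offset() elements into that storage.
base = torch.arange(6, dtype=torch.float16)
view = base[1:]  # starts at element 1 of base's storage

# An odd offset like this is what "storage_offset divisible by 2" rejects.
print(view.storage_offset())  # -> 1
```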