huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0
7.93k stars 966 forks source link

When using Deepspeed, RuntimeError: Tensor must have a storage_offset divisible by 2 #2106

Closed rgxb2807 closed 10 months ago

rgxb2807 commented 1 year ago

System Info

- `Accelerate` version: 0.20.3
- Platform: Linux-6.2.0-35-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- System RAM: 188.51 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: no
    - use_cpu: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Information

Tasks

Reproduction

Steps to reproduce behavior:

  1. Download and extract dev data - https://us.openslr.org/resources/12/dev-clean.tar.gz
  2. Build Env - Use docker base image nvidia/cuda:11.7.1-devel-ubuntu22.04, install repo requirements
accelerated
beartype>=0.16.1d
einops>=0.7.0d
ema-pytorch>=0.2.2d
encodecd
fairseqd
joblibd
lion-pytorchd
local-attention>=1.9.0d
scikit-learnd
sentencepieced
torch>=1.12d
torchaudiod
transformersd
tqdmd
vector-quantize-pytorch>=1.10.4
  1. Install library pip install audiolm-pytorch==1.6.7
  2. Run Env in interactive mode, mount volume with training data - docker run -v /path/to/repo/:/audio --ipc=host --gpus=all -it --entrypoint=/bin/bash audio -i
  3. Configure accelerate - accelerate config, when prompted if you want to use DeepSpeed, select Yes and follow prompts to configure for DeepSpeed Stage 2
  4. Create a trainer file for SoundStream soundstream_train.py
# imports

from audiolm_pytorch import SoundStream, SoundStreamTrainer,AudioLMSoundStream,

dataset_folder = "/path/to/data"

soundstream = AudioLMSoundStream(
    codebook_size = 1024,
    rq_groups = 2,
    attn_window_size = 128,
    attn_depth = 2
)

data_max_length_seconds = 2 
data_max_length = int(data_max_length_seconds * soundstream.target_sample_hz)

trainer = SoundStreamTrainer(
    soundstream,
    folder = dataset_folder,
    batch_size = 8,
    grad_accum_every = 4,         # effective batch size of 32
    data_max_length=data_max_length,
    save_results_every = 500,
    save_model_every = 1000,
    num_train_steps = 1000000
)
trainer.train()
  1. launch - $accelerate launch soundstream_train.py
  2. Returns error - RuntimeError: Tensor must have a storage_offset divisible by 2
  File "/audio/soundstream_train.py", line 43, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/trainer.py", line 572, in train
    logs = self.train_step()
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/trainer.py", line 441, in train_step
    loss, (recon_loss, multi_spectral_recon_loss, adversarial_loss, feature_loss, all_commitment_loss) = self.soundstream(wave, return_loss_breakdown = True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 852, in forward
    (stft_real_logits, stft_real_intermediates), (stft_fake_logits, stft_fake_intermediates) = map(partial(self.stft_discriminator, return_intermediates=True), (real, fake))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 285, in forward
    x = layer(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 308, in forward
    return self.fn(x, **kwargs) + x
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 197, in forward
    weight, bias = map(torch.view_as_complex, (self.weight, self.bias))
RuntimeError: Tensor must have a storage_offset divisible by 2
Traceback (most recent call last):
  File "/audio/soundstream_train.py", line 43, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/trainer.py", line 572, in train
    logs = self.train_step()
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/trainer.py", line 441, in train_step
    loss, (recon_loss, multi_spectral_recon_loss, adversarial_loss, feature_loss, all_commitment_loss) = self.soundstream(wave, return_loss_breakdown = True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 852, in forward
    (stft_real_logits, stft_real_intermediates), (stft_fake_logits, stft_fake_intermediates) = map(partial(self.stft_discriminator, return_intermediates=True), (real, fake))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 285, in forward
    x = layer(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 308, in forward
    return self.fn(x, **kwargs) + x
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiolm_pytorch/soundstream.py", line 197, in forward
    weight, bias = map(torch.view_as_complex, (self.weight, self.bias))
RuntimeError: Tensor must have a storage_offset divisible by 2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 440) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 926, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 671, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
soundstream_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-10-18_22:04:28
  host      : c6e49f99abae
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 441)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-18_22:04:28
  host      : c6e49f99abae
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 440)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================```

Expected behavior

Without DeepSpeed enabled, the model successfully trains across multiple GPUs. This happens without explicitly wrapping the trainer in the accelerate class because it is instantiated under the hood here.

When DeepSpeed (stage 1 or stage 2) is enabled, this error is returned. Expecting it to run correctly with appropriate offload as configured in the accelerate configuration.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

muellerzr commented 11 months ago

cc @pacman100

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.