
[BUG] (NVMe Offload with Zero3) Not enough buffers 0 for swapping 1 #3062


srikanthmalla commented 1 year ago

Hi, I am currently trying the off-the-shelf Transformers translation example with DeepSpeed:

BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-11b --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_eval_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 8 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 10 --save_steps 0 \
--eval_steps 5 --group_by_length   --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3_nvme_offload.json

The config file ds_config_zero3_nvme_offload.json uses the ZeRO-3 params from the main documentation (https://huggingface.co/docs/transformers/main_classes/deepspeed#zero3-example), like this:

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "aio": {
            "block_size": 262144,
            "queue_depth": 32,
            "thread_count": 1,
            "single_submit": false,
            "overlap_events": true
        },
}

I get the following error:

Not enough swap in buffers 0 for 1 params, ids = [258]
Num inflight: params 0, buffers 0, numel = 0
Num available params: count = 5, ids = {259, 233, 207, 246, 220}, numel = 167772160
.
.
.
File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 813, in all_gather_coalesced
AssertionError: Not enough buffers 0 for swapping 1
    self._ensure_availability_of_partitioned_params(params)
  File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 999, in _ensure_availability_of_partitioned_params
    swap_in_list[0].nvme_swapper.swap_in(swap_in_list, async_op=False)
  File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 308, in swap_in
    assert len(swap_in_paths) <= len(self.available_buffer_ids), f"Not enough buffers {len(self.available_buffer_ids)} for swapping {len(swap_in_paths)}"
AssertionError: Not enough buffers 0 for swapping 1

I don't get this error if the offload_param device is set to cpu instead of nvme. I am curious why this is happening and how to fix it. Also, the error occurs regardless of whether I add the aio params or remove them entirely. Please let me know.

Thank you!
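
For comparison, a minimal sketch of the working offload_param variant with device set to cpu, assuming the rest of the config stays as above (the NVMe-specific keys are simply dropped):

        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }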

srikanthmalla commented 1 year ago

Hi @HeyangQin and @tjruwase, could you help me resolve this issue? It happens irrespective of training or inference, and only when offloading parameters to NVMe.

tjruwase commented 1 year ago

@srikanthmalla, how many GPUs are you running on? Also, please share the impact of the following ds_config adjustments (a sketch of both follows the list):

  1. Set stage3_max_reuse_distance to 0
  2. Increase buffer_count in offload_param to 10, 15, and 20.
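
For reference, a minimal sketch of those two adjustments inside zero_optimization; every other key stays as in the config posted above, and 15 or 20 substitute for 10 in the same way:

        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 10,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "stage3_max_reuse_distance": 0
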
tjruwase commented 1 year ago

@srikanthmalla, did either of these changes help?

srikanthmalla commented 1 year ago

Hi @tjruwase, neither of them helped.

tjruwase commented 1 year ago

Thanks for the update. I was able to reproduce the problem. My initial look suggests that this is due to the optimization of prefetching and caching layer parameters to reduce offload overheads. The following error message shows that the buffer_count: 5 setting of offload_param is eventually exceeded:

[screenshot: the "Not enough swap in buffers 0 for 1 params, ids = [258]" error shown above, with available params {259, 233, 207, 246, 220}]

The meaning of the above message is that we are unable to add param 258 to the offload cache because it is full, containing params 259, 233, 207, 246, 220.
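
In other words (a rough mental model, not the actual DeepSpeed code): each cached or in-flight param pins one of the buffer_count swap buffers, so a swap-in can only proceed if a buffer is free:

# Rough mental model of the failing check; illustrative only, not DeepSpeed source.
buffer_count = 5                           # offload_param "buffer_count"
cached_params = [259, 233, 207, 246, 220]  # cached params, each pinning one swap buffer
available_buffers = buffer_count - len(cached_params)   # -> 0 free buffers
swap_in_paths = [258]                      # param 258 must now be swapped in from NVMe
assert len(swap_in_paths) <= available_buffers, \
    f"Not enough buffers {available_buffers} for swapping {len(swap_in_paths)}"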

I was able to work around this issue by disabling caching (i.e., "stage3_max_reuse_distance": 0). But since that did not work for you, I am concerned perhaps I have reproduced a different problem. Can you please try the following to help this investigation?

  1. Share how many GPUs you are using. (I reproduced on 4xV100-16GB.)
  2. Use this branch: https://github.com/microsoft/DeepSpeed/tree/olruwase/issue_3062. Please note that this branch includes offload debug prints that will increase log size.
  3. Run your original failing configuration and share the log.
  4. Run with caching disabled ("stage3_max_reuse_distance": 0) and prefetching disabled ("stage3_prefetch_bucket_size": 0) to see if that combination avoids the error; a sketch of this combination follows the list.
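
A minimal sketch of item 4's combination inside zero_optimization (all other keys stay as in the config above):

        "stage3_prefetch_bucket_size": 0,
        "stage3_max_reuse_distance": 0
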
tjruwase commented 1 year ago

@srikanthmalla, just to clarify: we recognize that there is an underlying issue that needs to be fixed. My requests above are just to help us understand it further. Thanks!

chongxiaoc commented 1 year ago

Hi, I'm getting the same issue when using DeepSpeed 0.10.0 with Hugging Face Transformers.

  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 297, in swap_in
    assert len(swap_in_paths) <= len(self.available_buffer_ids), f"Not enough buffers {len(self.available_buffer_ids)} for swapping {len(swap_in_paths)}"
AssertionError: Not enough buffers 0 for swapping 1

DeepSpeed config:

    zero_optimization:
      stage: 3
      offload_optimizer:
        device: nvme
        nvme_path: /tmp/nvme_offoad
      offload_param:
        device: nvme
        nvme_path: /tmp/nvme_offoad
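
For reference, buffer_count is not set in the config above, so the default applies (5, going by the DeepSpeed config defaults; treat that value as an assumption). Applying the buffer_count suggestion from earlier in the thread to this YAML form would look roughly like:

    zero_optimization:
      stage: 3
      offload_optimizer:
        device: nvme
        nvme_path: /tmp/nvme_offoad
      offload_param:
        device: nvme
        nvme_path: /tmp/nvme_offoad
        buffer_count: 10   # example value; 15 and 20 were also suggested above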