microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Changing offload to NVMe instead of CPU causes error: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' #5124

Closed: eshnil2000 closed this issue 6 months ago

eshnil2000 commented 7 months ago

Describe the bug
I'm able to run peft pre-trained models from https://github.com/huggingface/peft and can successfully offload to CPU memory using DeepSpeed. The model I used is "facebook/bart-large".

But when I try to run the exact same model with offload to NVMe, I get this error: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

My Accelerate config file "ds_zero3_nvme.yaml":

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/ubuntu/peft/deepspeed_config.json
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My deepspeed config file "deepspeed_config.json":

  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 3e9,
    "stage3_max_reuse_distance": 3e9,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_prefetch_bucket_size": 5e7,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_bucket_size": 90000000,
    "sub_group_size": 1e9,
    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/mnt/nvme1",
        "pin_memory": true,
        "buffer_count": 4,
        "fast_init": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false,
  "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": false,
        "overlap_events": true
  }
}

I built deepspeed from source with option DS_BUILD_AIO=1:

TORCH_CUDA_ARCH_LIST="7.5" DS_BUILD_AIO=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log

git clone https://github.com/huggingface/peft && cd peft
accelerate launch --config_file ds_zero3_nvme.yaml examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py

Error:

Traceback (most recent call last):
  File "/home/ubuntu/peft/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py", line 367, in <module>
    main()
  File "/home/ubuntu/peft/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py", line 231, in main
    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1467, in _prepare_deepspeed
    "train_batch_size": batch_size_per_device
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Expected behavior
Offloading the optimizer to NVMe should work the same way the CPU offload run does, without raising a TypeError during accelerator.prepare().

ds_report output

[2024-02-13 07:19:25,899] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.7+43daf413, 43daf413, master
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 15.44 GB


Launcher context: launched using accelerate

eshnil2000 commented 7 months ago

I wanted to post an update: I am able to manually set up the NVMe SSD as a swap device and then force DeepSpeed to swap to it by limiting the available CPU RAM, but I still can't get the optimized, out-of-the-box NVMe offload to function.

  1. I'd appreciate hearing if anyone else has hit this issue, and whether you were able to resolve it.
  2. I'd also like to know whether manually setting up the NVMe SSD as a swap device and limiting CPU RAM to force swapping is comparable in performance to DeepSpeed's native NVMe offload.

jomayeri commented 7 months ago

Can you post the rest of the ds_config? Based on the stack trace you gave, the error is coming from Accelerate and involves the batch size setting. Is that the only error?

eshnil2000 commented 7 months ago

The above is the entire ds_config file; there is nothing else. This is the only error. The batch size is set in the main Python script:

    accelerator = Accelerator()
    model_name_or_path = "facebook/bart-large"
    dataset_name = "twitter_complaints"
    peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
    text_column = "Tweet text"
    label_column = "text_label"
    lr = 3e-3
    num_epochs = 2
    batch_size = 8
    seed = 42
    max_length = 64
    do_test = False
    set_seed(seed)

    dataset = load_dataset("ought/raft", dataset_name)
    classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["Label"]]},
        batched=True,
        num_proc=1,
    )

I tried changing the batch size to 1 but got the same error.

jomayeri commented 7 months ago

Since the error is coming from Accelerate and there is no DeepSpeed code in the stack trace, I believe it is a parsing error in their code. After reviewing their code I suggest 1) you switch to the latest version, and 2) you set train_micro_batch_size_per_gpu in your ds_config to ensure it is being parsed correctly.
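
For illustration, a minimal sketch of that change, assuming the per-device micro batch size of 8 used in the script above (the value should match your own batch size; the other keys from the config posted earlier stay unchanged):

{
  "train_micro_batch_size_per_gpu": 8,
  "zero_optimization": {
    "stage": 3
  }
}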

hanhaohh commented 5 months ago

If you are using Accelerate, you have to define train_micro_batch_size_per_gpu in your DeepSpeed configuration file according to this; otherwise train_batch_size ends up being computed as None * int * int.
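
A minimal sketch of that failure mode (assumed variable names; this is not Accelerate's actual implementation, it only paraphrases the frame shown in the traceback):

    # If the DeepSpeed config file does not define train_micro_batch_size_per_gpu,
    # the per-device batch size resolves to None before the multiplication.
    batch_size_per_device = None
    gradient_accumulation_steps = 1
    num_processes = 1

    # Raises: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
    train_batch_size = batch_size_per_device * gradient_accumulation_steps * num_processes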