microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Apache License 2.0

Changing offload to NVMe instead of CPU causes error: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' #5124

Closed eshnil2000 closed 6 months ago

eshnil2000 commented 7 months ago

Describe the bug

I'm able to run peft pre-trained models from:

I can successfully offload to CPU memory using DeepSpeed. The model I used is "facebook/bart-large".

But when I try to run the exact same model with offload to NVMe, I get this error: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

My accelerate config file "ds_zero3_nvme.yaml"

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/ubuntu/peft/deepspeed_config.json
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My deepspeed config file "deepspeed_config.json":

{
  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 3e9,
    "stage3_max_reuse_distance": 3e9,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_prefetch_bucket_size": 5e7,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_bucket_size": 90000000,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/mnt/nvme1",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false,
  "aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  }
}

I built deepspeed from source with option DS_BUILD_AIO=1:

TORCH_CUDA_ARCH_LIST="7.5" DS_BUILD_AIO=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log

To reproduce:

git clone && cd peft
accelerate launch --config_file ds_zero3_nvme.yaml examples/causal_language_modeling/


Traceback (most recent call last):
  File "/home/ubuntu/peft/examples/causal_language_modeling/", line 367, in <module>
    main()
  File "/home/ubuntu/peft/examples/causal_language_modeling/", line 231, in main
    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/", line 1220, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/", line 1467, in _prepare_deepspeed
    "train_batch_size": batch_size_per_device
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Expected behavior A clear and concise description of what you expected to happen.

ds_report output

[2024-02-13 07:19:25,899] [INFO] [] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.7+43daf413, 43daf413, master
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 15.44 GB

System info (please complete the following information):

Launcher context launched using accelerate

eshnil2000 commented 7 months ago

I wanted to update: I am able to manually set up an NVMe SSD as a swap device and then force DeepSpeed to swap to it by limiting the available CPU RAM, but I still can't get the optimized, out-of-the-box NVMe offload to work.

  1. I'd appreciate hearing from anyone else who has hit this issue, and whether you were able to resolve it.
  2. I wanted to find out whether manually setting up an NVMe SSD as a swap device, and limiting CPU RAM to force swapping to it, is similar in performance to the native DeepSpeed setup for offloading to NVMe.

jomayeri commented 7 months ago

Can you post the rest of the ds_config? Based on the stack trace you gave, the error is coming from accelerate and involves the batch size setting. Is that the only error?

eshnil2000 commented 7 months ago

The above is the entire ds_config file, there is nothing else. This is the only error. Batch size is set in the main python script:

    accelerator = Accelerator()
    model_name_or_path = "facebook/bart-large"
    dataset_name = "twitter_complaints"
    peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
    text_column = "Tweet text"
    label_column = "text_label"
    lr = 3e-3
    num_epochs = 2
    batch_size = 8
    seed = 42
    max_length = 64
    do_test = False

    dataset = load_dataset("ought/raft", dataset_name)
    classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["Label"]]},
        batched=True,
    )

I tried changing the batch size to 1 but got the same error.

jomayeri commented 7 months ago

Since the error is coming from accelerate and there is no part of the DeepSpeed code in the stack trace, I believe it is a parsing error in their code. After reviewing their code, I suggest that you 1) switch to the latest version, and 2) set train_micro_batch_size_per_gpu in your ds_config to ensure it is being parsed correctly.
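
For illustration, the second suggestion could be sketched as below. This is a minimal sketch, not part of DeepSpeed or accelerate: the helper name with_micro_batch_size is hypothetical, the config dict is a trimmed stand-in for the one posted earlier, and the value 8 is assumed from batch_size = 8 in the training script.

```python
import json


def with_micro_batch_size(ds_config: dict, micro_batch_size: int) -> dict:
    """Return a copy of ds_config with train_micro_batch_size_per_gpu set.

    Hypothetical helper for illustration only; not a DeepSpeed or accelerate API.
    """
    if not isinstance(micro_batch_size, int) or micro_batch_size < 1:
        raise ValueError("micro_batch_size must be a positive integer")
    patched = dict(ds_config)  # shallow copy; nested sections are reused, not modified
    patched["train_micro_batch_size_per_gpu"] = micro_batch_size
    return patched


# Trimmed stand-in for the posted config (only a few of its keys are shown).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme1"},
    },
    "gradient_clipping": 1.0,
}

# Assumption: 8 mirrors batch_size = 8 from the training script.
patched = with_micro_batch_size(ds_config, 8)
print(json.dumps(patched, indent=2))
```

The printed JSON is what would go into deepspeed_config.json; the only change from the original is the added top-level train_micro_batch_size_per_gpu key.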

hanhaohh commented 5 months ago

If you are using accelerate, you have to define train_micro_batch_size_per_gpu in your DeepSpeed configuration file according to this; otherwise it's None * int * int for train_batch_size.
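
The arithmetic behind that failure can be reproduced in plain Python. This is a simplified stand-in, not accelerate's actual code: the function name train_batch_size is hypothetical, and the gradient accumulation steps and process count of 1 are assumptions for the example. It relies only on the DeepSpeed relation that the global batch size is the micro batch size per GPU times the gradient accumulation steps times the number of processes.

```python
def train_batch_size(micro_batch_size, gradient_accumulation_steps, num_processes):
    """Global batch size as the product of the three factors.

    Hypothetical stand-in for the computation; not accelerate's implementation.
    """
    return micro_batch_size * gradient_accumulation_steps * num_processes


# With the micro batch size defined, the product is an ordinary integer.
print(train_batch_size(8, 1, 1))

# With the micro batch size missing (None), the first multiplication fails.
try:
    train_batch_size(None, 1, 1)
except TypeError as err:
    message = str(err)
    print(message)  # unsupported operand type(s) for *: 'NoneType' and 'int'
```

The message matches the one in the issue title, which is consistent with the batch size reaching the multiplication as None when train_micro_batch_size_per_gpu is absent from the config.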