microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] When using Zero-Infinity, Assertion `n_completes >= min_completes' failed #4888

Closed IvoryTower800 closed 1 month ago

IvoryTower800 commented 10 months ago

Describe the bug I can use my script to finetune the model with ZeRO stage 2 and stage 3. However, when I use ZeRO-Infinity to offload parameters, this error occurs:

python: /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp:125: int _do_io_complete(long long int, long long int, std::unique_ptr&, std::vector<std::chrono::duration >&): Assertion `n_completes >= min_completes' failed.
[2023-12-31 02:55:09,375] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 302

I searched Google but found nothing. This is beyond my knowledge and I have no idea how to fix it.

Thank you!
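For context, the ds_config_zero3.json referenced in the launch command below was not posted. The following is a minimal sketch of the kind of ZeRO-3 config that offloads parameters to NVMe, and therefore exercises the DeepSpeed AIO path where this assertion fires; the model, nvme_path, and buffer sizes are placeholders, not the reporter's actual settings.

import torch
import deepspeed

# Placeholder ZeRO-Infinity config: stage 3 with parameters offloaded to NVMe.
# nvme_path must point at a mount backed by an NVMe SSD.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 100000000,
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

# Stand-in model just to show how the config is consumed; the real run goes
# through LLaMA-Factory's train_bash.py with --deepspeed ds_config_zero3.json.
model = torch.nn.Linear(1024, 1024)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)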


ds_report output

ds_report

[2023-12-31 02:58:42,953] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.7+40342055, 40342055, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 377.29 GB


Launcher context Launching with the deepspeed launcher:

CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 /code/we_media/LLaMA-Factory/src/train_bash.py \
    --model_name_or_path /code/model/writer_1.3b_01_hf \
    --dataset_dir /code/dataset/ \
    --output_dir /code/output/writer_1.3b \
    --flash_attn \
    --dataset dpo_data \
    --stage dpo \
    --do_train True \
    --finetuning_type lora \
    --template llama2_zh \
    --cutoff_len 16384 \
    --learning_rate 1e-4 \
    --preprocessing_num_workers 8 \
    --num_train_epochs 1.0 \
    --max_samples 1000000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --save_steps 100 \
    --warmup_steps 0 \
    --neftune_noise_alpha 5 \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout 0 \
    --lora_target all \
    --bf16 True \
    --plot_loss True \
    --overwrite_output_dir True \
    --deepspeed ds_config_zero3.json

loadams commented 10 months ago

@IvoryTower800 - can you share your train_bash.py script? And can you share a larger part of the output error log?

xvanQ commented 10 months ago

I got the same error. If I install directly with pip install deepspeed, this error occurs. If I pre-compile the ops instead, i.e. DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install deepspeed, the run gets stuck at some point and CPU usage keeps increasing.

hjc3613 commented 3 months ago

The same error occurred when using NVMe offload; my DeepSpeed config is attached as a screenshot.

tjruwase commented 3 months ago

@hjc3613, @xvanQ, can you share repro steps? Also, can you confirm that your nvme_path is configured to point at an NVMe SSD?

hjc3613 commented 3 months ago

> @hjc3613, @xvanQ, can you share repro steps? Also, can you confirm that your nvme_path is configured to point at an NVMe SSD?

Yes, I have set nvme_path to an NVMe SSD. You can reproduce it by cloning my repo https://github.com/hjc3613/OpenRLHF and running cd OpenRLHF && sh examples/scripts/train_sft_qwen2.sh. Please replace --pretrain and --dataset for your case; the dataset can be any jsonl or excel file containing "input" and "output" columns. The deepspeed config can be found at "openrlhf/utils/deepspeed_utils.py:get_train_ds_config".
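For reference, one lightweight way to double-check which block device actually backs nvme_path is sketched below. This is not part of DeepSpeed; it assumes findmnt from util-linux is available and that the mount is not hidden behind LVM or RAID.

import os
import subprocess

nvme_path = "/local_nvme"  # placeholder; use the nvme_path from your ds config
# Resolve the block device that backs the filesystem containing nvme_path.
source = subprocess.check_output(
    ["findmnt", "-n", "-o", "SOURCE", "--target", nvme_path], text=True
).strip()
device = os.path.basename(source)
# NVMe block devices show up as nvme0n1, nvme0n1p1, and so on.
verdict = "looks like an NVMe device" if device.startswith("nvme") else "does NOT look like an NVMe device"
print(f"{nvme_path} is backed by {source}: {verdict}")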

chenwuperth commented 2 months ago

I also encountered this. It should be fixed now in 0.15.1; it seems an integer overflow caused a negative max_complete during _do_io_complete.

See the diff https://github.com/microsoft/DeepSpeed/commit/e2654bfd1ab431bde088a7501ed01b503daa5ab1
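For anyone hitting this on an older build, a quick way to check whether the installed DeepSpeed already includes the change is sketched below; it assumes the fix shipped in 0.15.1 as described above.

# packaging is widely available (it is a pip dependency).
from packaging import version
import deepspeed

installed = version.parse(deepspeed.__version__)
if installed < version.parse("0.15.1"):
    print(f"DeepSpeed {installed} predates the reported fix; consider upgrading, e.g. pip install -U deepspeed")
else:
    print(f"DeepSpeed {installed} should include the fix")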

tjruwase commented 1 month ago

> I also encountered this. It should be fixed now in 0.15.1; it seems an integer overflow caused a negative max_complete during _do_io_complete.
>
> See the diff e2654bf

Thanks! Closing as fixed.