microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] setting PYTORCH_CUDA_ALLOC_CONF in .deepspeed_env raises return code = -11 #6454

Closed: nomadlx closed this issue 1 month ago

nomadlx commented 2 months ago

Describe the bug
At first I hit a GPU out-of-memory (OOM) error; it reported:

CUDA out of memory.  Tried to allocate 3.74 GiB (GPU 3;  79.35 GiB total capacity;  55.83 GiB already allocated;  3.50 GiB free;  74.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.   See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So I tried setting PYTORCH_CUDA_ALLOC_CONF when running the deepspeed command, but it did not seem to take effect. I then added it to the .deepspeed_env file and got this error:

etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.loader - Loading dataset sft_ydallmath_train.sample_correct_stdtiku_nofigure_dictprompt_20240821.etuning.shuf.jsonl...
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Add <|im_start|> to stop words.
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Add <|im_start|> to stop words.
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Add <|im_start|> to stop words.
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Add <|im_start|> to stop words.
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
etuning-worker-0: 08/28/2024 10:47:55 - INFO - llmtuner.data.template - Add <|im_start|> to stop words.
etuning-worker-0: [2024-08-28 10:47:55,802] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 575
etuning-worker-0: [2024-08-28 10:47:55,855] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 576
etuning-worker-0: [2024-08-28 10:47:55,856] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 577
etuning-worker-0: [2024-08-28 10:47:55,856] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 578
etuning-worker-0: [2024-08-28 10:47:55,857] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 579
etuning-worker-0: [2024-08-28 10:47:55,857] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 580
etuning-worker-0: [2024-08-28 10:47:55,858] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 581
etuning-worker-0: [2024-08-28 10:47:55,858] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 582
etuning-worker-0: [2024-08-28 10:47:55,858] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', '/opt/LLaMA-Efficient-Tuning/src/train_bash.py', '--local_rank=7', '--stage', 'sft', '--model_name_or_path', '/exp_dir_etuning/etuning-2024.06.27-14:10:45/models/checkpoint-576-std', '--do_train', '--dataset_dir', '/train_data/etuning_data', '--dataset', 'sft_train.sample_correct_20240821', '--template', 'chatml', '--finetuning_type', 'full', '--output_dir', '/exp_dir_etuning/etuning-2024.08.28-10:47:40/models', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '15', '--preprocessing_num_workers', '1', '--lr_scheduler_type', 'cosine', '--logging_steps', '2', '--cutoff_len', '2048', '--eval_steps', '67', '--save_total_limit', '4', '--warmup_steps', '10', '--learning_rate', '1e-5', '--max_grad_norm', '1.0', '--num_train_epochs', '4.0', '--val_size', '0.01', '--save_strategy', 'epoch', '--evaluation_strategy', 'steps', '--save_only_model', '--bf16', '--plot_loss', '--deepspeed', '/train_config/ds_config/ds_config_stage3_lowmem.json'] exits with return code = -11
pdsh@etuning-auncher: etuning-worker-0: ssh exited with exit code 245
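
For reference, the entry in .deepspeed_env is a plain KEY=VALUE line, placed where the deepspeed launcher looks for the file (the launch directory or the home directory). A minimal sketch of what I added (the max_split_size_mb value below is only an example, not the exact value I used):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128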

Expected behavior
I want to do full training of a 70B-scale LLM. I used 8× A800 (80 GB); in theory it fits with bf16 (70 × 2 × 4 = 560 < 80 × 8 = 640).
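
Spelling that out (a sketch of my own back-of-the-envelope accounting; reading the "2 × 4" as four bf16-sized copies per parameter is an assumption):

params_b = 70                 # parameters, in billions
bytes_per_param = 2 * 4       # four 2-byte (bf16) copies per parameter
model_states_gb = params_b * bytes_per_param   # 560 GB of model states in total
aggregate_hbm_gb = 8 * 80                      # 8 x A800 80GB = 640 GB aggregate
print(model_states_gb, "<", aggregate_hbm_gb)  # activations and fragmentation come on top of this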

DeepSpeed config (the JSON passed via --deepspeed above, ds_config_stage3_lowmem.json):

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 12,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16":{
        "enabled":"auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
tohtana commented 2 months ago

Hi @nomadlx, DeepSpeed reads .deepspeed_env when it launches processes. As long as you see outputs in the log, the processes should have been launched successfully. It is unlikely that DeepSpeed's handling of .deepspeed_env caused the error.
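
For illustration, the mechanism is roughly the following (a simplified sketch of the idea, not the actual DeepSpeed launcher source):

import os

def load_deepspeed_env(path=os.path.expanduser("~/.deepspeed_env")):
    # Collect KEY=VALUE lines from a .deepspeed_env-style file.
    env = {}
    if os.path.isfile(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and "=" in line:
                    key, _, value = line.partition("=")
                    env[key] = value
    return env

# A multi-node launcher can prepend exports like these to the command it runs on each
# worker over pdsh/ssh, so the variables exist before the training script starts.
exports = "".join(f"export {k}={v}; " for k, v in load_deepspeed_env().items())
print(exports)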

I suggest checking how your application code progresses; the change to PYTORCH_CUDA_ALLOC_CONF might affect it. A simple reproducer would be helpful for us if you still see suspicious behavior in DeepSpeed.
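
For example, something like this hypothetical repro.py, launched with the same deepspeed command and hostfile as the training job, would confirm whether the variable reaches every rank and whether a trivial CUDA allocation still works:

import os
import torch

def main():
    # LOCAL_RANK is set by the DeepSpeed launcher for each worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    print(f"rank {local_rank}: PYTORCH_CUDA_ALLOC_CONF={os.environ.get('PYTORCH_CUDA_ALLOC_CONF')}")
    x = torch.empty(1024, 1024, device="cuda")  # a tiny allocation through the caching allocator
    print(f"rank {local_rank}: allocated {x.numel() * x.element_size()} bytes without error")

if __name__ == "__main__":
    main()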

tohtana commented 1 month ago

Since we haven’t received any additional information, we’re closing this issue for now. Please feel free to reopen it if you have more details to share.