microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Infinite Hang During Initial Evaluation Loop Using LLAVA with DeepSpeed #6680

Closed GonyRosenman closed 3 weeks ago

GonyRosenman commented 3 weeks ago

I am encountering an infinite hang during the initial evaluation loop while training a custom LLAVA model with HuggingFace's Trainer class. It happens only when I pass my own compute_metrics function to the custom Trainer subclass, as outlined below. Notably, the same configuration runs (although extremely slowly) if I use the alternative super() call commented out in the code.

Steps to Reproduce (High-Level Code)

from collections import defaultdict

from transformers import Trainer


class LLaVATrainer(Trainer):
    def __init__(self, model, tokenizer, args, **kwargs):
        # Passing bound methods as compute_metrics / preprocess_logits_for_metrics
        # (defined on this class, omitted here) triggers the hang during evaluation.
        super().__init__(model=model, tokenizer=tokenizer, args=args,
                         compute_metrics=self.compute_metrics,
                         preprocess_logits_for_metrics=self.preprocess_logits_for_metrics,
                         **kwargs)
        # Alternative version below works fine (no hang, but slow):
        # super().__init__(model=model, tokenizer=tokenizer, args=args, **kwargs)
        self.tokenizer = tokenizer
        self.global_results = defaultdict(list)  # Used to accumulate batch-level scores

trainer = LLaVATrainer(model, tokenizer, training_args)
trainer.evaluate()
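
The compute_metrics and preprocess_logits_for_metrics methods referenced above are not shown in the issue. A minimal placeholder of what they could look like (the bodies below are illustrative assumptions; only the method names come from the snippet above):

import torch

class LLaVATrainer(Trainer):
    ...
    def preprocess_logits_for_metrics(self, logits, labels):
        # Keep only the argmax token ids to limit memory use during evaluation.
        return torch.argmax(logits, dim=-1)

    def compute_metrics(self, eval_pred):
        # eval_pred.predictions holds the outputs of preprocess_logits_for_metrics;
        # eval_pred.label_ids holds whatever trainer.args.label_names selects from the batch.
        predictions, label_ids = eval_pred.predictions, eval_pred.label_ids
        return {"num_eval_samples": len(predictions)}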

Logs (Last Few NCCL Calls)

rack-gamir-v100: NCCL CALL ncclGroupStart()
rack-gamir-v100: NCCL INFO AllGather: opCount f85 sendbuff 0x7f5604000000 recvbuff 0x7f5604000000 count 33590016 datatype 0
rack-gamir-v100: NCCL CALL ncclGroupEnd()

ds_report


[2024-10-28 15:06:30,938] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/miniconda3/envs/llava/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/home/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 251.65 GB

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1F:00.0 Off |                    0 |
| N/A   39C    P0    70W / 300W |  31445MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:20:00.0 Off |                    0 |
| N/A   39C    P0    66W / 300W |  30179MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   35C    P0    65W / 300W |  30889MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:66:00.0 Off |                    0 |
| N/A   35C    P0    64W / 300W |  31073MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:B6:00.0 Off |                    0 |
| N/A   37C    P0    65W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   35C    P0    65W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:DF:00.0 Off |                    0 |
| N/A   39C    P0    69W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   38C    P0    68W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8731      C   ...envs/llava/bin/python    31442MiB |
|    1   N/A  N/A      8732      C   ...envs/llava/bin/python    30176MiB |
|    2   N/A  N/A      8733      C   ...envs/llava/bin/python    30886MiB |
|    3   N/A  N/A      8734      C   ...envs/llava/bin/python    31070MiB |

Launch command

python -m deepspeed.launcher.launch --world_info='{"127.0.0.1":[0,1,2]}' --master_addr=127.0.0.1 --master_port=4242 --no_local_rank /path/to/llava/train.py --lora_enable True --lora_r 128 --lora_alpha 256 --deepspeed ./scripts/zero3.json --num_train_epochs 1 --gradient_checkpointing True --model_name_or_path liuhaotian/llava-v1.5-13b --output_dir ./checkpoints/llava-v1.5-13b-task-lora --do_eval True

Environment Configuration

NCCL debug output: the hang occurs during ncclGroupStart() and ncclAllGather operations.
Kernel compatibility: based on similar issues in the DeepSpeed and NCCL communities, this might be related to GPU communication (e.g., issues with peer-to-peer or collective operations).

Troubleshooting Attempts

Tried NCCL workarounds:

export NCCL_P2P_DISABLE=1
export NCCL_LL_THRESHOLD=0

These did not resolve the issue.

GonyRosenman commented 3 weeks ago

After further inspection, I discovered that the issue is not directly related to NCCL or DeepSpeed communication. Instead, it comes from the way I modified trainer.args.label_names: I added 'metadata' to label_names so that a metadata tensor could be passed through the collate_fn. This turns labels into a tuple of tensors (i.e., (labels, metadata)), which appears to trigger an infinite hang when processed by self.gather_function.
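
For reference, the change described above would look roughly like this (a reconstruction; the exact line is not shown in the issue):

training_args.label_names = ["labels", "metadata"]  # collate_fn also returns a per-sample 'metadata' tensor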

I confirmed this behavior with the following minimal reproducible example:

import torch  # 'labels' is the per-batch label tensor from the evaluation loop; 'dumm' stands in for the metadata tensor
dumm = torch.randint(0, 100, (1, 36)).to('cuda:2')
self.gather_function((labels, dumm))  # hangs here

This snippet causes the same infinite hang as in the original issue. It seems the gather_function from the Accelerator class is not designed to handle tuples of tensors, leading to the deadlock.

I’ll need to explore alternatives for passing metadata through the evaluation loop without using label_names to avoid this issue.
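
One alternative sketch (untested, an assumption on my part rather than anything confirmed in this issue): drop 'metadata' from label_names and intercept the tensor in an overridden prediction_step, accumulating it in the existing global_results dict so that self.gather_function is never handed a tuple:

class LLaVATrainer(Trainer):
    ...
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Pop the extra tensor before the base class builds label_ids, so labels stay a
        # single tensor and self.gather_function never receives a tuple.
        metadata = inputs.pop("metadata", None)
        if metadata is not None:
            self.global_results["metadata"].append(metadata.detach().cpu())
        return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)

Metadata collected this way stays per-process; if a globally aggregated view is needed, it would still have to be gathered explicitly (e.g., with a separate accelerator.gather call on a shape-consistent tensor).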

jomayeri commented 3 weeks ago

Great, thanks for the update!

luadamek commented 1 week ago

You saved me! Thank you!