huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Error when saving ckpt on slurm #2960

Open Nightmare-n opened 1 month ago

Nightmare-n commented 1 month ago

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
- `accelerate` bash location: /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/accelerate
- Python version: 3.8.19
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.4.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 754.36 GB
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 1,2
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Reproduction

1. script:

#!/bin/bash
#SBATCH -J train_job
#SBATCH -p partition
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
# some SBATCH parameters are omitted here

srun accelerate launch --config_file ~/.cache/huggingface/accelerate/fsdp_multinodes.yaml \
  --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) --machine_rank $NODE_RANK --num_machines $SLURM_JOB_NUM_NODES --rdzv_backend c10d --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT \
  train.py

2. code (a self-contained sketch of how this method is driven from the training loop follows the config below):

    def save_checkpoint(self):
        assert self.accelerator.sync_gradients
        if self.global_iter == self.last_saved_iter:
            self.logger.warning(
                f"checkpoint for global_iter: {self.global_iter} has been saved, pass"
            )
            return
        # advance the ProjectConfiguration iteration used for automatic checkpoint naming
        self.accelerator.project_configuration.iteration = self.global_iter + 1
        self.accelerator.save_state()
        self.last_saved_iter = self.global_iter
        self.logger.info(f"Save checkpoint for global_iter: {self.global_iter + 1}")

3. config:

    compute_environment: LOCAL_MACHINE
    debug: true
    distributed_type: FSDP
    downcast_bf16: 'no'
    enable_cpu_affinity: false
    fsdp_config:
      fsdp_activation_checkpointing: true
      fsdp_auto_wrap_policy: SIZE_BASED_WRAP
      fsdp_backward_prefetch: BACKWARD_PRE
      fsdp_cpu_ram_efficient_loading: false
      fsdp_forward_prefetch: false
      fsdp_min_num_params: 20000000
      fsdp_offload_params: false
      fsdp_sharding_strategy: FULL_SHARD
      fsdp_state_dict_type: SHARDED_STATE_DICT
      fsdp_sync_module_states: true
      fsdp_use_orig_params: true
    machine_rank: 0
    main_process_ip: 10.140.24.79
    main_process_port: 10078
    main_training_function: main
    mixed_precision: fp16
    num_machines: 2
    num_processes: 4
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
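
For context, a minimal, self-contained sketch of how `save_checkpoint` is driven from the training loop is below (the model, data, and paths are placeholders, not my actual `train.py`). The relevant detail is that `accelerator.save_state()` with `SHARDED_STATE_DICT` is a collective call, so every rank reaches it at the same iteration and it is never gated on `is_main_process`:

    # Minimal sketch (placeholder model/data/paths, not the real train.py).
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator
    from accelerate.utils import ProjectConfiguration

    accelerator = Accelerator(
        project_config=ProjectConfiguration(
            project_dir="output/debug",        # placeholder output directory
            automatic_checkpoint_naming=True,  # saves to checkpoints/checkpoint_<iteration>
        )
    )

    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 32))
    dataloader = DataLoader(dataset, batch_size=8)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for global_iter, (x, y) in enumerate(dataloader):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        if accelerator.sync_gradients and (global_iter + 1) % 10 == 0:
            accelerator.wait_for_everyone()   # sync all ranks before the collective save
            accelerator.project_configuration.iteration = global_iter + 1
            accelerator.save_state()          # every rank writes its own FSDP shard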

Expected behavior

Checkpoint saving works fine on a single node, but the error below occurs as soon as the job runs on two nodes.
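
To help narrow this down, the hypothetical helper below (not part of my training code) can be called right before `save_state()` to confirm that all 16 ranks can still complete a collective across the two nodes:

    # Hypothetical diagnostic helper: run a tiny collective right before
    # save_state() to check that every rank is still participating.
    import torch

    def check_all_ranks(accelerator, logger):
        marker = torch.tensor([accelerator.process_index], device=accelerator.device)
        gathered = accelerator.gather(marker)  # collective across all processes
        if accelerator.is_main_process:
            seen = sorted(gathered.tolist())
            expected = list(range(accelerator.num_processes))
            if seen == expected:
                logger.info(f"all {accelerator.num_processes} ranks reachable before checkpoint")
            else:
                logger.warning(f"rank mismatch before checkpoint: expected {expected}, got {seen}")
        accelerator.wait_for_everyone()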

Nightmare-n commented 1 month ago
07/26/2024 10:59:38   INFO  Saving FSDP model
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[the same FutureWarning is emitted once per rank; the remaining identical copies are omitted]
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
[these four ShardedTensor deprecation FutureWarnings repeat once per rank; the remaining identical copies are omitted]
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
[this FutureWarning repeats once per rank; the remaining identical copies are omitted]
07/26/2024 10:59:41   INFO  Saving model to /mnt/petrelfs/yanghonghui/projects/OpenPCDet_torch2.4.0/output/game_models/sd3_rope/debug/checkpoints/checkpoint_1/pytorch_model_fsdp_0
SH-IDC1-10-140-24-106:147874:192057 [6] NCCL INFO Channel 00/1 : 6[6] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147866:192060 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214042:258021 [7] NCCL INFO Channel 01/1 : 15[7] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-95:214040:258019 [6] NCCL INFO Channel 01/1 : 14[6] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147867:192061 [2] NCCL INFO Channel 00/1 : 2[2] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214035:258020 [4] NCCL INFO Channel 01/1 : 12[4] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147869:192059 [4] NCCL INFO Channel 00/1 : 4[4] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214038:258026 [5] NCCL INFO Channel 01/1 : 13[5] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147872:192064 [5] NCCL INFO Channel 00/1 : 5[5] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214033:258022 [2] NCCL INFO Channel 01/1 : 10[2] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147876:192058 [7] NCCL INFO Channel 00/1 : 7[7] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214034:258023 [3] NCCL INFO Channel 01/1 : 11[3] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-95:214032:258024 [1] NCCL INFO Channel 01/1 : 9[1] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147868:192062 [3] NCCL INFO Channel 00/1 : 3[3] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214031:258025 [0] NCCL INFO Channel 01/1 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 15[7] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 14[6] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 13[5] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 12[4] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 11[3] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 10[2] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 9[1] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 8[0] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214040:258033 [6] NCCL INFO Channel 01/1 : 0[0] -> 14[6] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214035:258034 [4] NCCL INFO Channel 01/1 : 0[0] -> 12[4] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214042:258032 [7] NCCL INFO Channel 01/1 : 0[0] -> 15[7] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214034:258035 [3] NCCL INFO Channel 01/1 : 0[0] -> 11[3] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214033:258030 [2] NCCL INFO Channel 01/1 : 0[0] -> 10[2] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214031:258037 [0] NCCL INFO Channel 01/1 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214032:258036 [1] NCCL INFO Channel 01/1 : 0[0] -> 9[1] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 3[3] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214038:258038 [5] NCCL INFO Channel 01/1 : 0[0] -> 13[5] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 4[4] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 5[5] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 6[6] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 7[7] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 9[1] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 10[2] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 11[3] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 12[4] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 13[5] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 14[6] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 15[7] [send] via NET/IB/0/GDRDMA/Shared
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009e8 000045d2
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009e6 00004bd2

SH-IDC1-10-140-24-95:214034:214782 [3] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37760> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO transport/net.cc:1374 -> 6

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ec 000041d2

SH-IDC1-10-140-24-95:214034:214782 [3] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37760> with status=5 opcode=129 len=3 vendor err 249 (Flush)

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ea 000047d2

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009f0 00005dd2

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ee 000043d2

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
[rank15]:[E726 10:59:43.692836266 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 15] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank10]:[E726 10:59:43.692913143 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 10] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank11]:[E726 10:59:43.692913153 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 11] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank13]:[E726 10:59:43.693201512 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 13] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank12]:[E726 10:59:43.695610972 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 12] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank14]:[E726 10:59:43.724105047 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 14] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 2
SH-IDC1-10-140-24-95:214033:214763 [2] NCCL INFO [Service thread] Connection closed by localRank 2
SH-IDC1-10-140-24-95:214033:214190 [0] NCCL INFO comm 0x99e59c0 rank 10 nranks 16 cudaDev 2 busId 65000 - Abort COMPLETE
[rank10]:[E726 10:59:44.260357278 ProcessGroupNCCL.cpp:621] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank10]:[E726 10:59:44.260366596 ProcessGroupNCCL.cpp:627] [Rank 10] To avoid data inconsistency, we are taking the entire process down.
[rank10]:[E726 10:59:44.260431229 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 10] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc0e41f2f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fc0e54df7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fc0e54dfa3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fc0e54e6923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc0e54e8d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7fc147bb6bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7fc156e4edd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7fc15646eead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 4
SH-IDC1-10-140-24-95:214035:214760 [4] NCCL INFO [Service thread] Connection closed by localRank 4
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 5
SH-IDC1-10-140-24-95:214038:214761 [5] NCCL INFO [Service thread] Connection closed by localRank 5
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 6
SH-IDC1-10-140-24-95:214040:214762 [6] NCCL INFO [Service thread] Connection closed by localRank 6
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 7
SH-IDC1-10-140-24-95:214042:214774 [7] NCCL INFO [Service thread] Connection closed by localRank 7
SH-IDC1-10-140-24-95:214035:214179 [0] NCCL INFO comm 0x93a8c00 rank 12 nranks 16 cudaDev 4 busId a3000 - Abort COMPLETE
[rank12]:[E726 10:59:44.284494485 ProcessGroupNCCL.cpp:621] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank12]:[E726 10:59:44.284505756 ProcessGroupNCCL.cpp:627] [Rank 12] To avoid data inconsistency, we are taking the entire process down.
[rank12]:[E726 10:59:44.284578655 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 12] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0b05001f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f0b062ee7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f0b062eea3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f0b062f5923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0b062f7d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f0b689c5bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f0b77c5ddd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f0b7727dead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214038:214193 [0] NCCL INFO comm 0xaa13c80 rank 13 nranks 16 cudaDev 5 busId a8000 - Abort COMPLETE
[rank13]:[E726 10:59:44.290108228 ProcessGroupNCCL.cpp:621] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank13]:[E726 10:59:44.290121924 ProcessGroupNCCL.cpp:627] [Rank 13] To avoid data inconsistency, we are taking the entire process down.
[rank13]:[E726 10:59:44.290391618 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 13] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7832bd1f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f7833ebe7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f7833ebea3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f7833ec5923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7833ec7d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f7896595bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f78a582ddd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f78a4e4dead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214040:214204 [0] NCCL INFO comm 0x9a22940 rank 14 nranks 16 cudaDev 6 busId e1000 - Abort COMPLETE
[rank14]:[E726 10:59:44.293817263 ProcessGroupNCCL.cpp:621] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank14]:[E726 10:59:44.293827222 ProcessGroupNCCL.cpp:627] [Rank 14] To avoid data inconsistency, we are taking the entire process down.
[rank14]:[E726 10:59:44.293894821 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 14] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2019fe6f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f201b2d37f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f201b2d3a3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f201b2da923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f201b2dcd2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f207d9aabf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f208cc42dd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f208c262ead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214042:214187 [0] NCCL INFO comm 0xaf1b6c0 rank 15 nranks 16 cudaDev 7 busId e7000 - Abort COMPLETE
[rank15]:[E726 10:59:44.300514320 ProcessGroupNCCL.cpp:621] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank15]:[E726 10:59:44.300529779 ProcessGroupNCCL.cpp:627] [Rank 15] To avoid data inconsistency, we are taking the entire process down.
[rank15]:[E726 10:59:44.300604481 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 15] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3a48ee4f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f3a4a1d17f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f3a4a1d1a3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f3a4a1d8923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3a4a1dad2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f3aac8a8bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f3abbb40dd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f3abb160ead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 3
SH-IDC1-10-140-24-95:214034:214764 [3] NCCL INFO [Service thread] Connection closed by localRank 3
SH-IDC1-10-140-24-95:214034:214181 [0] NCCL INFO comm 0xdd87e550 rank 11 nranks 16 cudaDev 3 busId 6a000 - Abort COMPLETE
[rank11]:[E726 10:59:44.365679787 ProcessGroupNCCL.cpp:621] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank11]:[E726 10:59:44.365693163 ProcessGroupNCCL.cpp:627] [Rank 11] To avoid data inconsistency, we are taking the entire process down.
[rank11]:[E726 10:59:44.365753788 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 11] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<37760> with status=5 opcode=129 len=3 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f84811b2f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f848249f7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f848249fa3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f84824a6923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84824a8d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f84e4b76bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f84f3e0edd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f84f342eead in /lib64/libc.so.6)

W0726 10:59:45.447573 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214031 closing signal SIGTERM
W0726 10:59:45.448162 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214032 closing signal SIGTERM
W0726 10:59:45.504283 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214034 closing signal SIGTERM
W0726 10:59:45.509025 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214035 closing signal SIGTERM
W0726 10:59:45.542702 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214038 closing signal SIGTERM
W0726 10:59:45.546265 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214040 closing signal SIGTERM
W0726 10:59:45.588805 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214042 closing signal SIGTERM
E0726 10:59:49.951559 139901143013184 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 2 (pid: 214033) of binary: /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/python
Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train_wrapper.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_10:59:45
  host      : SH-IDC1-10-140-24-95
  rank      : 10 (local_rank: 2)
  exitcode  : -6 (pid: 214033)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 214033
=======================================================
srun: error: SH-IDC1-10-140-24-95: task 0: Exited with exit code 1
W0726 10:59:50.786825 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147865 closing signal SIGTERM
W0726 10:59:50.787296 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147866 closing signal SIGTERM
W0726 10:59:50.842426 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147867 closing signal SIGTERM
W0726 10:59:50.868046 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147868 closing signal SIGTERM
W0726 10:59:50.893290 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147869 closing signal SIGTERM
W0726 10:59:50.923330 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147872 closing signal SIGTERM
W0726 10:59:50.944047 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147874 closing signal SIGTERM
W0726 10:59:50.962826 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147876 closing signal SIGTERM
W0726 10:59:52.793814 140534967199488 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1267] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0726 10:59:56.810442 140538811684672 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0726 10:59:56.811984 140538811684672 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 114, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 867, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1189, in num_nodes_waiting
    self._state_holder.sync()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 428, in sync
    get_response = self._backend.get_state()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 74, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 116, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: SH-IDC1-10-140-24-106: task 1: Exited with exit code 1
Nightmare-n commented 1 month ago

Changing `fsdp_state_dict_type: SHARDED_STATE_DICT` to `FULL_STATE_DICT` avoids this error (see the config sketch below). However, when resuming from that checkpoint, multiple processes end up on the same GPU.

(screenshot: after resuming, several training processes are placed on the same GPU)
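For reference, a minimal sketch of this workaround, assuming everything else stays as in the config posted above; the keys are taken from that config and only the state-dict setting is changed:

```yaml
# Excerpt of fsdp_multinodes.yaml with only the state-dict type changed.
# FULL_STATE_DICT gathers an unsharded state dict at save time (typically on
# rank 0, offloaded to CPU), so saving is slower and uses more host memory
# than SHARDED_STATE_DICT, but it sidesteps the multi-node NCCL failure above.
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_min_num_params: 20000000
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT   # was: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
```

Note that this is only a save-time workaround; it does not address the resume-time problem of all ranks landing on the same GPU.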
forever208 commented 1 month ago

I have the same issue, any suggestions?