Error when saving ckpt on slurm

07/26/2024 10:59:38   INFO  Saving FSDP model
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  local_shape = tensor.shape
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.shape,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.dtype,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  tensor.device,
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
07/26/2024 10:59:41   INFO  Saving model to /mnt/petrelfs/yanghonghui/projects/OpenPCDet_torch2.4.0/output/game_models/sd3_rope/debug/checkpoints/checkpoint_1/pytorch_model_fsdp_0
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py:90: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
  dist_cp.save_state_dict(
SH-IDC1-10-140-24-106:147874:192057 [6] NCCL INFO Channel 00/1 : 6[6] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147866:192060 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214042:258021 [7] NCCL INFO Channel 01/1 : 15[7] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-95:214040:258019 [6] NCCL INFO Channel 01/1 : 14[6] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147867:192061 [2] NCCL INFO Channel 00/1 : 2[2] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214035:258020 [4] NCCL INFO Channel 01/1 : 12[4] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147869:192059 [4] NCCL INFO Channel 00/1 : 4[4] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214038:258026 [5] NCCL INFO Channel 01/1 : 13[5] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147872:192064 [5] NCCL INFO Channel 00/1 : 5[5] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214033:258022 [2] NCCL INFO Channel 01/1 : 10[2] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147876:192058 [7] NCCL INFO Channel 00/1 : 7[7] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214034:258023 [3] NCCL INFO Channel 01/1 : 11[3] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-95:214032:258024 [1] NCCL INFO Channel 01/1 : 9[1] -> 0[0] [send] via NET/IB/0(8)/GDRDMA/Shared
SH-IDC1-10-140-24-106:147868:192062 [3] NCCL INFO Channel 00/1 : 3[3] -> 0[0] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214031:258025 [0] NCCL INFO Channel 01/1 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 15[7] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 14[6] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 13[5] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 12[4] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 11[3] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 10[2] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 9[1] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192063 [0] NCCL INFO Channel 01/1 : 8[0] -> 0[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214040:258033 [6] NCCL INFO Channel 01/1 : 0[0] -> 14[6] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214035:258034 [4] NCCL INFO Channel 01/1 : 0[0] -> 12[4] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214042:258032 [7] NCCL INFO Channel 01/1 : 0[0] -> 15[7] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214034:258035 [3] NCCL INFO Channel 01/1 : 0[0] -> 11[3] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214033:258030 [2] NCCL INFO Channel 01/1 : 0[0] -> 10[2] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-95:214031:258037 [0] NCCL INFO Channel 01/1 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214032:258036 [1] NCCL INFO Channel 01/1 : 0[0] -> 9[1] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 3[3] via P2P/CUMEM/read
SH-IDC1-10-140-24-95:214038:258038 [5] NCCL INFO Channel 01/1 : 0[0] -> 13[5] [receive] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 4[4] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 5[5] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 6[6] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 00/1 : 0[0] -> 7[7] via P2P/CUMEM/read
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 9[1] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 10[2] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 11[3] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 12[4] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 13[5] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 14[6] [send] via NET/IB/0/GDRDMA/Shared
SH-IDC1-10-140-24-106:147865:192085 [0] NCCL INFO Channel 01/1 : 0[0] -> 15[7] [send] via NET/IB/0/GDRDMA/Shared
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009e8 000045d2
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009e6 00004bd2

SH-IDC1-10-140-24-95:214034:214782 [3] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37760> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO transport/net.cc:1374 -> 6

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ec 000041d2

SH-IDC1-10-140-24-95:214034:214782 [3] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37760> with status=5 opcode=129 len=3 vendor err 249 (Flush)

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214034:214782 [3] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214033:214783 [2] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214033:214783 [2] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ea 000047d2

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214038:214779 [5] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214038:214779 [5] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214035:214781 [4] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214035:214781 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009f0 00005dd2

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214042:214778 [7] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214042:214778 [7] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
mlx5: SH-IDC1-10-140-24-95: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 100009ee 000043d2

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=11 opcode=129 len=0 vendor err 137 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

SH-IDC1-10-140-24-95:214040:214780 [6] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO transport/net.cc:1374 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:694 -> 6
SH-IDC1-10-140-24-95:214040:214780 [6] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
[rank15]:[E726 10:59:43.692836266 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 15] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank10]:[E726 10:59:43.692913143 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 10] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank11]:[E726 10:59:43.692913153 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 11] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank13]:[E726 10:59:43.693201512 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 13] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank12]:[E726 10:59:43.695610972 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 12] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
[rank14]:[E726 10:59:43.724105047 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 14] Exception (either an error or timeout) detected by watchdog at work: 184, last enqueued NCCL work: 184, last completed NCCL work: 183.
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 2
SH-IDC1-10-140-24-95:214033:214763 [2] NCCL INFO [Service thread] Connection closed by localRank 2
SH-IDC1-10-140-24-95:214033:214190 [0] NCCL INFO comm 0x99e59c0 rank 10 nranks 16 cudaDev 2 busId 65000 - Abort COMPLETE
[rank10]:[E726 10:59:44.260357278 ProcessGroupNCCL.cpp:621] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank10]:[E726 10:59:44.260366596 ProcessGroupNCCL.cpp:627] [Rank 10] To avoid data inconsistency, we are taking the entire process down.
[rank10]:[E726 10:59:44.260431229 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 10] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<27076> with status=5 opcode=129 len=2 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc0e41f2f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fc0e54df7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fc0e54dfa3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fc0e54e6923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc0e54e8d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7fc147bb6bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7fc156e4edd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7fc15646eead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 4
SH-IDC1-10-140-24-95:214035:214760 [4] NCCL INFO [Service thread] Connection closed by localRank 4
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 5
SH-IDC1-10-140-24-95:214038:214761 [5] NCCL INFO [Service thread] Connection closed by localRank 5
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 6
SH-IDC1-10-140-24-95:214040:214762 [6] NCCL INFO [Service thread] Connection closed by localRank 6
SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 7
SH-IDC1-10-140-24-95:214042:214774 [7] NCCL INFO [Service thread] Connection closed by localRank 7
SH-IDC1-10-140-24-95:214035:214179 [0] NCCL INFO comm 0x93a8c00 rank 12 nranks 16 cudaDev 4 busId a3000 - Abort COMPLETE
[rank12]:[E726 10:59:44.284494485 ProcessGroupNCCL.cpp:621] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank12]:[E726 10:59:44.284505756 ProcessGroupNCCL.cpp:627] [Rank 12] To avoid data inconsistency, we are taking the entire process down.
[rank12]:[E726 10:59:44.284578655 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 12] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<5776> with status=5 opcode=129 len=4 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0b05001f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f0b062ee7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f0b062eea3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f0b062f5923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0b062f7d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f0b689c5bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f0b77c5ddd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f0b7727dead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214038:214193 [0] NCCL INFO comm 0xaa13c80 rank 13 nranks 16 cudaDev 5 busId a8000 - Abort COMPLETE
[rank13]:[E726 10:59:44.290108228 ProcessGroupNCCL.cpp:621] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank13]:[E726 10:59:44.290121924 ProcessGroupNCCL.cpp:627] [Rank 13] To avoid data inconsistency, we are taking the entire process down.
[rank13]:[E726 10:59:44.290391618 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 13] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<47174> with status=5 opcode=129 len=5 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7832bd1f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f7833ebe7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f7833ebea3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f7833ec5923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7833ec7d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f7896595bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f78a582ddd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f78a4e4dead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214040:214204 [0] NCCL INFO comm 0x9a22940 rank 14 nranks 16 cudaDev 6 busId e1000 - Abort COMPLETE
[rank14]:[E726 10:59:44.293817263 ProcessGroupNCCL.cpp:621] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank14]:[E726 10:59:44.293827222 ProcessGroupNCCL.cpp:627] [Rank 14] To avoid data inconsistency, we are taking the entire process down.
[rank14]:[E726 10:59:44.293894821 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 14] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<37762> with status=5 opcode=129 len=6 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2019fe6f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f201b2d37f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f201b2d3a3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f201b2da923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f201b2dcd2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f207d9aabf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f208cc42dd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f208c262ead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214042:214187 [0] NCCL INFO comm 0xaf1b6c0 rank 15 nranks 16 cudaDev 7 busId e7000 - Abort COMPLETE
[rank15]:[E726 10:59:44.300514320 ProcessGroupNCCL.cpp:621] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank15]:[E726 10:59:44.300529779 ProcessGroupNCCL.cpp:627] [Rank 15] To avoid data inconsistency, we are taking the entire process down.
[rank15]:[E726 10:59:44.300604481 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 15] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<61176> with status=5 opcode=129 len=7 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3a48ee4f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f3a4a1d17f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f3a4a1d1a3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f3a4a1d8923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3a4a1dad2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f3aac8a8bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f3abbb40dd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f3abb160ead in /lib64/libc.so.6)

SH-IDC1-10-140-24-95:214031:214765 [0] NCCL INFO [Service thread] Connection closed by localRank 3
SH-IDC1-10-140-24-95:214034:214764 [3] NCCL INFO [Service thread] Connection closed by localRank 3
SH-IDC1-10-140-24-95:214034:214181 [0] NCCL INFO comm 0xdd87e550 rank 11 nranks 16 cudaDev 3 busId 6a000 - Abort COMPLETE
[rank11]:[E726 10:59:44.365679787 ProcessGroupNCCL.cpp:621] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank11]:[E726 10:59:44.365693163 ProcessGroupNCCL.cpp:627] [Rank 11] To avoid data inconsistency, we are taking the entire process down.
[rank11]:[E726 10:59:44.365753788 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 11] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.140.24.106<37760> with status=5 opcode=129 len=3 vendor err 249 (Flush)
Exception raised from checkForNCCLErrorsInternal at /opt/conda/conda-bld/pytorch_1720538622298/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f84811b2f86 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f848249f7f0 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f848249fa3c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f84824a6923 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f84824a8d2c in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f84e4b76bf4 in /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #6: <unknown function> + 0x7dd5 (0x7f84f3e0edd5 in /lib64/libpthread.so.0)
frame #7: clone + 0x6d (0x7f84f342eead in /lib64/libc.so.6)

W0726 10:59:45.447573 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214031 closing signal SIGTERM
W0726 10:59:45.448162 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214032 closing signal SIGTERM
W0726 10:59:45.504283 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214034 closing signal SIGTERM
W0726 10:59:45.509025 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214035 closing signal SIGTERM
W0726 10:59:45.542702 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214038 closing signal SIGTERM
W0726 10:59:45.546265 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214040 closing signal SIGTERM
W0726 10:59:45.588805 139901143013184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 214042 closing signal SIGTERM
E0726 10:59:49.951559 139901143013184 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 2 (pid: 214033) of binary: /mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/python
Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train_wrapper.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_10:59:45
  host      : SH-IDC1-10-140-24-95
  rank      : 10 (local_rank: 2)
  exitcode  : -6 (pid: 214033)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 214033
=======================================================
srun: error: SH-IDC1-10-140-24-95: task 0: Exited with exit code 1
W0726 10:59:50.786825 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147865 closing signal SIGTERM
W0726 10:59:50.787296 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147866 closing signal SIGTERM
W0726 10:59:50.842426 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147867 closing signal SIGTERM
W0726 10:59:50.868046 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147868 closing signal SIGTERM
W0726 10:59:50.893290 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147869 closing signal SIGTERM
W0726 10:59:50.923330 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147872 closing signal SIGTERM
W0726 10:59:50.944047 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147874 closing signal SIGTERM
W0726 10:59:50.962826 140538811684672 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 147876 closing signal SIGTERM
W0726 10:59:52.793814 140534967199488 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1267] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0726 10:59:56.810442 140538811684672 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W0726 10:59:56.811984 140538811684672 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'SH-IDC1-10-140-24-106_147766_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 114, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 867, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1189, in num_nodes_waiting
    self._state_holder.sync()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 428, in sync
    get_response = self._backend.get_state()
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 74, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/mnt/hwfile/3dv/yhh/miniconda3/envs/pcdet-torch2.4.0-cuda11.8-py3.8/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 116, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: SH-IDC1-10-140-24-106: task 1: Exited with exit code 1
huggingface / accelerate

Error when saving ckpt on slurm #2960

System Info

Information

Tasks

Reproduction

Expected behavior