ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

RuntimeError: Unable to meet other process at the rendezvous store #45650

Open chenyue-max opened 3 months ago

chenyue-max commented 3 months ago

What happened + What you expected to happen

The problem occurs as follows:

[Screenshot 2024-05-31 160931: error traceback]

The first training run of our model completes normally. On the second run, GPU utilization sits at 100% and training cannot proceed.


How should the code be written to avoid this?
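
One common cause of "Unable to meet other process at the rendezvous store" is stale collective-group state left over from a previous run. A minimal sketch of a mitigation, assuming `ray.util.collective`'s `init_collective_group` / `destroy_collective_group` API: guarantee teardown after each training run so the next run performs a fresh rendezvous. The names `run_one_training_pass`, `world_size`, and `rank` below are hypothetical placeholders, not part of the original code.

```python
from contextlib import contextmanager

@contextmanager
def collective_group(init_fn, destroy_fn):
    # Initialize the group, and guarantee teardown even if training raises,
    # so the next run starts from a clean rendezvous.
    init_fn()
    try:
        yield
    finally:
        destroy_fn()

# Intended usage with ray.util.collective (a sketch, not verified here):
#
#   import ray.util.collective as col
#   with collective_group(
#           lambda: col.init_collective_group(world_size, rank,
#                                             backend="nccl",
#                                             group_name="default"),
#           lambda: col.destroy_collective_group("default")):
#       run_one_training_pass()  # hypothetical training entry point
```

Wrapping the run this way ensures the destroy call happens even when the first training pass dies mid-way, which is exactly the situation that can leave the rendezvous store unusable for the second run.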

Versions / Dependencies

jax: 0.4.7
jaxlib: 0.4.7
numpy: 1.23.0
python: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0]
jax.devices (8 total, 8 local): [StreamExecutorGpuDevice(id=0, process_index=0, slice_index=0) StreamExecutorGpuDevice(id=1, process_index=0, slice_index=0) ... StreamExecutorGpuDevice(id=6, process_index=0, slice_index=0) StreamExecutorGpuDevice(id=7, process_index=0, slice_index=0)]
process_count: 1

Reproduction script

The code is as follows (note: `np.bool` was removed from NumPy; `np.bool_` is used here, matching do_recv):

def do_send(self, micro_batch_id, output_vars, dst_rank, group_name='default'):
    dst_gpu_idx = 0
    for var in output_vars:
        with cupy.cuda.Device(0):
            send_buffer = self.buffers[micro_batch_id][var]
            # NCCL has no bool dtype, so stage bool tensors as int32
            if var.aval.dtype == np.bool_:
                send_buffer = send_buffer.astype(np.int32)
            send_buffer = cupy.array(send_buffer)
        col.send_multigpu(send_buffer, dst_rank, dst_gpu_idx, group_name)
        cupy.cuda.Device(0).synchronize()

def do_recv(self, micro_batch_id, input_vars, src_rank, group_name='default'):
    src_gpu_idx = 0
    for var in input_vars:
        with cupy.cuda.Device(0):
            # bool tensors were sent as int32, so receive into an int32 buffer
            if var.aval.dtype == np.bool_:
                recv_buffer = cupy.zeros(var.aval.shape, dtype=np.int32)
            else:
                recv_buffer = cupy.zeros(var.aval.shape, dtype=var.aval.dtype)
        col.recv_multigpu(recv_buffer, src_rank, src_gpu_idx, group_name)
        cupy.cuda.Device(0).synchronize()
        recv_buffer = recv_buffer.get()
        # cast back to bool after the transfer
        if var.aval.dtype == np.bool_:
            recv_buffer = recv_buffer.astype(np.bool_)
        val = jax.device_put(recv_buffer)
        if var in self.buffers[-1]:
            self.buffers[-1][var] = val
        else:
            self.buffers[micro_batch_id][var] = val
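
The bool handling in the send/recv pair above can be isolated and checked on its own: NCCL has no bool dtype, so bool tensors are staged as int32 on the wire and cast back after receiving. A minimal NumPy-only sketch of that roundtrip (the helper names `stage_for_send` / `restore_after_recv` are illustrative, not from the original code):

```python
import numpy as np

def stage_for_send(arr):
    # Bool tensors cannot go over NCCL directly; widen them to int32.
    return arr.astype(np.int32) if arr.dtype == np.bool_ else arr

def restore_after_recv(arr, original_dtype):
    # Undo the widening on the receiving side.
    return arr.astype(np.bool_) if original_dtype == np.bool_ else arr

mask = np.array([True, False, True])
wire = stage_for_send(mask)                    # int32 representation on the wire
back = restore_after_recv(wire, mask.dtype)
assert back.dtype == np.bool_ and (back == mask).all()
```

Keeping both casts keyed off the same `var.aval.dtype` check, as the code above does, is what guarantees the sender's widening and the receiver's narrowing stay in sync.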

Issue Severity

None

jjyao commented 3 months ago

@chenyue-max we currently don't have the resources to maintain this utility, so it might not work in some cases.