nyu-systems / Grendel-GS

Ongoing research on training Gaussian splatting at scale with a distributed system
Apache License 2.0

[multigpu training] BrokenPipeError #29

Closed. EhabWilson closed this issue 1 month ago

EhabWilson commented 1 month ago

I am training my model on a single node with 8 A100 GPUs, using the script below:

srun --nodes=1 \
--job-name=grendel_8g \
--gres=gpu:8 \
--kill-on-bad-exit=1 \
torchrun --standalone --nnodes=1 --nproc-per-node=8 \
train.py \
        -s data \
        --images images \
        --llffhold 8 \
        --iterations 30000 \
        --log_interval 10000 \
        --model_path output/8g_8b \
        --bsz 8 \
        --test_iterations 30000 \
        --save_iterations 30000 \
        --backend gsplat --eval

Then the following exception occurred:

multiprocessing.pool.MaybeEncodingError: Error sending result: 'tensor([[[ 99, 102, 109,  ..., 118, 106,  93],
         [ 94, 100, 105,  ..., 100,  98,  97],
         [ 83,  92, 102,  ..., 107, 111, 114],
         ...,
         [ 68,  69,  68,  ...,  96, 104, 115],
         [ 65,  64,  59,  ...,  96,  95,  96],
         [ 66,  64,  60,  ...,  96,  91,  86]],

        [[ 99, 102, 109,  ..., 119, 110,  98],
         [ 94, 100, 105,  ..., 104, 102, 102],
         [ 86,  95, 102,  ..., 112, 116, 119],
         ...,
         [ 71,  72,  71,  ...,  98, 106, 117],
         [ 68,  67,  62,  ...,  98,  97,  98],
         [ 69,  67,  63,  ...,  98,  93,  88]],

        [[ 71,  74,  81,  ..., 123, 113, 102],
         [ 66,  72,  77,  ..., 107, 105, 106],
         [ 57,  66,  74,  ..., 115, 119, 123],
         ...,
         [ 76,  77,  76,  ...,  93, 101, 112],
         [ 73,  72,  67,  ...,  93,  92,  93],
         [ 74,  72,  68,  ...,  93,  88,  83]]], dtype=torch.uint8)'. Reason: 'RuntimeError('unable to write to file </torch_444535_996896012_715>: No space left on device (28)')'

followed by this traceback:

Traceback (most recent call last):
  File "/mnt/petrelfs/zhaohang.p/anaconda3/envs/gaussian_splatting/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/mnt/petrelfs/zhaohang.p/anaconda3/envs/gaussian_splatting/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/mnt/petrelfs/zhaohang.p/anaconda3/envs/gaussian_splatting/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/mnt/petrelfs/zhaohang.p/anaconda3/envs/gaussian_splatting/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/mnt/petrelfs/zhaohang.p/anaconda3/envs/gaussian_splatting/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

TarzanZhao commented 1 month ago

Hi, I ran into a similar issue before. Try setting --multiprocesses_image_loading to False, which solved the problem for me: https://github.com/nyu-systems/Grendel-GS/blob/e5fea1e926134918849607509a833dc20828b686/arguments/__init__.py#L170
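
A possible explanation, not confirmed in this thread: the file name in the first error (/torch_444535_996896012_715) looks like the shared-memory segments PyTorch creates when a tensor is passed between processes, so "No space left on device" may refer to shared memory (typically /dev/shm on Linux) rather than the disk. That would also explain why loading images without worker processes avoids it. Below is a minimal sketch for checking the free space before launching, assuming a Linux node where shared memory is mounted at /dev/shm. The helper names shm_free_gib and enough_shared_memory are hypothetical and not part of Grendel-GS.

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def enough_shared_memory(required_gib, path="/dev/shm"):
    """Return True if at least `required_gib` GiB are free under `path`."""
    return shm_free_gib(path) >= required_gib


if __name__ == "__main__":
    # /dev/shm is an assumption about the node; skip the report if it is absent.
    if os.path.isdir("/dev/shm"):
        print(f"/dev/shm free: {shm_free_gib():.2f} GiB")
```

If the reported free space is small compared with the total size of the decoded training images, shared memory is a plausible culprit, and either disabling multiprocess image loading or asking the cluster admins for a larger /dev/shm may help.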

EhabWilson commented 1 month ago

@TarzanZhao Thanks a lot! That really helped.