torchmd / torchmd-net

Training neural network potentials

Training crashes after 50 epochs #290

Open peastman opened 9 months ago

peastman commented 9 months ago

My training runs always crash after exactly 50 epochs. The log shows many repetitions of this error:

Exception in thread Thread-104 (_pin_memory_loop):
Traceback (most recent call last):
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
           ^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata

and then it finally exits with this error:

  File "/home/peastman/miniconda3/envs/torchmd-net2/bin/torchmd-train", line 33, in <module>
    sys.exit(load_entry_point('torchmd-net', 'console_scripts', 'torchmd-train')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/workspace/torchmd-net/torchmdnet/scripts/train.py", line 220, in main
    trainer.fit(model, data, ckpt_path=None if args.reset_trainer else args.load_model)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
                   ^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1127667, OpType=ALLREDUCE, NumelIn=288321, NumelOut=288321, Timeout(ms)=1800000) ran for 1800800 milliseconds before timing out.

Any idea what could be causing it?

RaulPPelaez commented 9 months ago

Some users have reported similar behavior, so I added this workaround to the README:

Some CUDA systems might hang during multi-GPU parallel training. Try export NCCL_P2P_DISABLE=1, which disables direct peer-to-peer GPU communication.

Could it be the root of your issue too? I am assuming this is a multi-GPU training run.
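If it is easier to test without touching your shell environment, the same workaround can be applied at the very top of the training entry point, before anything initializes NCCL. A minimal sketch, assuming only that the variable is set before any process groups are created (the placement is the important part, not the exact file):

```python
import os

# Same effect as `export NCCL_P2P_DISABLE=1` in the shell: disables
# direct peer-to-peer GPU communication. NCCL reads this when process
# groups are created, so it must be set before torch/distributed setup.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch  # imported only after the environment variable is set
```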

I do not remember the error being as consistent as you describe (always exactly 50 epochs), so it might be unrelated. On the other hand, the error points to pinned memory, which makes me think of this: https://github.com/torchmd/torchmd-net/blob/166b7db8661696f01c4adeb0cb02313c236061a2/torchmdnet/data.py#L132-L139

It would be great if you could try persistent_workers=False and pin_memory=False (separately) and report back.
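For reference, these are standard torch.utils.data.DataLoader options. A minimal sketch of the kind of construction linked above, marking the two flags to flip one at a time; the dataset and argument values are illustrative, not the actual code in torchmdnet/data.py:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real dataset used by torchmd-net.
dataset = TensorDataset(torch.randn(1024, 8))

loader = DataLoader(
    dataset,
    batch_size=32,            # illustrative value
    num_workers=4,            # persistent_workers requires num_workers > 0
    pin_memory=True,          # experiment 1: set to False
    persistent_workers=True,  # experiment 2: set to False (separately)
)
```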

peastman commented 9 months ago

Thanks! Yes, this is with multiple GPUs. I just started a run with persistent_workers=False. I'll let you know what happens.

peastman commented 9 months ago

Crossing my fingers, but I think persistent_workers=False fixed it. My latest training run is up to 70 epochs without crashing.