jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

RuntimeError: DataLoader worker (pid 31433) is killed by signal: Aborted. #163

Open JunhaoHuang0615 opened 11 months ago

JunhaoHuang0615 commented 11 months ago

When I was training, I got this error, but the training did not stop. Any ideas?

I'm afraid it may have some impact on my model training.

Thanks a lot! Traceback:

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f4ed3cfa430>
Traceback (most recent call last):
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/kenhuang/.conda/envs/vits/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 31433) is killed by signal: Aborted.

INFO:feng_pai:Saving model and optimizer state at iteration 1 to ../drive/MyDrive/feng_pai/G_0.pth
INFO:feng_pai:Saving model and optimizer state at iteration 1 to ../drive/MyDrive/feng_pai/D_0.pth
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:feng_pai:Train Epoch: 1 [32%]
INFO:feng_pai:[3.3805906772613525, 1.5480570793151855, 1.6330924034118652, 16.685731887817383, 1.5729765892028809, 0.7348751425743103, 200, 0.0002]

heesuju commented 11 months ago

I got the exact same error when training with a batch size of 32 on torch 2.0.1. On other forums, the popular solution was to change the "num_workers" value when initializing the DataLoader (roughly as in the sketch below). I have a feeling this is not the real fix, however: num_workers is set to 8 by default, and lowering it will only make training slower, not to mention that VITS is only using about 1% of my shared memory during training. Does anyone know whether this error affects training, or why it happens at all?
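For anyone who wants to try that workaround anyway, here is a minimal, self-contained sketch of lowering num_workers when building a PyTorch DataLoader. The dataset and values are stand-ins for illustration, not the repo's actual text/audio dataset or training configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in this repo it would be the text/audio training dataset.
dataset = TensorDataset(torch.randn(128, 80))

# The commonly suggested workaround: lower num_workers. Setting it to 0 keeps
# all loading in the main process, which is the quickest way to check whether
# the worker processes are the source of the abort.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,   # lowered from the default of 8 mentioned above
    pin_memory=True,
)

for (batch,) in loader:
    pass  # training step would go here
```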

heesuju commented 11 months ago

I checked dataloader.py and it seems like _MultiProcessingDataLoaderIter.__del__ is being called when DataLoader iterators are created within a subprocess. From what I can understand, when a process ends it calls __del__ on the iterator. If the DataLoader iterator is still referenced in a function, the DataLoader workers are not notified before they are killed, which causes this error (see the sketch after this comment). Since the problem occurs while memory is being freed after a process ends, it does not seem like it will affect training. I've also checked the result after 3000 epochs, and the model has had no problems during inference so far. However, this is just a guess based on my experience and surface-level knowledge; it would be great if anyone could actually confirm it.
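To make that guess concrete, here is a minimal, self-contained sketch of the cleanup pattern being described: dropping the multiprocessing iterator explicitly while the process is still running, so _MultiProcessingDataLoaderIter.__del__ can join its workers normally instead of firing during interpreter shutdown. The dataset and loop are stand-ins, not this repo's training code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def run_one_epoch(loader):
    # Create the multiprocessing iterator explicitly so its lifetime is under
    # our control rather than left to garbage collection at process exit.
    it = iter(loader)
    try:
        for batch in it:
            pass  # training step would go here
    finally:
        # Drop the only reference while the process is still healthy, so the
        # iterator's __del__ shuts the workers down cleanly instead of running
        # during interpreter teardown and racing the already-dying workers.
        del it

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 10))
    loader = DataLoader(dataset, batch_size=8, num_workers=2)
    run_one_epoch(loader)
```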