crack detection: runtime errors with train_model()

DogmaF commented 4 years ago

RE the Jupyter file for the crack detection project: I'm getting runtime errors at cell [34] when I try to train the model. It seems to have something to do with signal handling. The last item in the traceback is: RuntimeError: DataLoader worker (pid 83316) is killed by signal: Unknown signal: 0.

To simplify debugging, I tried running it with zero epochs. Here are the error statements generated when I do that.

(I also found that I needed to add an import torchsummary line and move %matplotlib inline to the top of the import list to overcome other errors.) This is on a Mac (macOS 10.15.4) with Python 3.7.6 and PyTorch 1.4.0.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-51af14cba900> in <module>
      1 base_model = train_model(resnet50, criterion, optimizer, exp_lr_scheduler, num_epochs=0)
----> 2 visualize_model(base_model)
      3 plt.show()

<ipython-input-25-8be992550be9> in visualize_model(model, num_images)
      6 
      7     with torch.no_grad():
----> 8         for i, (inputs, labels) in enumerate(dataloaders['val']):
      9             inputs = inputs.to(device)
     10             labels = labels.to(device)

~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __iter__(self)
    277             return _SingleProcessDataLoaderIter(self)
    278         else:
--> 279             return _MultiProcessingDataLoaderIter(self)
    280 
    281     @property

~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
    744         # prime the prefetch loop
    745         for _ in range(2 * self._num_workers):
--> 746             self._try_put_index()
    747 
    748     def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):

~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_put_index(self)
    870             return
    871 
--> 872         self._index_queues[worker_queue_idx].put((self._send_idx, index))
    873         self._task_info[self._send_idx] = (worker_queue_idx,)
    874         self._tasks_outstanding += 1

~/opt/anaconda3/lib/python3.7/multiprocessing/queues.py in put(self, obj, block, timeout)
     85         with self._notempty:
     86             if self._thread is None:
---> 87                 self._start_thread()
     88             self._buffer.append(obj)
     89             self._notempty.notify()

~/opt/anaconda3/lib/python3.7/multiprocessing/queues.py in _start_thread(self)
    157 
    158         # Start thread which transfers data from buffer to pipe
--> 159         self._buffer.clear()
    160         self._thread = threading.Thread(
    161             target=Queue._feed,

~/opt/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
     64         # This following call uses `waitid` with WNOHANG from C side. Therefore,
     65         # Python can still get and update the process status successfully.
---> 66         _error_if_any_worker_fails()
     67         if previous_handler is not None:
     68             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 83316) is killed by signal: Unknown signal: 0.

priya-dwivedi commented 4 years ago

The "DataLoader worker (pid ...) is killed by signal" error is a catch-all: it only tells you that the code is not running, not why. Sorry, I know that doesn't help. From my previous experience, I strongly suspect a dependency issue. Try setting up a fresh environment with the latest PyTorch version and running the notebook again.
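Since the suggestion is to rule out a dependency problem, it may help to record which versions are installed before and after rebuilding the environment. The following is only a minimal sketch, not something from the notebook: the helper name report_versions and the package list are assumptions chosen for illustration, and on Python 3.7 (as in the original report) it relies on the importlib_metadata backport being installed.

```python
import platform
import sys

try:
    # Standard library on Python 3.8+
    from importlib import metadata
except ImportError:
    # Assumption: on Python 3.7 the `importlib_metadata` backport is installed
    import importlib_metadata as metadata


def report_versions(packages=("torch", "torchvision", "torchsummary", "matplotlib")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. The default package list
    is a guess at what the notebook needs, not taken from the repository.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print interpreter and OS details alongside the package versions
    print("python      ", sys.version.split()[0])
    print("platform    ", platform.platform())
    for name, version in report_versions().items():
        print(f"{name:<12} {version or 'not installed'}")
```

Running this in both the old and the new environment and comparing the output gives a quick way to see which dependency actually changed.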

(Email reply to Mike Fuller's original post of Wed, Apr 1, 2020, 5:11 PM.)

DogmaF commented 4 years ago

Okay, thanks for the quick response! I will give it a try.

priya-dwivedi / Deep-Learning

crack detection: runtime errors with train_model() #104