junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

OSError: Caught OSError in DataLoader worker process 3 #1471

Open FlorianRegisBamb opened 2 years ago

FlorianRegisBamb commented 2 years ago

Hello, I was training my model and it was working until epoch 148, when I got these errors: <<OSError: Caught OSError in DataLoader worker process 3>> and <<OSError: [Errno 5] Input/output error>>. I'm training the model on a Linux VM.

learning rate 0.0001050 -> 0.0001030
(epoch: 148, iters: 50, time: 5.328, data: 0.004) G_GAN: 1.660 G_L1: 21.545 D_real: 0.006 D_fake: 0.244 G: 23.206 D: 0.125
saving the latest model (epoch 148, total_iters 60000)
(epoch: 148, iters: 150, time: 1.322, data: 0.003) G_GAN: 1.076 G_L1: 34.955 D_real: 0.000 D_fake: 0.642 G: 36.031 D: 0.321
(epoch: 148, iters: 250, time: 1.316, data: 0.004) G_GAN: 2.841 G_L1: 17.667 D_real: 0.607 D_fake: 0.061 G: 20.508 D: 0.334
(epoch: 148, iters: 350, time: 1.338, data: 0.004) G_GAN: 1.837 G_L1: 25.288 D_real: 0.050 D_fake: 0.239 G: 27.126 D: 0.144
(epoch: 148, iters: 450, time: 2.624, data: 0.003) G_GAN: 5.915 G_L1: 23.653 D_real: 0.006 D_fake: 0.003 G: 29.568 D: 0.005
(epoch: 148, iters: 550, time: 1.307, data: 0.004) G_GAN: 1.869 G_L1: 35.894 D_real: 0.004 D_fake: 0.292 G: 37.763 D: 0.148
(epoch: 148, iters: 650, time: 1.308, data: 0.003) G_GAN: 1.511 G_L1: 21.548 D_real: 0.095 D_fake: 0.382 G: 23.059 D: 0.238
(epoch: 148, iters: 750, time: 1.338, data: 0.003) G_GAN: 3.447 G_L1: 22.605 D_real: 0.088 D_fake: 0.038 G: 26.052 D: 0.063
(epoch: 148, iters: 850, time: 2.473, data: 0.004) G_GAN: 3.026 G_L1: 22.714 D_real: 0.017 D_fake: 0.063 G: 25.740 D: 0.040

Traceback (most recent call last):
  File "/home/exxact/Documents/OMEGA/OMEGA_RD_IA/CycleGAN_Pix2Pix/train.py", line 44, in <module>
    for i, data in enumerate(dataset): # inner loop within one epoch
  File "/home/exxact/Documents/OMEGA/OMEGA_RD_IA/CycleGAN_Pix2Pix/data/__init__.py", line 90, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
OSError: Caught OSError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/exxact/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/exxact/Documents/OMEGA/OMEGA_RD_IA/CycleGAN_Pix2Pix/data/aligned_dataset.py", line 45, in __getitem__
    A = AB.crop((0, 0, w2, h))
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1146, in crop
    self.load()
  File "/usr/lib/python3/dist-packages/PIL/ImageFile.py", line 235, in load
    s = read(self.decodermaxblock)
  File "/usr/lib/python3/dist-packages/PIL/JpegImagePlugin.py", line 402, in load_read
    s = self.fp.read(read_bytes)
OSError: [Errno 5] Input/output error

Could you help me understand where this comes from?

junyanz commented 2 years ago

It's hard to know. Maybe this image is corrupt, or this image is too small (smaller than your crop size). Which --preprocess flag did you use?
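One way to check both possibilities offline is to decode every file once and compare its size against a threshold. The sketch below is only an illustration: `find_bad_images`, `pil_read_size`, `min_w`, and `min_h` are hypothetical names that are not part of this repository, and it assumes Pillow is installed (the traceback shows the images are read through PIL).

```python
import os


def pil_read_size(path):
    """Fully decode an image with Pillow and return (width, height)."""
    from PIL import Image  # imported lazily so the scan logic can be reused without Pillow

    with Image.open(path) as img:
        img.load()  # force a full decode; Image.open alone reads lazily
        return img.size


def find_bad_images(root, min_w, min_h, read_size=pil_read_size):
    """Walk `root` and return (path, reason) for unreadable or undersized files.

    `min_w` and `min_h` are thresholds chosen by the caller (hypothetical
    parameters, not options of the training script).
    """
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                w, h = read_size(path)
            except Exception as err:  # corrupt data, I/O errors, unknown formats
                bad.append((path, f"unreadable: {err}"))
                continue
            if w < min_w or h < min_h:
                bad.append((path, f"too small: {w}x{h}"))
    return bad
```

A call might look like `find_bad_images("path/to/train", 512, 256)`, where the directory and thresholds are placeholders. Note that in an aligned dataset each file holds A and B side by side, so the stored width is twice that of a single image.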

FlorianRegisBamb commented 2 years ago

I am using "resize_and_crop" as my --preprocess flag. What confuses me is that training ran fine until this epoch (148), and the error comes from my input data, which did not change during training. When I resume training with --continue_train, it sometimes works, and other times the same error comes back after a few epochs.
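Since the failure is intermittent, it may help to find out which file is being read when it happens and whether a second attempt succeeds. The following is a minimal sketch under stated assumptions, not code from this repository: `load_with_retry`, `load_fn`, `retries`, and `delay` are hypothetical names, and `load_fn` stands for whatever function opens and decodes one image.

```python
import time


def load_with_retry(load_fn, path, retries=3, delay=0.5):
    """Call load_fn(path), logging and retrying on OSError.

    Raises OSError (chained to the last failure) if every attempt fails.
    """
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return load_fn(path)
        except OSError as err:
            last_err = err
            # Log the path so the offending file can be inspected later.
            print(f"[attempt {attempt}/{retries}] failed to read {path}: {err}")
            if attempt < retries:
                time.sleep(delay)
    raise OSError(f"giving up on {path} after {retries} attempts") from last_err
```

If a retry succeeds on the same file, the file itself is probably intact and the storage layer is the more likely suspect. If the same path fails every time, the file is worth inspecting directly.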

junyanz commented 1 year ago

Sometimes, your cropping might get unlucky. It depends not only on the image sizes but also on where the patches are cropped.
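To illustrate why the crop position can differ from run to run, here is a small sketch that assumes the crop offset is drawn uniformly at random so the window fits inside the resized image. `sample_crop_box` is a hypothetical helper written for this example, not the repository's implementation, and the sizes 286 and 256 are example values.

```python
import random


def sample_crop_box(img_w, img_h, crop_size, rng=random):
    """Return a random (left, upper, right, lower) crop window.

    Offsets are clamped to 0 when the image is smaller than the crop,
    in which case the window extends past the image border.
    """
    max_x = max(0, img_w - crop_size)
    max_y = max(0, img_h - crop_size)
    x = rng.randint(0, max_x)
    y = rng.randint(0, max_y)
    return (x, y, x + crop_size, y + crop_size)


# The same image yields a different window on each draw.
rng = random.Random(0)
for epoch in range(3):
    print(epoch, sample_crop_box(286, 286, 256, rng))
```

Because the window moves between epochs, the same image is cropped at a different position each time it is loaded.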