junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

Bad file descriptor after 1 or 2 epochs on Ubuntu 22.04, CUDA 11.7, Python 3.9. What should I do? #1563

Open fsqvictor opened 1 year ago

fsqvictor commented 1 year ago

Training fails after one or two epochs with one of two errors. Sometimes it is the first one below, sometimes the second.

(1)
Traceback (most recent call last):
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 147, in _serve
    close()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/connection.py", line 263, in __exit__
    self.close()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
Traceback (most recent call last):
  File "/home/hllgroup/DeepLearning/pytorch-CycleGAN-and-pix2pix/train.py", line 44, in <module>
    for i, data in enumerate(dataset):  # inner loop within one epoch
  File "/home/hllgroup/DeepLearning/pytorch-CycleGAN-and-pix2pix/data/__init__.py", line 90, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 312, in rebuild_storage_fd
    storage = cls._new_shared_fd_cpu(fd, size)
RuntimeError: unable to resize file to the right size: Invalid argument (22)

(2)
Traceback (most recent call last):
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 147, in _serve
    close()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor
Traceback (most recent call last):
  File "/home/hllgroup/DeepLearning/pytorch-CycleGAN-and-pix2pix/train.py", line 44, in <module>
    for i, data in enumerate(dataset):  # inner loop within one epoch
  File "/home/hllgroup/DeepLearning/pytorch-CycleGAN-and-pix2pix/data/__init__.py", line 90, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/hllgroup/anaconda3/envs/pixandcycgan/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError

junyanz commented 1 year ago

It might be related to data loading.

If you use our datasets, have you downloaded the data? If you use your own datasets, you may want to check whether the data files are corrupt or not.
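
One way to check a custom dataset for corrupt files is to walk the data folder and try to open every image. The snippet below is only a rough sketch, not part of this repository: the names `find_unreadable_files` and `pil_can_decode` and the extension list are made up for illustration, and it assumes Pillow is installed (it is imported only when the default check runs).

```python
import os


def pil_can_decode(path):
    """Raise an exception if Pillow cannot parse the image at `path`."""
    # Lazy import: Pillow is an assumption here, not a requirement of the walker below.
    from PIL import Image
    with Image.open(path) as img:
        img.verify()  # integrity check of the file structure, no full pixel decode


def find_unreadable_files(root, check=pil_can_decode,
                          extensions=(".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")):
    """Walk `root` and return (path, error message) pairs for files that fail `check`."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip files that are not images
            path = os.path.join(dirpath, name)
            try:
                check(path)
            except Exception as exc:  # any failure means the file deserves a closer look
                bad.append((path, f"{type(exc).__name__}: {exc}"))
    return bad


if __name__ == "__main__":
    # "./datasets/my_data" is a placeholder path; point it at your own dataroot.
    for path, error in find_unreadable_files("./datasets/my_data"):
        print(path, "->", error)
```

Note that `verify()` may not catch every truncated file, so an empty result does not prove that the data is clean. Passing a stricter `check` function (for example one that fully loads each image) is a possible next step.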

fsqvictor commented 1 year ago

Thanks. Your maps dataset runs smoothly. I'm not sure whether my data files were corrupted when I transferred them. However, I found that the same data runs on Windows 10/11. Are there any differences between Ubuntu and Windows?

junyanz commented 1 year ago

Not sure. We only tested the system on Ubuntu.
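
One platform difference that may be relevant, offered as a guess and not a confirmed diagnosis: both tracebacks fail in `rebuild_storage_fd`, which is the path PyTorch uses on Linux when DataLoader workers hand tensors to the main process by passing file descriptors (the `file_descriptor` sharing strategy). As far as I know, that strategy is not what Windows uses, which might explain why the same data trains there. The sketch below prints the process's open-file limit and switches to the `file_system` strategy when PyTorch is installed. The helper names `describe_fd_limit` and `use_file_system_sharing` are hypothetical, and the `resource` module is Unix-only.

```python
import resource  # standard library, available on Unix-like systems only


def describe_fd_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def use_file_system_sharing():
    """Switch PyTorch's tensor sharing strategy to 'file_system' if possible.

    Returns the active strategy name, or None when torch is not installed.
    """
    try:
        import torch.multiprocessing as mp
    except ImportError:
        return None
    if "file_system" in mp.get_all_sharing_strategies():
        mp.set_sharing_strategy("file_system")
    return mp.get_sharing_strategy()


if __name__ == "__main__":
    soft, hard = describe_fd_limit()
    print(f"open-file limit: soft={soft}, hard={hard}")
    print("sharing strategy:", use_file_system_sharing())
```

If this turns out to be the cause, the strategy would need to be set near the top of `train.py`, before the data loader is created. Loading data in the main process (I believe the option in this repository is `--num_threads 0`) is another way to test whether the worker processes are involved.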