noahzn / Lite-Mono

[CVPR2023] Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
MIT License

OSError: [Errno 24] Too many open files #59

Closed: vittoria310 closed this issue 1 year ago

vittoria310 commented 1 year ago

Hi noahzn, I am currently encountering the following error during training: "OSError: [Errno 24] Too many open files". Could you kindly provide some guidance on how to resolve this issue?

noahzn commented 1 year ago

Hi, can you copy the complete error log?

vittoria310 commented 1 year ago

Traceback (most recent call last):
  File "/data/env/lite2/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/data/env/lite2/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/data/env/lite2/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 359, in reduce_storage
  File "/data/env/lite2/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
  File "/data/env/lite2/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
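Errno 24 (EMFILE) means the process has reached its per-process limit on open file descriptors. The traceback fails inside torch's `reduce_storage` and `DupFd`, where the file descriptor behind a shared tensor storage is duplicated so it can be passed to another process. A minimal sketch for comparing current descriptor usage with that limit; `fd_usage` is a hypothetical helper, and it assumes a Unix-like system (Linux `/proc/self/fd` or macOS/BSD `/dev/fd`).

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    Hypothetical helper; assumes a Unix-like OS where open descriptors are
    listed under /proc/self/fd (Linux) or /dev/fd (macOS/BSD).
    """
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # Listing the directory briefly opens one extra descriptor,
    # so the count is approximate.
    open_fds = len(os.listdir(fd_dir))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open fds: {used}, soft limit: {soft}, hard limit: {hard}")
```

Calling it periodically (for example once per epoch) can show whether the descriptor count climbs toward the soft limit.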

noahzn commented 1 year ago

Please set num_workers to a smaller value.

vittoria310 commented 1 year ago

I tried decreasing num_workers and reducing the batch size, and I also raised the open-file limit with ulimit -n, but then I encountered the following error:

Traceback (most recent call last):
  File "train.py", line 13, in <module>
    trainer.train()
  File "/data/project/Lite-Mono-main/trainer.py", line 219, in train
    self.run_epoch()
  File "/data/project/Lite-Mono-main/trainer.py", line 234, in run_epoch
    for batch_idx, inputs in enumerate(self.train_loader):
  File "/home/data/env/lite2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/data/env/lite2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1348, in _next_data
    self._shutdown_workers()
  File "/home/data/env/lite2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1455, in _shutdown_workers
    self._worker_result_queue.put((None, None))
  File "/home/data/env/lite2/lib/python3.8/multiprocessing/queues.py", line 88, in put
    self._start_thread()
  File "/home/data/env/lite2/lib/python3.8/multiprocessing/queues.py", line 173, in _start_thread
    self._thread.start()
  File "/home/data/env/lite2/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
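`ulimit -n` raises the limit on open file descriptors (RLIMIT_NOFILE); it does not change how many threads a user may create. The second traceback fails in `threading.py` with `can't start new thread`, which on Linux is commonly tied to RLIMIT_NPROC (a per-user limit that counts threads) or to memory pressure rather than to the open-file limit. A minimal sketch, assuming a Unix-like system; `raise_nofile_limit` and `thread_limit` are hypothetical helpers, not part of Lite-Mono or PyTorch.

```python
import resource


def raise_nofile_limit(target):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper mirroring `ulimit -n` for the current process only:
    the soft limit can be raised up to the hard limit without root.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard  # already high enough; never lower it
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def thread_limit():
    """Return the soft RLIMIT_NPROC value, or None where it is unavailable."""
    # On Linux, RLIMIT_NPROC counts threads as well as processes per user.
    if not hasattr(resource, "RLIMIT_NPROC"):
        return None
    return resource.getrlimit(resource.RLIMIT_NPROC)[0]
```

Resource limits are inherited by child processes, so a call like `raise_nofile_limit` would have to run before the DataLoader starts its workers.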

noahzn commented 1 year ago

Can you try setting num_workers to 0? It seems that your machine doesn't have enough CPU resources.
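With `num_workers=0`, PyTorch's `DataLoader` loads batches in the main process, so no worker processes, worker feeder threads, or cross-process file-descriptor sharing are involved, at the cost of slower data loading. A minimal sketch of choosing a worker count from the CPUs actually available; `pick_num_workers` is hypothetical, and the "leave one CPU for the training loop" rule is an assumption, not advice from this thread.

```python
import os


def pick_num_workers(requested):
    """Cap a requested DataLoader worker count by the CPUs this process may use.

    Hypothetical heuristic: returns 0 (load in the main process) when only
    one CPU is available, otherwise leaves one CPU for the training loop.
    """
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity masks such as those set by taskset or cpusets.
        cpus = len(os.sched_getaffinity(0))
    else:
        cpus = os.cpu_count() or 1
    if cpus <= 1:
        return 0
    return max(0, min(requested, cpus - 1))


if __name__ == "__main__":
    print(pick_num_workers(12))
    # e.g. DataLoader(dataset, batch_size=..., num_workers=pick_num_workers(12))
```

The result would be passed wherever the training options set `num_workers`; falling back to 0 trades loading speed for the smallest resource footprint.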

vittoria310 commented 1 year ago

Thank you for your response. I will switch to another GPU for training.